MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

how to "properly" train new scoring model #80

Open gsaxena888 opened 4 years ago

gsaxena888 commented 4 years ago

I had some questions regarding training a new scoring model:

  1. As I understand it, there's no direct way to specify fragment tolerance, as it's automatically learned during the training phase, correct?
  2. Is it "acceptable" to use msgfplus to create an initial mzid file for later use in training? Or is it "better" to use another search engine (eg X!Tandem with native scoring or MSFragger) to generate the mzid file that ScoringParamGen will use for training? One "advantage" of using another search engine is that I could specify the known fragment tolerance, but maybe there's an alternative acceptable way using only msgfplus? Also, I'm assuming that by using a different search engine to generate the training mzid dataset, one minimizes overfitting?
  3. Even if we use a separate search engine (eg MSFragger) to generate the initial mzid file for training, is it considered "bad practice" (ie misleadingly good results due to overfitting) to then search with msgfplus the SAME ORIGINAL mgf files that MSFragger used when generating the mzid files for msgfplus training?
  4. When supplying the mzid file to msgfplus for training, should one only supply the "extremely good" peptide ids, ie <= 1% FDR? Or, should one supply everything?
  5. When supplying the mzid file to msgfplus for training, should one remove the decoy hits (even if they fell below the 1% FDR threshold)?
  6. In general, is there any doc on how to best use/create the scoring model that might answer or elaborate on some of the above questions? (The only thing I saw was this: https://msgfplus.github.io/msgfplus/ScoringParamGen.html)
  7. After training, does msgfplus have a "fixed PPM tolerance" for fragments, or is it something more complicated, eg is the tolerance perhaps a function of multiple parameters, such as charge, intensity, maybe even RT?

(In case it helps, in all of the cases above, I'm using one of the more recent orbitraps, such as Q-Exactive HFX, Fusion, Lumos etc.; however, the ms2 resolution, and consequently ms2 tolerance, is NOT overly high, eg it's at ~15k or even ~7.5k, so I think the "default" scoring parameter settings that come with msgfplus for "HCD, high res" are probably not correct for those scenarios, right?)

FarmGeek4Life commented 4 years ago

For some answers:

  2. MSGF+ training is designed to use MSGF+ results.
  3. After training MSGF+ with MSGF+ results, you can re-search those files with MSGF+; the bad practice would be using those search results to validate the results of training.
  4. Supply everything, let MSGF+ filter the results.
  5. Do not remove the decoy hits.
  6. There is not another document on creating/using scoring models in MSGF+.
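For reference, the training step discussed in these answers might be scripted roughly as follows. This is a minimal sketch, not a tested recipe: the `-i`, `-d`, `-m`, `-inst`, and `-e` options and the `edu.ucsd.msjava.ui.ScoringParamGen` class name follow the ScoringParamGen usage page linked in the question and should be checked against it; the jar location, the file paths, the default ID values, and the helper name `build_training_command` are hypothetical. The unfiltered MS-GF+ mzid file, decoys included, is passed as is.

```python
from typing import List


def build_training_command(
    mzid_path: str,
    spectrum_dir: str,
    frag_method_id: int = 3,   # assumption: 3 = HCD
    instrument_id: int = 1,    # assumption: 1 = high-res Orbitrap/FTICR
    enzyme_id: int = 1,        # assumption: 1 = trypsin
    jar_path: str = "MSGFPlus.jar",  # hypothetical jar location
    max_heap: str = "4000M",
) -> List[str]:
    """Build (but do not run) a ScoringParamGen command line.

    The mzid file is expected to be an unfiltered MS-GF+ result,
    with decoy hits left in, as recommended in the thread.
    """
    return [
        "java", f"-Xmx{max_heap}", "-cp", jar_path,
        "edu.ucsd.msjava.ui.ScoringParamGen",
        "-i", mzid_path,        # MS-GF+ search result used for training
        "-d", spectrum_dir,     # directory holding the searched spectra
        "-m", str(frag_method_id),
        "-inst", str(instrument_id),
        "-e", str(enzyme_id),
    ]
```

Printing `" ".join(build_training_command("results/run1.mzid", "spectra/"))` gives a command that can be reviewed before it is run, for example with `subprocess.run`.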

We have been selecting "Q-Exactive" for Q-Exactive Plus and Q-Exactive HF, while we have been using "High-res/Orbitrap/FTICR" for Q-Exactive HFX and Fusion Lumos, with no significant issues.

The most common reasons to build new scoring parameters:

alchemistmatt commented 4 years ago

I have never attempted to train a new scoring model for MS-GF+. Bryson outlined above some of the reasons that you might want to do this, but I'm not certain your situation qualifies. I suggest running MS-GF+ with each of the available scoring modes and seeing which works best (i.e., which gives the most high-scoring results). Modes to try are: -inst 1, -inst 2, and -inst 3.
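The comparison suggested here could be set up along these lines. It is a rough sketch under stated assumptions: `-s`, `-d`, `-inst`, `-tda`, and `-o` are standard MS-GF+ search options, but the jar path, the output file names, and both helper names are hypothetical, and the `QValue` column name assumes the results were exported to TSV from a target-decoy search. The 1% threshold mirrors the FDR cutoff mentioned in the original question.

```python
import csv
from typing import Dict, Iterable, List, Sequence


def build_search_commands(
    spectrum_file: str,
    database: str,
    instrument_ids: Sequence[int] = (1, 2, 3),
    jar_path: str = "MSGFPlus.jar",  # hypothetical jar location
) -> Dict[int, List[str]]:
    """Build one MS-GF+ command line per -inst value (nothing is run)."""
    commands = {}
    for inst in instrument_ids:
        commands[inst] = [
            "java", "-Xmx3500M", "-jar", jar_path,
            "-s", spectrum_file,
            "-d", database,
            "-inst", str(inst),
            "-tda", "1",                 # target-decoy search, needed for Q-values
            "-o", f"inst{inst}.mzid",    # hypothetical output name
        ]
    return commands


def count_confident_psms(
    tsv_lines: Iterable[str],
    qvalue_column: str = "QValue",  # assumed column name in the TSV export
    threshold: float = 0.01,
) -> int:
    """Count rows whose Q-value is at or below the threshold."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return sum(1 for row in reader if float(row[qvalue_column]) <= threshold)
```

After running each command and exporting the results to TSV, passing each open file to `count_confident_psms` gives one number per instrument mode to compare.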

gsaxena888 commented 4 years ago

Thank you @FarmGeek4Life and @alchemistmatt. Quick confirmatory question: so the fact that I'm using 15k resolution (and sometimes even 7.5k resolution) for the MS2 fragments is NOT a (strong?) reason to retrain? (I'm also deisotoping and decharging and doing some other things to remove spurious peaks, but it sounds like none of these signal pre-processing steps + lowish MS2 resolution would require retraining?)

If so, how does msgfplus know what ms2 fragment tolerance to use for different researchers' experiments, since different researchers' projects may have vastly different ms2 resolutions/tolerances (eg ranging from say 7.5k to 60k)?

(In case it helps, I'm doing something similar to DIAUmpire -- as in, the true MS data is technically collected using a DIA protocol on a Lumos or QExactive HFX, but like DIA Umpire, I'm creating pseudo mgf files using deconvolution algorithms, and it is this pseudo mgf file that I finally submit to a search engine, such as msgfplus.)

chambm commented 4 years ago

Having pseudo-spectra from deconvolution of DIA definitely sounds like a good reason to at least try training a new model and seeing if it improves significantly over existing ones. Did you end up doing so?

gsaxena888 commented 4 years ago

I never did try the training, as I pursued a different line of research....

