MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

splitting mgf into multiple files #79

Open gsaxena888 opened 4 years ago

gsaxena888 commented 4 years ago

I believe that some search engines, such as X!Tandem, will not give the same results if one splits an mgf into "n" files and searches them separately and then later concatenates the results. (I believe this is because X!Tandem makes use of which proteins are identified by one set of spectra/peptides to help boost/penalize the score of other spectra/peptides.)

However, with MS-GF+, is there a reason why it would be dangerous to split an mgf file into "n" parts, search each of them separately, and then later recombine the results (e.g. by simple concatenation of the tsv files)?

On a different but partly related note, the input mgf or mzML files sometimes specify a precursor (i.e. MS1) tolerance for each spectrum, and this tolerance value differs between spectra. Is there a (good?) way to get MS-GF+ to use the precursor tolerance defined per spectrum (in an mgf file) instead of a single, global precursor tolerance for the whole mgf file?

FarmGeek4Life commented 4 years ago

This could be done; however, it will affect the output 'QValue' and 'PepQValue' scores, because those are calculated from the full set of results. If you are not using those columns, then there should be no difference.
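As a rough illustration of the splitting step discussed above, here is a minimal Python sketch that distributes the spectra of an MGF file round-robin into "n" parts. It assumes each spectrum is delimited by `BEGIN IONS` / `END IONS` lines, and it ignores anything outside those blocks (such as global header parameters), which a real splitter would need to copy into every part. The function names are hypothetical and are not part of MS-GF+.

```python
from typing import Iterable, Iterator, List


def iter_mgf_spectra(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each spectrum as the list of lines from BEGIN IONS to END IONS."""
    block: List[str] = []
    inside = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "BEGIN IONS":
            inside = True
            block = [line]
        elif line.strip() == "END IONS" and inside:
            block.append(line)
            yield block
            inside = False
        elif inside:
            block.append(line)
        # Lines outside a spectrum block (headers, blanks) are skipped here.


def split_spectra(lines: Iterable[str], n: int) -> List[List[List[str]]]:
    """Assign spectra to n parts in round-robin order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parts: List[List[List[str]]] = [[] for _ in range(n)]
    for i, spectrum in enumerate(iter_mgf_spectra(lines)):
        parts[i % n].append(spectrum)
    return parts


# Tiny made-up example with three spectra.
demo_mgf = """BEGIN IONS
TITLE=a
PEPMASS=500.1
100.0 10
END IONS
BEGIN IONS
TITLE=b
PEPMASS=600.2
200.0 20
END IONS
BEGIN IONS
TITLE=c
PEPMASS=700.3
300.0 30
END IONS
"""
demo_parts = split_spectra(demo_mgf.splitlines(), 2)
```

Each part can then be written to its own file by joining the lines of its spectra. As noted above, searching the parts separately changes the QValue and PepQValue columns.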

alchemistmatt commented 4 years ago

To clarify: if you split a .mgf into multiple parts and search each part separately, you will get different QValue and PepQValue values, since those are based on the observed forward and reverse proteins. In practice, what we do is split large FASTA files into parts, then search the same .mgf (or .mzML) file multiple times, once for each split FASTA file. We then recombine the results using MzidMerger: https://github.com/PNNL-Comp-Mass-Spec/MzidMerger/releases. Still, this too has the possible downside of QValue and PepQValue differences (unless MzidMerger recomputes those ... I forget).
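To make the QValue point concrete, here is a small Python sketch of a generic target-decoy q-value estimate. It is a simplified textbook version (FDR estimated as decoys / targets at each score threshold, with higher scores assumed to be better) and is not claimed to be the exact formula MS-GF+ uses. The scores and decoy flags are made up for illustration.

```python
from typing import List, Tuple

# (score, is_decoy); a higher score is assumed to be better in this sketch.
PSM = Tuple[float, bool]


def q_values(psms: List[PSM]) -> List[float]:
    """Return one q-value per PSM, in input order, using FDR = decoys / targets."""
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)
    fdrs: List[float] = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the smallest FDR at this rank or at any worse rank.
    result = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        result[order[rank]] = running_min
    return result


# Hypothetical results from two halves of one split file.
part_a: List[PSM] = [(10.0, False), (9.0, False), (5.0, True)]
part_b: List[PSM] = [(8.0, False), (7.0, True), (6.0, False)]

pooled = q_values(part_a + part_b)               # one search over all spectra
separate = q_values(part_a) + q_values(part_b)   # split searches, concatenated
```

In this toy data the two lists differ: the last two PSMs of `part_b` get 0.25 when pooled but 0.5 when `part_b` is scored alone, because each part sees only its own targets and decoys. Recomputing the q-values over the concatenated results would remove that difference in this simplified model.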

Regarding the second topic, a per-spectrum tolerance would require a code change in MS-GF+. It's probably doable, but we would need you to provide us with an example .mgf file and .mzML file that have per-spectrum tolerances; I have not yet seen a file like that. If you are able to send us a file, send an e-mail to proteomics@pnnl.gov and we will send you a link to https://fx.pnnl.gov/ which you can use to send us the files.
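For reference, one way to check whether an MGF file carries per-spectrum tolerances is to tally them. The Python sketch below assumes the tolerances are stored as `TOL=` and `TOLU=` lines inside each `BEGIN IONS` / `END IONS` block, as in Mascot-style MGF files. The file discussed in this thread may use a different convention, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tolerance_counts(lines: Iterable[str]) -> Counter:
    """Count spectra per (TOL, TOLU) pair; (None, None) means none was given."""
    counts: Counter = Counter()
    inside = False
    tol = None
    unit = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            inside, tol, unit = True, None, None
        elif line == "END IONS" and inside:
            counts[(tol, unit)] += 1
            inside = False
        elif inside and line.upper().startswith("TOL="):
            tol = float(line[4:])
        elif inside and line.upper().startswith("TOLU="):
            unit = line[5:].strip()
    return counts


# Made-up example: two spectra at 10 ppm, one at 20 ppm, one without a tolerance.
demo_tol_mgf = """BEGIN IONS
TITLE=a
TOL=10
TOLU=ppm
END IONS
BEGIN IONS
TITLE=b
TOL=20
TOLU=ppm
END IONS
BEGIN IONS
TITLE=c
TOL=10
TOLU=ppm
END IONS
BEGIN IONS
TITLE=d
END IONS
"""
demo_counts = tolerance_counts(demo_tol_mgf.splitlines())
```

More than one key in the resulting counter indicates that the file mixes tolerances, which a single global precursor tolerance setting cannot represent.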

gsaxena888 commented 4 years ago

@alchemistmatt Thank you. I will send you a file hopefully by next week (I'm constructing it!).