data format matters? - Githubissues

MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

data format matters? #142

Open hxxhust163 opened 1 year ago

hxxhust163 commented 1 year ago

Dear Mrs

Thanks for your work on MSGF+; it is really a wonderful search engine. However, I am confused about the data format used for searching. I use MSGF+ for immunopeptidomics. At first, I converted the raw file into centroided mzML; with 64 GB of RAM and 64 threads, the search took nearly 6 hours to finish for a single mzML file. Because that took so long, I converted the same raw file into centroided MGF and searched again with the same parameters; that search finished in about 20 minutes. However, the output mzid files of the two searches are not the same. Specifically, the mzML search identified twice as many PSMs as the MGF search. Why is that? It seems very strange.

hxxhust163 commented 1 year ago

I used the Comet search engine to do the same thing and got identical identification results for both formats, which means the data format itself has nothing to do with the identification results.

FarmGeek4Life commented 1 year ago

1) Are you sure you used the same parameters for the MS-GF+ mzML and MGF searches? In our testing the results generally match, or are very similar (see below). However, you may need to make the search parameters exactly the same by overriding defaults that are resolved from information available in an mzML file but not in an MGF file. See https://msgfplus.github.io/msgfplus/MSGFPlus.html, in particular '-m FragmentMethodID'.

2) One potential reason for a difference between mzML input and other formats in MS-GF+ is specific to Thermo Orbitrap instruments: MS-GF+ reads the 'Thermo Trailer Extra: Monoisotopic m/z' value that ProteoWizard/MSConvert writes to the mzML file when converting Thermo Orbitrap .raw files, and uses it instead of the 'selected m/z' value that is used for all other instruments. However, this should only lead to minor differences.
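A minimal sketch of point 1, assuming the `-s`, `-d`, `-o`, `-m` and `-inst` options behave as described in the linked documentation: both searches are given the same explicit fragmentation method and instrument, so neither depends on what the input file records. The helper name, file names and jar path are hypothetical, and the flags should be checked against the MS-GF+ version in use.

```python
def build_msgfplus_command(spectrum_file, fasta_file, output_file,
                           fragmentation_method_id=3, instrument_id=3,
                           jar_path="MSGFPlus.jar"):
    """Return an MS-GF+ command line with the fragmentation method pinned.

    Hypothetical helper. The default values (3 and 3) are the HCD and
    Q Exactive settings discussed in this thread.
    """
    return [
        "java", "-jar", jar_path,
        "-s", spectrum_file,                 # input spectra (mzML or MGF)
        "-d", fasta_file,                    # protein sequence database
        "-o", output_file,                   # mzid output
        "-m", str(fragmentation_method_id),  # explicit, not 0 (auto)
        "-inst", str(instrument_id),
    ]


# Placeholder file names; only the input format differs between the runs.
mzml_cmd = build_msgfplus_command("sample.mzML", "proteome.fasta", "sample_mzML.mzid")
mgf_cmd = build_msgfplus_command("sample.mgf", "proteome.fasta", "sample_mgf.mzid")

print(" ".join(mzml_cmd))
print(" ".join(mgf_cmd))
```

The lists can be passed to `subprocess.run` as-is; building them in one place makes it harder for the two runs to drift apart in their settings.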

hxxhust163 commented 1 year ago

Thanks very much! I figured out the problem. The data in my search was acquired on a Q Exactive instrument with HCD fragmentation. In my initial run, I used 'FragmentationMethodID=0, InstrumentID=3' in my parameters. So, when I searched the mzML file, MS-GF+ read the fragmentation method (HCD) from the file. But when I searched the MGF file with the same parameters, it treated the spectra as CID by default, because the MGF file contains no fragmentation information. That is the reason for the difference. By the way, I have another question. For the same mzML data mentioned above, which is better to use: 'FragmentationMethodID=0, InstrumentID=3' or 'FragmentationMethodID=3, InstrumentID=3'? I ran the search with each parameter set separately and got two different results. Thanks in advance!
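The behaviour described above can be illustrated with a small sketch. This is not MS-GF+ code: the function and table names are hypothetical, and the mapping of IDs 1 and 2 to CID and ETD is an assumption based on the linked documentation (the thread itself only confirms 0 and 3).

```python
# Assumed ID-to-method mapping; only 0 (auto) and 3 (HCD) are confirmed
# by this thread.
FRAGMENTATION_METHODS = {1: "CID", 2: "ETD", 3: "HCD"}


def effective_fragmentation(param_id, recorded_method=None):
    """Illustrate which fragmentation model a search would end up using.

    param_id        -- the FragmentationMethodID parameter value
    recorded_method -- method written in the spectrum file, or None when
                       the format carries no such information (e.g. MGF)

    How MS-GF+ treats spectra whose recorded method differs from an
    explicit ID is not modelled here.
    """
    if param_id == 0:
        # "As written in the spectrum, or CID if no info"
        return recorded_method or "CID"
    return FRAGMENTATION_METHODS[param_id]


print(effective_fragmentation(0, "HCD"))  # mzML input with HCD recorded
print(effective_fragmentation(0, None))   # MGF input, nothing recorded
print(effective_fragmentation(3, None))   # MGF input with an explicit ID
```

With `FragmentationMethodID=0`, the same HCD data resolves to HCD from mzML but to CID from MGF, which matches the mismatch reported in this thread; setting the ID explicitly removes that dependence on the file format.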

alchemistmatt commented 1 year ago

In theory, the only way you should get different results for FragmentationMethodID=0 vs. FragmentationMethodID=3 is if the file has a mix of HCD and non-HCD spectra; see the comment in this example parameter file: https://github.com/MSGFPlus/msgfplus/blob/master/docs/ParameterFiles/MSGFPlus_PartTryp_MetOx_20ppmParTol.txt#L15

In reality, there might be some unexpected side effect that I don't know about. I would suggest using the option that gives more filter-passing results.
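To check whether a file actually contains a mix of HCD and non-HCD spectra, the activation terms in the mzML can be counted. The following is a minimal sketch, assuming the PSI-MS accessions MS:1000133 (CID), MS:1000422 (beam-type CID, i.e. HCD) and MS:1000598 (ETD); the function names are hypothetical, and a converter may use other activation terms, which this sketch ignores.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed PSI-MS controlled-vocabulary accessions for activation methods.
ACTIVATION_TERMS = {
    "MS:1000133": "CID",
    "MS:1000422": "HCD",
    "MS:1000598": "ETD",
}


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_activation_methods(source):
    """Count known activation terms per <activation> element in an mzML file.

    source -- a file path or a binary file-like object
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "activation":
            for child in elem:
                if local_name(child.tag) == "cvParam":
                    label = ACTIVATION_TERMS.get(child.get("accession"))
                    if label is not None:
                        counts[label] += 1
        elif name == "spectrum":
            # Release the finished spectrum to keep memory use low.
            elem.clear()
    return counts
```

Calling `count_activation_methods("sample.mzML")` (placeholder path) returns a `Counter`; if it has a single key, the file is not mixed, and by the reasoning above the two settings would be expected to give the same results.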