MSGFPlus / msgfplus

MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring MS/MS spectra against peptides derived from a protein sequence database.

Read ion mobility from mzML and write to mzIdentML #32

Closed chambm closed 6 years ago

chambm commented 6 years ago

The PSI CV has been tweaked to allow ion mobility terms to be put in the mzIdentML at the SpectrumIdentificationResult the same way scan start time already could be: https://sourceforge.net/p/psidev/mailman/message/36317835/

How hard would it be to get MS-GF+ to carry this attribute through to the output mzIdentML?

FarmGeek4Life commented 6 years ago

I'm assuming that this will only matter for mzML input? (I can't think of how it would be encoded in other supported spectrum input formats). Are you needing all three CV terms that have the "is_a: MS:1002892 ! ion mobility attribute" relationship ('MS:1001581 FAIMS compensation voltage', 'MS:1002476 ion mobility drift time' and 'MS:1002815 inverse reduced ion mobility'), or just the ion mobility drift time? I ask because it would be a bit easier to add just one, and because I don't know if the library MS-GF+ uses to read mzML has the relationship mappings that would let it just read all cvParams that can be PSM-level attributes; it currently reads data by referring to specific accession numbers.
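For illustration, reading by specific accession numbers could look roughly like the sketch below. `CvParam` and `IonMobilityAccessions` are hypothetical stand-ins, not the jmzML or MS-GF+ classes; only the three accessions and their names are taken from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified cvParam (NOT the jmzML model class).
final class CvParam {
    final String accession;
    final String name;
    final String value;

    CvParam(String accession, String name, String value) {
        this.accession = accession;
        this.name = name;
        this.value = value;
    }
}

final class IonMobilityAccessions {
    // The three terms with "is_a: MS:1002892 ! ion mobility attribute".
    static final Map<String, String> SUPPORTED = new LinkedHashMap<>();
    static {
        SUPPORTED.put("MS:1001581", "FAIMS compensation voltage");
        SUPPORTED.put("MS:1002476", "ion mobility drift time");
        SUPPORTED.put("MS:1002815", "inverse reduced ion mobility");
    }

    // Returns the first cvParam whose accession is one of the supported terms.
    static Optional<CvParam> findIonMobility(List<CvParam> params) {
        return params.stream()
                .filter(p -> SUPPORTED.containsKey(p.accession))
                .findFirst();
    }
}
```

A lookup table like this avoids needing the ontology's relationship mappings, at the cost of having to add any future ion mobility term by hand.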

The important classes here are:

I think coding this could be done in less than an hour, testing the functionality is a different story.

chambm commented 6 years ago

Mostly mzML for now, but parsing it from the MGF title is a possibility as well (although getting the specific CV term and units would be tricky).

I think all 3 IMS types should be supported, yes. An amusing implementation would be just to take the whole cvParam element (with one of the 3 supported accessions) as a string and plop in the mzIdentML rather than trying to parse it into value, type, and units.

FarmGeek4Life commented 6 years ago

Well, I don't think the mzML parsing/mzid writing library used will let me just copy the whole string from one to the other, although I could possibly just store the cvParam object(s) to transfer them from one to the other; parsing the value, type, and units isn't hard due to that library.

chambm commented 6 years ago

Indeed. And the jmzml and jmzidml models use distinct cvParams classes so you can't plop one into the other. So it seems 2 values will have to be carried through: the accession and the value (the unit is implied by the accession, i.e. the mobility value type).
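A minimal sketch of carrying just those two values through, assuming a hypothetical `IonMobilityValue` holder and a hand-written string for the output element (the real code would build a jmzIdentML cvParam object instead). The unit names in the table are assumptions for illustration and should be checked against the PSI-MS CV.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical carrier for the two values read from the mzML spectrum.
final class IonMobilityValue {
    final String accession;
    final double value;

    IonMobilityValue(String accession, double value) {
        this.accession = accession;
        this.value = value;
    }
}

final class IonMobilityWriter {
    // accession -> {term name, unit name}. Unit names are ASSUMED, not from the thread.
    private static final Map<String, String[]> TERMS = new HashMap<>();
    static {
        TERMS.put("MS:1001581", new String[] {"FAIMS compensation voltage", "volt"});
        TERMS.put("MS:1002476", new String[] {"ion mobility drift time", "millisecond"});
        TERMS.put("MS:1002815", new String[] {"inverse reduced ion mobility",
                "volt-second per square centimeter"});
    }

    // Rebuilds an output cvParam element; name and unit are derived from the accession.
    static String toCvParamXml(IonMobilityValue im) {
        String[] term = TERMS.get(im.accession);
        if (term == null) {
            throw new IllegalArgumentException("Unsupported accession: " + im.accession);
        }
        return String.format(Locale.ROOT,
                "<cvParam cvRef=\"MS\" accession=\"%s\" name=\"%s\" value=\"%s\" unitName=\"%s\"/>",
                im.accession, term[0], Double.toString(im.value), term[1]);
    }
}
```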

Let me know if you want me to test it. Thanks!

FarmGeek4Life commented 6 years ago

I just made a commit that should provide this functionality. Do you want me to provide a binary, or are you okay with checking out and compiling the current master branch?

chambm commented 6 years ago

I don't think I ever set up the build environment for this, so a binary would be nice (or I can wait for the next release).

FarmGeek4Life commented 6 years ago

https://github.com/MSGFPlus/msgfplus/releases/tag/IMS_CV_Preview

chambm commented 6 years ago

Didn't seem to work for a Waters HDDDA. Here's the input and output and the command I used. HDDDA.zip

I just used a random FASTA I had around. Just needed to see one SpectrumIdentificationResult and that's all I got. Might need to be a pretty large FASTA to get a random hit. I'm not sure what species this sample is, or if it's even peptides. :)

The first test I did was a legit search with TIMS PASEF data, but that failed at the end (see my PR to fix that).

FarmGeek4Life commented 6 years ago

https://github.com/MSGFPlus/msgfplus/releases/tag/IMS_CV_Preview2

That HDDDA mzML file says that all spectra are profile, which MS-GF+ skips. However, it also has an internal evaluation that might be deciding that the spectra are centroided (it looks for a median difference of >=50 PPM between the m/z values of consecutive peaks in a spectrum). That would be the case if you didn't see an error saying that it "skipped spectrum x since it is not centroided".
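The check described above can be sketched roughly as follows. This is a reconstruction from the description, not the actual MS-GF+ code: the class and method names are hypothetical, the gap is measured relative to the lower of the two peaks, and treating spectra with fewer than two peaks as "not centroided" is an arbitrary choice for the sketch.

```java
import java.util.Arrays;

final class CentroidHeuristic {
    static final double PPM_THRESHOLD = 50.0;

    // Median gap between consecutive peaks, in ppm of the lower m/z.
    // Expects m/z values sorted in ascending order.
    static double medianPpmGap(double[] sortedMz) {
        if (sortedMz.length < 2) {
            return Double.NaN;
        }
        double[] gaps = new double[sortedMz.length - 1];
        for (int i = 1; i < sortedMz.length; i++) {
            gaps[i - 1] = (sortedMz[i] - sortedMz[i - 1]) / sortedMz[i - 1] * 1e6;
        }
        Arrays.sort(gaps);
        int mid = gaps.length / 2;
        return gaps.length % 2 == 1 ? gaps[mid] : (gaps[mid - 1] + gaps[mid]) / 2.0;
    }

    // Profile data has densely spaced points, so its median gap is small;
    // centroided peaks are usually much further apart.
    static boolean looksCentroided(double[] sortedMz) {
        double median = medianPpmGap(sortedMz);
        return !Double.isNaN(median) && median >= PPM_THRESHOLD;
    }
}
```

Very sparse profile spectra can have large gaps between the few points that are present, which is one way a heuristic like this could judge them to be centroided.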

Overall, if you're able to get meaningful information out of a MS-GF+ search on IMS/TIMS data, that would be great since MS-GF+ was never designed to work on such data (and if it does work reasonably well, then we might need to introduce some new scoring models to properly accommodate it).

chambm commented 6 years ago

I'm still not seeing the CV term carried through. It says the build is from 6-28. Are you giving me the right binary?

I know the data is ridiculous:

"I'm not sure what species this sample is, or if it's even peptides."

All I wanted was a single SIR to test whether the cvParam is getting carried through. It doesn't need to be a legit result. Ironically, when I ran the default CWT peak picker on this data so that they really were centroided, I got NO results. It only gets better when I turn the SNR down to 0. The ion mobility spectra are very sparse even in profile mode, so that's probably why.

FarmGeek4Life commented 6 years ago

Well, the date that MS-GF+ outputs is only manually updated, and I haven't updated it yet. I will have to try that file with some fasta file here, while using the debugger, to figure out exactly what's going on.

FarmGeek4Life commented 6 years ago

Well, found the main bug, which also affects other searches: MS-GF+ originally only checked the scanList in mzML spectra for the "[Thermo Trailer Extra]Monoisotopic M/Z:" userParam, so there was a check to only enter an if statement if there was at least one userParam in scanList:scan[0]. This bug also meant that the scan start time would not be output for a search on data from, say, an Agilent QTOF.

https://github.com/MSGFPlus/msgfplus/releases/tag/v2018.07.17 fixes it. I did see the desired cvParam in the single search result I got (searching against a human refseq fasta file I had on hand).
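The shape of the bug, as described above, can be illustrated with a simplified model. Everything here is hypothetical (`Scan`, `ScanReader`, and both methods are made up for the sketch and are not the MS-GF+ source); it only shows how gating all scan-level reads on the presence of a userParam drops cvParams for files that have none.

```java
import java.util.Map;

// Hypothetical, simplified view of scanList:scan[0].
final class Scan {
    final Map<String, String> cvParams;   // accession -> value
    final Map<String, String> userParams; // name -> value

    Scan(Map<String, String> cvParams, Map<String, String> userParams) {
        this.cvParams = cvParams;
        this.userParams = userParams;
    }
}

final class ScanReader {
    static final String SCAN_START_TIME = "MS:1000016";

    // Buggy shape: cvParams are only looked at if some userParam exists.
    static String startTimeBuggy(Scan scan) {
        String startTime = null;
        if (!scan.userParams.isEmpty()) {
            // (vendor-specific userParam handling would go here)
            startTime = scan.cvParams.get(SCAN_START_TIME);
        }
        return startTime;
    }

    // Fixed shape: cvParams are read regardless of userParams.
    static String startTimeFixed(Scan scan) {
        return scan.cvParams.get(SCAN_START_TIME);
    }
}
```

With a scan that carries a scan start time but no userParams, the first method returns `null` and the second returns the value.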

chambm commented 6 years ago

Excellent. Bruker TIMS results now have both ion mobility and scan time. I hadn't realized they were missing scan time previously. Thanks!