compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html

MzIdentMLIdfileReader questions (help needed) #18

Closed wolski closed 7 years ago

wolski commented 7 years ago

I have a question, and it would be great if you could help.

I would like to retrieve all PSMs (peptide-spectrum matches, not only the best hits) from an mzid file.

I am parsing a myrimatch mzid file using the MzIdentMLIdfileReader from.

This is my code:

    File f = new File(...);
    IdfileReader temp = new MzIdentMLIdfileReader(f);
    LinkedList tmp = temp.getAllSpectrumMatches(null, null, null, true);

Looking at a SpectrumMatch in the debugger, I see that all of them have the SAME spectrumKey, which equals the name of the parsed mzML file. This leaves me with the question of how to match a SpectrumMatch with its spectrum.

The assumptionMap of each SpectrumMatch should contain the PSMs, I would think. However, every SpectrumMatch has an assumptionMap of length 1. I would expect to see more than one peptide matching a spectrum.

Thank you for your help. Witold

mvaudel commented 7 years ago

Hi Witold and thank you for reporting this issue.

Can you verify that your mzIdentML file contains spectrum titles? We need them to distinguish spectra, so missing spectrum titles might explain why you cannot link back to the original spectrum :) Spectrum titles are indicated by the CV term "MS:1000796". If there are none, I can try to change the parser to use an index, scan number, or anything similar.

If there are multiple PSMs per spectrum, they should all be in this map. Note that with only one search engine the length of the map will be 1 regardless of the number of PSMs, since they are indexed by algorithm and then by score, if I recall correctly. It is hard to tell why only one PSM is retained in your case. Would it be possible for you to share the file with us?
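One way to run this check without opening the file by hand is to stream through it and count the cvParam elements that carry the accession MS:1000796. The following is a minimal sketch using only the JDK's StAX parser; the class and method names are hypothetical and not part of compomics-utilities, and it assumes the titles are stored as ordinary cvParam elements.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of compomics-utilities.
final class SpectrumTitleCheck {

    static final String SPECTRUM_TITLE_ACCESSION = "MS:1000796";

    // Counts the cvParam elements whose accession attribute is MS:1000796.
    static int countSpectrumTitles(InputStream mzid) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(mzid);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "cvParam".equals(reader.getLocalName())
                        && SPECTRUM_TITLE_ACCESSION.equals(reader.getAttributeValue(null, "accession"))) {
                    count++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return count;
    }

    // Usage: java SpectrumTitleCheck.java path/to/file.mzid
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("spectrum title cvParams: " + countSpectrumTitles(in));
        }
    }
}
```

A count of zero would suggest that the file carries no spectrum titles and that another identifier is needed to link back to the spectra.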
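To illustrate why the outer map can have length 1 while still holding several PSMs, here is a small sketch of a map indexed first by algorithm and then by score. It is an assumption-laden stand-in and not the actual SpectrumMatch implementation: the class name, the integer algorithm ids, and the use of plain peptide strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not the compomics-utilities class.
final class AssumptionMapSketch {

    // algorithm id -> score -> peptides identified with that score
    private final Map<Integer, TreeMap<Double, List<String>>> assumptions = new HashMap<>();

    void add(int algorithmId, double score, String peptide) {
        assumptions.computeIfAbsent(algorithmId, k -> new TreeMap<>())
                   .computeIfAbsent(score, k -> new ArrayList<>())
                   .add(peptide);
    }

    // Size of the outer map: one entry per search engine.
    int algorithmCount() {
        return assumptions.size();
    }

    // Total number of PSMs across all algorithms and scores.
    int psmCount() {
        int total = 0;
        for (TreeMap<Double, List<String>> byScore : assumptions.values()) {
            for (List<String> peptides : byScore.values()) {
                total += peptides.size();
            }
        }
        return total;
    }
}
```

With three peptides added under a single algorithm id, algorithmCount() is 1 while psmCount() is 3, so the number of PSMs has to be read from the inner levels rather than from the size of the outer map.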

Best regards,

Marc

hbarsnes commented 7 years ago

Hi Witold,

> The assumptionMap for each SpectrumMatch will contain the PSM's, I would think. However, all the SpectrumMatch(es) have an assumptionMap of length 1. I would expect to see more than one peptide matching a Spectrum.

I just pushed a fix that should solve this problem: https://github.com/compomics/compomics-utilities/commit/b57abed2bf9d7ff87af2f9652b0bdc1feddf07d3. It would be great if you could test it on your end.

> Looking at a SpectrumMatch in the debugger I see that all of them have the SAME spectrumKey which equals to the name of the mzML file parsed. Which leaves me with the question how to match the SM with the spectrum?

This is because we depend on the spectrum title to map back to the correct spectrum in the mgf file (mgf is the only spectrum format we currently support in PeptideShaker). On line 1032 a temporary spectrum key is created from the spectrum file name plus "temp"; it is later replaced on line 1257 if the spectrum title has been found in the CV terms.

But as far as I can tell, the MS:1000796 "spectrum title" CV term is not used by MyriMatch (not even for mgf input). Each spectrum match missing a spectrum title therefore also gets a spectrum number (line 1037) that we can later use to figure out which spectrum the given index refers to. But this only happens if the spectrum id is index based, such as spectrumID="index=1477", which does not seem to be the case when using mzML as input to MyriMatch.

So the question becomes what to use when none of the above is available. One option would perhaps be to use the spectrum id value anyway. In your dataset this would be something like spectrumID="controllerType=0 controllerNumber=1 scan=3". It would be great if you could make the required code change on your end (around line 1257, or perhaps around line 1024) to use spectrumID if spectrumTitle is missing, and see if this enables you to link back to the correct spectrum in your mzML file.
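The proposed fallback could be sketched roughly as follows. This is not the code in MzIdentMLIdfileReader: the class name, the method, and the "|" separator are hypothetical and only show the order of preference, with the spectrum title first and the raw spectrumID second.

```java
// Hypothetical sketch of the fallback; not the MzIdentMLIdfileReader code.
final class SpectrumKeyFallback {

    // Assumed separator, chosen only for this example.
    static final String SEPARATOR = "|";

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Builds a key from the file name and the spectrum title,
    // falling back to the raw spectrumID when no title is available.
    static String spectrumKey(String fileName, String spectrumTitle, String spectrumId) {
        String label = isBlank(spectrumTitle) ? spectrumId : spectrumTitle;
        if (isBlank(label)) {
            throw new IllegalArgumentException("Neither spectrum title nor spectrumID available");
        }
        return fileName + SEPARATOR + label.trim();
    }
}
```

Because the raw spectrumID is unique within one spectra file, such a key would at least differ between spectra, even if mapping it back to the spectrum still requires a reader that understands the id format.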

However, as you can see from the mzIdentML documentation (http://www.psidev.info/sites/default/files/mzIdentML1.1.0.doc, page 8), the spectrumID tag can take many forms, and I'm not sure how easy it is to support all of them in a generic way. That is why we started with the simple index option that we needed for mgf. See the TODO on line 1024. ;-)
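For the two forms mentioned in this thread, "index=1477" and "controllerType=0 controllerNumber=1 scan=3", a generic first step could be to split the spectrumID into whitespace-separated name=value pairs. The sketch below is a hypothetical helper written under that assumption; it does not cover spectrumID forms that are not made of name=value tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; handles only name=value style ids.
final class SpectrumIdParser {

    // Splits e.g. "controllerType=0 controllerNumber=1 scan=3" into its fields.
    // Tokens without '=' are ignored.
    static Map<String, String> parse(String spectrumId) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : spectrumId.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    // Returns the named field as an integer, or null if it is absent or not numeric.
    static Integer intField(String spectrumId, String name) {
        String value = parse(spectrumId).get(name);
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

A caller could then ask for "index" first and "scan" second, keeping the two apart, since an index is a position in the file while a scan number is an instrument-assigned identifier.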

Best regards, Harald

hbarsnes commented 7 years ago

Hi Witold,

Can you give me an update regarding whether my fix to the MzIdentMLIdfileReader solved the problems you detected so that we can potentially close this issue?

Best regards, Harald

hbarsnes commented 7 years ago

Hi Witold,

As we haven't heard back from you in a long while, we'll assume that the issue has been resolved. If this is not the case, please let us know and we'll reopen the issue.

Best regards, Harald