HUPO-PSI / mzIdentML

Repository for mzIdentML and the corresponding examples

Redundancy of cross-links in the <Peptide> element #100

Closed mhoopmann closed 5 years ago

mhoopmann commented 6 years ago

I have a few questions with mzIdentML 1.2 and cross-linking implementation. I have also been using the validator to try to understand what is going on when implementing 1.2 on cross-linked database search results generated in-house.

I have collected multiple spectra for the same precursor ion, and the search results present the same cross-linked pair of peptides as a PSM. In other words: multiple spectra with identical results. For each of these SpectrumIdentificationResult elements, I have two SpectrumIdentificationItem elements with a cvParam "accession" of MS:1002511 and a "value" with the unique identifier for the cross-link (donor and acceptor) in my Peptide element.

When I have 3 or more of these spectra with identical PSMs, the validator says I'm using the value of my cvParam in the SpectrumIdentificationItem too many times. It appears that the validator doesn't like that I'm using the same Peptide elements with more than one SpectrumIdentificationResult when specifically referring to cross-linked peptides.

So my questions are:

  1. Does this mean that all cross-linked peptides must be listed redundantly in the Peptide elements?

  2. Does this mean the unique "value" in the cvParam of the Modification element for the cross-linked site of these redundant Peptides elements is simply to circumvent the definition of a Peptide Element (which explicitly forbids redundancy) as presented in chapter 6.48 of the mzIdentML schema?

  3. Or is this a bug in the validator?

  4. And should the validator be able to identify these redundancies when they happen in only 2 spectra? (I am assuming that the validator assumes these particular cases are isotope labelled cross-linkers, which they are definitely not in my results).

  5. Probably a question for a different topic, but is all this redundancy necessary? For example, it appears I have to list the same peptide sequence, with the same modifications, multiple times for every partner peptide it may be linked to. Worse still, its status of donor or acceptor can alternate based on its partner peptide: but this has no bearing on any chemical significance, and is merely an artifact of sequence string length. And thus the sequence is again(!!) redundant in the Peptide element list for simply flipping the cvParam on the same Modification.

Any clarity here would be greatly appreciated. Perhaps I'm just misunderstanding something very basic.

Thanks, Mike

andrewrobertjones commented 6 years ago

To me this sounds like a bug in the validator. Can you post an example mzid file for us to take a look at?

mhoopmann commented 6 years ago

Sure thing: https://regis-web.systemsbiology.net/PublicDatasets/mikeh/mzID/test.mzid

Please note that our in-house implementation is still a work-in-progress for cross-linking. That being said, the redundancy issue is quite apparent.

Thanks much!

germa commented 6 years ago

Dear Mike,

Sorry for the late reply. In general we want to avoid redundancy in mzIdentML whenever possible. Therefore cross-linked peptides should also be listed only once in the Peptide elements.

  1. For example, in the following I listed all SpectrumIdentificationItems with value="2301" for the cvParam MS:1002511. The first two reference the peptides "Pep5238" and "Pep5237", which again have the correct value="2301" for the cvParams MS:1002509 and MS:1002510.

  2. But the next two SpectrumIdentificationItems with value="2301" for the cvParam MS:1002511 reference the peptides "Pep5176" and "Pep5175", which have another value="2271" for their cvParams MS:1002509 and MS:1002510.

  3. The last two SpectrumIdentificationItems with value="2301" for the cvParam MS:1002511 reference the peptides "Pep5027" and "Pep5028", which have another value="2201" for their cvParams MS:1002509 and MS:1002510.

So I think the problem is that the values are not correctly matched for the cases 2 and 3.
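The comparison described above can be sketched in Python. This is only an illustration, not the validator's code: the function names and the tiny XML fragment are made up, the fragment omits most attributes a real mzIdentML file requires, and the donor/acceptor assignment of the example peptides is arbitrary. Whether the specification actually requires these values to match is questioned later in this thread.

```python
import xml.etree.ElementTree as ET

# PSI-MS accessions discussed in this thread
DONOR, ACCEPTOR, XL_SII = "MS:1002509", "MS:1002510", "MS:1002511"

def local(tag):
    """Drop any '{namespace}' prefix so the sketch ignores the schema version."""
    return tag.rsplit("}", 1)[-1]

def find_all(parent, name):
    return [e for e in parent.iter() if local(e.tag) == name]

def cv_values(element, accessions):
    """Collect the 'value' of every cvParam under element with one of the accessions."""
    return {cv.get("value") for cv in find_all(element, "cvParam")
            if cv.get("accession") in accessions}

def mismatched_items(xml_text):
    """List (SII id, peptide_ref, value) where the SII's MS:1002511 value is not
    among the donor/acceptor values of the Peptide it references."""
    root = ET.fromstring(xml_text.strip())
    peptide_values = {p.get("id"): cv_values(p, {DONOR, ACCEPTOR})
                      for p in find_all(root, "Peptide")}
    mismatches = []
    for sii in find_all(root, "SpectrumIdentificationItem"):
        ref = sii.get("peptide_ref")
        for value in cv_values(sii, {XL_SII}):
            if value not in peptide_values.get(ref, set()):
                mismatches.append((sii.get("id"), ref, value))
    return mismatches

# Hypothetical, heavily simplified fragment (not schema-valid)
EXAMPLE = """
<MzIdentML>
  <Peptide id="Pep5238"><Modification><cvParam accession="MS:1002509" value="2301"/></Modification></Peptide>
  <Peptide id="Pep5237"><Modification><cvParam accession="MS:1002510" value="2301"/></Modification></Peptide>
  <Peptide id="Pep5176"><Modification><cvParam accession="MS:1002509" value="2271"/></Modification></Peptide>
  <SpectrumIdentificationItem id="SII_1" peptide_ref="Pep5238"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_2" peptide_ref="Pep5237"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_3" peptide_ref="Pep5176"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
</MzIdentML>
"""

print(mismatched_items(EXAMPLE))  # [('SII_3', 'Pep5176', '2301')]
```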

NNFTYVGNGAHLKYYKDCQYR HGIKANGSSM LDVPKQSSQR FQNLDKK CNGPAGTVCLTTFKDVCANR NHEEEMKSMQGSSR

Best regards, Gerhard

germa commented 6 years ago

Hoopmann_ERROR_Report.docx

mhoopmann commented 6 years ago

Hi Gerhard, Thanks much for responding. I apologize for the bug in my mzID file that appeared in some cases, but the issue remained for many of the other SpectrumIdentificationItems for which the values were correctly labeled. To illustrate the problem, I'm providing an updated version of the file with the errors that you mention fixed, but the validator still reports problems.

https://regis-web.systemsbiology.net/PublicDatasets/mikeh/mzID/test2.mzid

I appreciate your time with this issue!

Thanks, Mike

colin-combe commented 6 years ago

~for what its worth, I also think the file is correct in this respect but get unwarranted errors from v1.4.29 of the validator~

@mhoopmann - it's missing the start and end values for PeptideEvidences; they're required if not from a de novo search, right? (https://github.com/HUPO-PSI/mzIdentML/issues/103)

colin-combe commented 6 years ago

> I listed all SpectrumIdentificationItems with value="2301" for the cvParam MS:1002511. The first two reference the peptides "Pep5238" and "Pep5237", which again have for the cvParams MS:1002509 and MS:1002510 the correct value="2301"

I could easily be mistaken, but I can't see anything about this in the specification document? Also, I think in the cross-linking example given in the specification doc these values do not match?

All I can see is:

> the value acts as a local identifier within the SpectrumIdentificationResult to group these two elements together

I think the problem with the validator is that it's restricting repeat values of MS:1002511 to a maximum of two within the SpectrumIdentificationList rather than two within the SpectrumIdentificationResult? (seems consistent with the errors it gives)
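To make the difference between the two scopes concrete, here is a minimal Python sketch that counts MS:1002511 values once over the whole list and once per SpectrumIdentificationResult. It is only an illustration of the hypothesis above, not the validator's implementation; the helper names and the simplified XML fragment (which is not schema-valid) are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XL_SII = "MS:1002511"  # cross-link spectrum identification item

def local(tag):
    """Drop any '{namespace}' prefix so the sketch ignores the schema version."""
    return tag.rsplit("}", 1)[-1]

def find_all(parent, name):
    return [e for e in parent.iter() if local(e.tag) == name]

def count_xl_values(xml_text):
    """Return (per_list, per_result) counts of MS:1002511 values."""
    root = ET.fromstring(xml_text.strip())
    per_list = Counter()   # counted over the whole SpectrumIdentificationList
    per_result = {}        # counted separately inside each SpectrumIdentificationResult
    for sir in find_all(root, "SpectrumIdentificationResult"):
        counts = Counter()
        for sii in find_all(sir, "SpectrumIdentificationItem"):
            for cv in find_all(sii, "cvParam"):
                if cv.get("accession") == XL_SII:
                    counts[cv.get("value")] += 1
        per_result[sir.get("id")] = counts
        per_list.update(counts)
    return per_list, per_result

# Hypothetical fragment: two spectra, each identifying the same cross-linked pair
EXAMPLE = """
<SpectrumIdentificationList id="SIL_1">
  <SpectrumIdentificationResult id="SIR_1">
    <SpectrumIdentificationItem id="SII_1_1"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
    <SpectrumIdentificationItem id="SII_1_2"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
  <SpectrumIdentificationResult id="SIR_2">
    <SpectrumIdentificationItem id="SII_2_1"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
    <SpectrumIdentificationItem id="SII_2_2"><cvParam accession="MS:1002511" value="2301"/></SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
</SpectrumIdentificationList>
"""

per_list, per_result = count_xl_values(EXAMPLE)
print(per_list["2301"])             # 4: exceeds a "maximum of two" applied list-wide
print(per_result["SIR_1"]["2301"])  # 2: fine when the limit is applied per result
```

With a limit of two applied to `per_list`, any pair identified in more than one spectrum would be rejected, which matches the behaviour reported at the top of this issue.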

germa commented 6 years ago

Should be solved now by new version mzIdentMLValidator_GUI_v1.4.30-SNAPSHOT.zip

colin-combe commented 6 years ago

> Should be solved now by new version mzIdentMLValidator_GUI_v1.4.30-SNAPSHOT.zip

No, I don't think so.

I think the validator is now passing files where there are 4 SpectrumIdentificationItems containing an MS:1002511 cv param with the same value within a single SpectrumIdentificationResult.

So the validator has just changed from failing valid files to passing invalid files?

colin-combe commented 6 years ago

@germa - yeah, sorry, there were some things I was unaware of when I wrote that previous post, see #105

germa commented 6 years ago

I think the problem is that SpectrumIdentificationItems (SIIs) containing an MS:1002511 cvParam with the same value do not occur only in pairs within the same SpectrumIdentificationResult (SIR), for two reasons:

First, if isotope-labelled linkers are used, four SIIs with the same value can occur.

Second, the SIIs can also occur multiple times because of the charge states.

Therefore, when the same peptide is identified in n different charge states, as in your example file test2.mzid, one can have in total 2n SIIs (if no isotope-labelled linkers were used) or 4n SIIs (if isotope-labelled linkers were used) with the same value for the cvParam MS:1002511 (cross-link spectrum identification item).

Unfortunately, these cases are not yet explicitly considered or distinguished in the spec doc, which causes the confusion here; see also https://github.com/HUPO-PSI/mzIdentML/issues/105 and https://github.com/Rappsilber-Laboratory/xiSPEC_ms_parser/issues/12

colin-combe commented 6 years ago

hi - thanks for reply, I agree that some cross-linking use cases are not fully worked out in the current version of the spec.

I'm now confused about how charge states are dealt with even in the simple case:

> Secondly, the SII's can also occur multiple times because of the charge states.

I don't know, but it looks to me like this isn't the case? Statements in spec doc like "chargeState MUST be identical over both SII elements" imply to me that if charge state was different it would be a different pair of SII elements?

Could you point to something that indicates there can be multiples of two SII's with the same MS:1002511 value and different charge states?

colin-combe commented 6 years ago

Hi again,

I'm going to mention some people who might be able to help with the preceding question - whether the SII pairs (linked by the same MS:1002511 value) can occur multiple times because of different charge states:

@lutzfischer , @lars-kolbowski, @andrewrobertjones

Any ideas folks?

@germa, as before, if you can point to something that indicates it works as you say then that would also help resolve this.

This is more specifically issue #105 but the discussion is happening here,

best wishes, Colin

germa commented 6 years ago

OK, I think you're right: according to the explanation for Feature D in Figure 3 in the spec doc ("The ... and chargeState MUST be identical over both SII elements"), the MS:1002511 values for different charge states must indeed be different. So I will change the validator to check this and to allow only 2 SIIs, or 4 for isotope-labelled linkers, with the same value for the MS:1002511 cvParam.

germa commented 5 years ago

adapted in version mzIdentMLValidator_GUI_v1.4.32-SNAPSHOT