PRIDE-Archive / xi-mzidentml-converter

Apache License 2.0

Support for the first Mascot submission #71

Open ypriverol opened 1 month ago

ypriverol commented 1 month ago

@sureshhewabi is working on the first Mascot submission, using data from the Mascot team. A problem came up while parsing the MGF file, and an issue has already been opened in pyteomics: https://github.com/levitsky/pyteomics/issues/153.

This issue is related to support for the main search engines (#63).

colin-combe commented 1 month ago

> first Mascot submission with some data from the Mascot team

Great!

Could you share the mzIdentML file with me, please? There is something I want to check in it: something I thought I saw recently in another Mascot-generated mzIdentML file, to do with repetition of the same peptide.

sureshhewabi commented 1 month ago

@colin-combe I copied the files to Dropbox and I will share the FTP details with you.

colin-combe commented 1 month ago

Thanks.

I think there's something not right in these mzIdentML files, but not something that will stop them working in our system.

The mzid specification states:

The combination of Peptide sequence and modifications MUST be unique in the file. (Section 6.48.)

There is a complication regarding peptide uniqueness when it comes to crosslinked peptides. Setting that aside and just looking at the 'linear' (uncrosslinked) peptides, it seems that in the Mascot output they are not unique but are instead repeated every time they are identified.

This is OK for us, it works, but it's suboptimal. It bloats the files unnecessarily, then our database, and then the xiview web page takes longer to load because it is being sent duplicates of all the peptides.
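To make the uniqueness point concrete, here is a minimal sketch of how repeated linear peptides could be spotted in an mzIdentML document. It is an illustration only, not how xi-mzidentml-converter does it: the function names (`find_duplicate_peptides`, `peptide_key`) and the sample fragment are made up, the key assumes that location plus `monoisotopicMassDelta` is enough to tell modifications apart, and crosslinked peptides are ignored entirely.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def peptide_key(peptide):
    """Build a hashable key from the sequence and its modifications."""
    sequence = ""
    mods = []
    for child in peptide:
        name = local_name(child.tag)
        if name == "PeptideSequence":
            sequence = (child.text or "").strip()
        elif name == "Modification":
            # Assumption: location plus mass delta identifies a modification.
            mods.append((child.get("location"), child.get("monoisotopicMassDelta")))
    return sequence, tuple(sorted(mods, key=str))


def find_duplicate_peptides(xml_text):
    """Return {key: [peptide ids]} for every key that occurs more than once."""
    root = ET.fromstring(xml_text)
    ids_by_key = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) == "Peptide":
            ids_by_key[peptide_key(elem)].append(elem.get("id"))
    return {key: ids for key, ids in ids_by_key.items() if len(ids) > 1}


# Invented fragment: p1 and p2 are identical, p3 differs by a modification.
EXAMPLE = """
<SequenceCollection>
  <Peptide id="p1"><PeptideSequence>SAMPLEK</PeptideSequence></Peptide>
  <Peptide id="p2"><PeptideSequence>SAMPLEK</PeptideSequence></Peptide>
  <Peptide id="p3">
    <PeptideSequence>SAMPLEK</PeptideSequence>
    <Modification location="3" monoisotopicMassDelta="15.994915"/>
  </Peptide>
  <Peptide id="p4"><PeptideSequence>ELVISK</PeptideSequence></Peptide>
</SequenceCollection>
"""

duplicates = find_duplicate_peptides(EXAMPLE)
for key, ids in duplicates.items():
    print(key, ids)
```

Matching on local tag names keeps the sketch independent of which mzIdentML namespace version a file declares.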

I think it's worth taking this up with them to see what they say. (@vrkosk ?)

vrkosk commented 1 month ago

@colin-combe Do you mean cases like:

    <Peptide id="peptide_162_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>
    <Peptide id="peptide_163_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>
    <Peptide id="peptide_164_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>

I see what you mean. Mascot is currently taking a very PSM-centric view. The above are duplicate identifications of the same peptide in sequential Mascot queries. I agree it would be better if Mascot collated them into something like:

    <Peptide id="peptide_SPDKPGK">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>

And where peptide_ref="peptide_162_1" is used in , replace it with peptide_ref="peptide_SPDKPGK". This would reduce duplication in <PeptideEvidence> elements as well, which currently repeat the start and end positions and the pre and post residues needlessly:

    <PeptideEvidence id="PE_162_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_162_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
    <PeptideEvidence id="PE_163_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_163_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
    <PeptideEvidence id="PE_164_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_164_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />

I'll add a change request.
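As a rough sketch of the collation described above (not Mascot's or the converter's actual code), the following collapses repeated `<Peptide>` elements and rewrites `peptide_ref` attributes to match. The function name `collapse_peptides` is hypothetical, the `peptide_<sequence>` id scheme is taken from the suggestion in this comment, and the sketch assumes a namespace-free fragment containing only unmodified linear peptides. Real mzIdentML files are namespaced and would also need modifications included in the key.

```python
import xml.etree.ElementTree as ET


def collapse_peptides(xml_text):
    """Keep one <Peptide> per sequence and remap every peptide_ref."""
    root = ET.fromstring(xml_text)
    seen = set()   # sequences that already have a surviving <Peptide>
    remap = {}     # old peptide id -> collated peptide id

    # Snapshot the tree first so removing children is safe.
    for parent in list(root.iter()):
        for pep in list(parent):
            if pep.tag != "Peptide":
                continue
            seq = pep.findtext("PeptideSequence")
            new_id = "peptide_" + seq
            remap[pep.get("id")] = new_id
            if seq in seen:
                parent.remove(pep)      # duplicate: drop it
            else:
                seen.add(seq)
                pep.set("id", new_id)   # first occurrence: rename it

    # Point every reference at the collated id.
    for elem in root.iter():
        ref = elem.get("peptide_ref")
        if ref in remap:
            elem.set("peptide_ref", remap[ref])
    return root


FRAGMENT = """
<SequenceCollection>
  <Peptide id="peptide_162_1"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <Peptide id="peptide_163_1"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <Peptide id="peptide_164_1"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <PeptideEvidence id="PE_162_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_162_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
  <PeptideEvidence id="PE_163_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_163_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
  <PeptideEvidence id="PE_164_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_164_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
</SequenceCollection>
"""

collapsed = collapse_peptides(FRAGMENT)
print(ET.tostring(collapsed, encoding="unicode"))
```

After the remap, the three `<PeptideEvidence>` elements differ only in their `id`, so they could be merged in a similar second pass, provided anything that refers to their ids is updated as well.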

colin-combe commented 1 month ago

yes, cases like that.

colin-combe commented 1 month ago

it's a little more complicated with the crosslinked peptides, where it's the crosslinked pair of peptides that is meant to be unique

colin-combe commented 1 month ago

is it weird that in these files there are things like: `

` so the rank is 3, but it has passThreshold = true? @vrkosk ?

colin-combe commented 1 month ago

...i guess it's probably meant to be like this, guess there's no reason why not

vrkosk commented 1 month ago

A Mascot PSM is significant if expect value < sigthreshold. This is encoded as passThreshold = true in the mzIdentML export. It's perfectly possible for the rank 1, 2 and 3 matches to have a similar score and, thus, similar expect values, all of which are statistically significant. Because the ranks are ordered by score, if rank 3 has passThreshold = true, then ranks 1 and 2 must also have passThreshold = true. (I don't think this is a rule that needs to be coded anywhere, just pointing it out here.)
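Purely as an illustration of the reasoning above (as noted, it does not need to be coded anywhere), the relationship can be written as a small check. The names `passes_threshold` and `threshold_flags_consistent` are hypothetical, the default threshold of 0.05 is an assumption used only for the example, and ranks within one spectrum are assumed to be distinct.

```python
def passes_threshold(expect_value, sig_threshold=0.05):
    """A PSM is significant if its expect value is below the threshold."""
    return expect_value < sig_threshold


def threshold_flags_consistent(psms):
    """Check the rank/passThreshold relationship for one spectrum.

    psms is a list of (rank, pass_threshold) pairs. Because ranks are
    ordered by score, every passing PSM should outrank every failing one.
    """
    passing = [rank for rank, passed in psms if passed]
    failing = [rank for rank, passed in psms if not passed]
    if not passing or not failing:
        return True
    return max(passing) < min(failing)


# Ranks 1 to 3 all significant: consistent.
print(threshold_flags_consistent([(1, True), (2, True), (3, True)]))
# Rank 3 passes while rank 1 fails: not consistent.
print(threshold_flags_consistent([(1, False), (2, True), (3, True)]))
```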

colin-combe commented 1 month ago

ok, thanks. I didn't forget about this btw - https://github.com/Rappsilber-Laboratory/build-xiview/issues/87

ypriverol commented 1 month ago

@colin-combe @sureshhewabi, as soon as we are sure these files will work, let me know so we can prepare the submission for the PRIDE Archive. Excellent work. Thanks, @vrkosk, for your support; the Mascot team has always been responsive and helpful.