accession attribute in DBSequence should be unique?

smdb21 commented 7 years ago

When parsing example file https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/crosslinking/xiFDR-CrossLinkExample.mzid, I find these 2 protein entries as DBSequence elements:

<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
    <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>
<DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy" name="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2">
    <cvParam cvRef="PSI-MS" accession="MS:1001088" name="protein description" value="ALBU_HUMAN Serum albumin active OS=Homo sapiens GN=ALB PE=1 SV=2"></cvParam>
</DBSequence>

Although the protein entries are different (one is the decoy entry of the other), the accession attribute is the same. My question is: should the accession attribute be unique? The specification document says this about the accession:

The unique accession of this sequence

This caused me a problem because I am collecting all proteins in a map in which the key is the accession.

What do you think?

julianu commented 7 years ago

I would definitely vote to have these accessions unique. Having the same accessions for differing entries is probably an error, and it leads to inconsistencies when mapping the peptides and PSMs to the proteins, in the given example.

colin-combe commented 4 years ago

I would vote for them not being unique.

First, there is the decoy/target example above. More generally, proteins can have the same accession number and different sequences - this is why we're not all clones, right?

If the example above leads to inconsistencies then it is an error in the software reading the file, because the id attributes are different?

lutzfischer commented 4 years ago

I also think they should be the same - it is the decoy counterpart of the target - and, unless we have a standard way of denoting them as a target-decoy pair, I would actually ask for them to have the same accession.

Being able to match these up is important for FDR estimation, as only this way can you compute a meaningful separate (target-decoy based) FDR for self/internal/intra vs between/inter.

colin-combe commented 1 year ago

could this issue be resolved/closed? I think there are reasons why they are not required to be unique. @andrewrobertjones - what do you think about this?

mobiusklein commented 1 year ago

You can have multiple search databases which could have overlapping entries, like searching all of the reviewed sequences of UniProt and then searching again with all the isoforms and unreviewed sequences enabled. The searchDatabase_ref tells you which database an entry should be resolved against. In order for the mzIdentML to be internally consistent, the id is the only field that absolutely has to be unique across all DBSequence entries.
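For anyone collecting proteins in a map, a minimal sketch of what that implies: key the map on `id`, and treat `accession` (together with `searchDatabase_ref`) as a non-unique lookup. This uses only the Python standard library; the function name `index_db_sequences` and the trimmed-down XML wrapper are illustrative assumptions, not part of the spec or of any existing parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed-down stand-in for the two DBSequence entries quoted at the top of
# this issue (child cvParams and the name attribute are omitted for brevity).
SNIPPET = """
<SequenceCollection>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_target"/>
  <DBSequence searchDatabase_ref="SDB_4299_203" accession="P02768-A" id="dbseq_P02768-A_decoy"/>
</SequenceCollection>
"""


def index_db_sequences(xml_text):
    """Return (by_id, ids_by_accession) for all DBSequence elements.

    by_id maps the unique id to the element's attributes.
    ids_by_accession maps (searchDatabase_ref, accession) to a list of ids,
    because several entries may legitimately share an accession.
    """
    by_id = {}
    ids_by_accession = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        # Compare local names so the code works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] != "DBSequence":
            continue
        seq_id = elem.get("id")
        if seq_id in by_id:
            raise ValueError(f"duplicate DBSequence id: {seq_id}")
        by_id[seq_id] = dict(elem.attrib)
        key = (elem.get("searchDatabase_ref"), elem.get("accession"))
        ids_by_accession[key].append(seq_id)
    return by_id, dict(ids_by_accession)


by_id, ids_by_accession = index_db_sequences(SNIPPET)
```

With the example above, `by_id` holds two entries while `ids_by_accession` holds one key pointing at both ids, so nothing is silently overwritten.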

The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001283 decoy DB accession regexp. (edit to correct accession per @colin-combe's catch)

Would it be better if there were an isDecoy attribute like on PeptideEvidence?

colin-combe commented 1 year ago

the id is the only field that absolutely has to be unique across all DBSequence entries

that seems sufficient info to close this

The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001450 decoy DB accession regexp

where is that documented? (apologies if it's obvious and I'm just being blind)

colin-combe commented 1 year ago

where is that documented?

right... it's shown in the example in Section 7.5 of the 1.2.0 spec (though it isn't discussed in the text).

It's because its accession is MS:1001283 (not MS:1001450 as in your message, though the link is correct in your message), that I didn't find it. (I searched for MS:1001450).

@lutzfischer - I think we've been unaware of this?

colin-combe commented 1 year ago

also, re. MS:1001283 - it's incorrectly shown as an example CV param for DatabaseName (6.20, pg. 36)? I say 'incorrectly' because the CV mapping rules given for DatabaseName wouldn't allow it? All the example CV params given for DatabaseName are wrong?

mobiusklein commented 1 year ago

Thanks for catching the accession number error earlier. I was writing in a hurry and must have copied over the wrong accession from OLS.

I think you're right about the parameters in DatabaseName.

As-is, this could only be one of the children given here: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001013&lang=en&viewMode=All&siblings=false or a userParam.

colin-combe commented 1 year ago

I will make a separate issue for the incorrect DatabaseName example cvParams.

Would it be better if there were an isDecoy attribute [on DBSequence] like on PeptideEvidence?

that sounds sensible to me, but then it is a change to the schema

lutzfischer commented 1 year ago

currently the only "reliable" way to detect if a protein is a decoy protein is to go via PeptideEvidence. But I guess there are other ways to have decoys besides extra decoy proteins - concatenated proteins come to mind - where only a part of the "protein" is decoy. Not sure what would be the best way to represent that.
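A rough sketch of the PeptideEvidence route, using only the Python standard library. The function name `decoy_status_by_db_sequence`, the ids in the sample XML, and the three labels it returns are made up for illustration; treating a missing `isDecoy` as false is an assumption based on reading the schema default. The "mixed" label is one possible way to surface the concatenated-protein case, not an established convention.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical PeptideEvidence entries; only the attributes used here are shown.
SNIPPET = """
<SequenceCollection>
  <PeptideEvidence id="pe1" peptide_ref="pep1" dBSequence_ref="dbseq_A" isDecoy="false"/>
  <PeptideEvidence id="pe2" peptide_ref="pep2" dBSequence_ref="dbseq_B" isDecoy="true"/>
  <PeptideEvidence id="pe3" peptide_ref="pep3" dBSequence_ref="dbseq_C" isDecoy="true"/>
  <PeptideEvidence id="pe4" peptide_ref="pep4" dBSequence_ref="dbseq_C" isDecoy="false"/>
</SequenceCollection>
"""


def decoy_status_by_db_sequence(xml_text):
    """Map each dBSequence_ref to 'target', 'decoy' or 'mixed'."""
    seen = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        # Compare local names so the code works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] != "PeptideEvidence":
            continue
        # xsd:boolean allows "true"/"1"; a missing attribute is assumed false.
        is_decoy = elem.get("isDecoy", "false") in ("true", "1")
        seen[elem.get("dBSequence_ref")].add(is_decoy)

    status = {}
    for ref, flags in seen.items():
        if flags == {True}:
            status[ref] = "decoy"
        elif flags == {False}:
            status[ref] = "target"
        else:
            # Some evidence is decoy and some is not, e.g. a concatenated entry.
            status[ref] = "mixed"
    return status


status = decoy_status_by_db_sequence(SNIPPET)
```

A DBSequence with no PeptideEvidence at all never shows up in the result, which is one more reason this route is only "reliable" in quotes.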

For the case of distinct decoy proteins, actually the current spec document, at least implicitly, by example, suggests different accessions:

<SearchDatabase location="/localdirectory/18.E_coli_K12_edit.fasta" id="K12_nosignal" name="K12" numDatabaseSequences="9376" releaseDate="01-2008-08-2008" version="1.0">
    <FileFormat>
        <cvParam accession="MS:1001348" name="FASTA format" cvRef="PSI-MS"/>
    </FileFormat>
    <DatabaseName>
        <userParam name="18.E_coli_K12_edit.fasta" />
    </DatabaseName>
    <cvParam accession="MS:1001197" name="DB composition target+decoy" cvRef="PSI-MS"/>
    <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
    <cvParam accession="MS:1001195" name="decoy DB type reverse" cvRef="PSI-MS"/>
</SearchDatabase>
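
A reader could apply that regexp roughly as follows. This is a sketch under assumptions: the DBSequence ids and accessions below are invented (the spec example does not show them), the function names are hypothetical, and the pattern is applied with an unanchored `re.search`, which may or may not be what the writing software intended.

```python
import re
import xml.etree.ElementTree as ET

# SearchDatabase reduced to the one cvParam of interest, plus two invented
# DBSequence entries whose accessions are purely hypothetical.
SNIPPET = """
<Root>
  <SearchDatabase id="K12_nosignal">
    <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
  </SearchDatabase>
  <DBSequence id="dbseq_1" searchDatabase_ref="K12_nosignal" accession="PROT1"/>
  <DBSequence id="dbseq_2" searchDatabase_ref="K12_nosignal" accession="Rnd_PROT1"/>
</Root>
"""


def local_name(tag):
    """Drop an XML namespace prefix of the form {uri} if present."""
    return tag.rsplit("}", 1)[-1]


def decoy_patterns(root):
    """Map SearchDatabase id to its compiled decoy accession regexp, if any."""
    patterns = {}
    for db in root.iter():
        if local_name(db.tag) != "SearchDatabase":
            continue
        for child in db:
            if local_name(child.tag) == "cvParam" and child.get("accession") == "MS:1001283":
                patterns[db.get("id")] = re.compile(child.get("value"))
    return patterns


def classify_db_sequences(xml_text):
    """Map DBSequence id to True if its accession matches the decoy regexp."""
    root = ET.fromstring(xml_text)
    patterns = decoy_patterns(root)
    result = {}
    for seq in root.iter():
        if local_name(seq.tag) != "DBSequence":
            continue
        # Resolve the regexp through searchDatabase_ref, not globally.
        pattern = patterns.get(seq.get("searchDatabase_ref"))
        accession = seq.get("accession", "")
        result[seq.get("id")] = pattern is not None and pattern.search(accession) is not None
    return result


is_decoy = classify_db_sequences(SNIPPET)
```

Note that this only works when target and decoy accessions differ, which is exactly what the identical accessions in the xiFDR example file prevent.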

accession attribute in DBSequence should be unique? #91