Open smdb21 opened 7 years ago
I would definitely vote to have these accessions unique. Having the same accessions for differing entries is probably an error, and it leads to inconsistencies when mapping the peptides and PSMs to the proteins, in the given example.
I would vote for them not being unique.
First, there is the decoy/target example above. More generally, proteins can have the same accession number and different sequences - this is why we're not all clones, right?
If the example above leads to inconsistencies then it is an error in the software reading the file, because the id attributes are different?
I also think they should be the same: it is the decoy counterpart of the target, and unless we have a standard way of denoting them as a target-decoy pair, I would actually ask for them to have the same accession.
Being able to match these up is important for FDR-estimations, as only this way you can make a meaningful separate (target decoy based) FDR for self/internal/intra vs between/inter.
could this issue be resolved/closed? I think there are reasons why they are not required to be unique. @andrewrobertjones - what do you think about this?
You can have multiple search databases which could have overlapping entries, like searching all of the reviewed sequences of UniProt and then searching again with all the isoforms and unreviewed sequences enabled. The searchDatabase_ref tells you which database an entry should be resolved against. In order for the mzIdentML to be internally consistent, the id is the only field that absolutely has to be unique across all DBSequence entries.
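A minimal sketch of what that means for a reader, assuming a tiny namespace-free fragment invented for illustration (the id and accession values below are hypothetical, and real files declare an XML namespace, which the local_name helper strips): index DBSequence elements by id, and treat accession only as a grouping key within a searchDatabase_ref.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two DBSequence entries sharing one accession.
SAMPLE = """
<SequenceCollection>
  <DBSequence id="dbseq_P12345_target" accession="P12345" searchDatabase_ref="SDB_1"/>
  <DBSequence id="dbseq_P12345_decoy" accession="P12345" searchDatabase_ref="SDB_1"/>
</SequenceCollection>
""".strip()


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def index_db_sequences(root):
    """Map DBSequence id -> element; id is the attribute that must be unique."""
    by_id = {}
    for el in root.iter():
        if local_name(el.tag) != "DBSequence":
            continue
        seq_id = el.get("id")
        if seq_id in by_id:
            raise ValueError("duplicate DBSequence id: %r" % seq_id)
        by_id[seq_id] = el
    return by_id


def group_by_accession(by_id):
    """Group ids by (searchDatabase_ref, accession); a group may hold several ids."""
    groups = {}
    for seq_id, el in by_id.items():
        key = (el.get("searchDatabase_ref"), el.get("accession"))
        groups.setdefault(key, []).append(seq_id)
    return groups


root = ET.fromstring(SAMPLE)
by_id = index_db_sequences(root)
groups = group_by_accession(by_id)
```

With this layout, a shared accession shows up as a group with more than one id instead of one entry silently overwriting the other.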
The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001283 decoy DB accession regexp. (edit to correct accession per @colin-combe's catch)
Would it be better if there were an isDecoy attribute like on PeptideEvidence?
the id is the only field that absolutely has to be unique across all DBSequence entries
that seems sufficient info to close this
The "supported" method for including decoy proteins in your search database involves adding some marker to the accession attribute of the DBSequence protein, and specifying a regex for matching that marker in your SearchDatabase element using MS:1001450 decoy DB accession regexp
where is that documented? (apologies if it's obvious and I'm just being blind)
where is that documented?
right... it's shown in the example in Section 7.5 of 1.2.0 spec (though it isn't discussed in the text).
It's because its accession is MS:1001283 (not MS:1001450 as in your message, though the link is correct in your message), that I didn't find it. (I searched for MS:1001450).
@lutzfischer - I think we've been unaware of this?
also, re. MS:1001283 - it's incorrectly shown as an example CV param for DatabaseName (6.20, pg. 36)? I say 'incorrectly' because the CV mapping rules given for DatabaseName wouldn't allow it? All the example CV params given for DatabaseName are wrong?
Thanks for catching the accession number error earlier. I was writing in a hurry and must have copied over the wrong accession from OLS.
I think you're right about the parameters in DatabaseName.
As-is, this could only be one of the children given here: https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001013&lang=en&viewMode=All&siblings=false or a userParam.
I will make a separate issue for the incorrect DatabaseName example cvParams.
Would it be better if there were an isDecoy attribute [on DBSequence] like on PeptideEvidence?
that sounds sensible to me, but then it is a change to the schema
currently the only "reliable" way to detect if a protein is a decoy protein is to go via PeptideEvidence. But I guess there are other ways to have decoys besides extra decoy proteins - concatenated proteins come to mind - where only a part of the "protein" is decoy. Not sure what would be the best way to represent that.
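A rough sketch of the PeptideEvidence route, assuming a namespace-free fragment with made-up ids, and treating a missing isDecoy as false (an assumption in this sketch): collect the isDecoy flags per dBSequence_ref and label each protein as target, decoy, or mixed. The mixed label is only a placeholder for the concatenated case, not something the spec defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; all ids are invented for illustration.
SAMPLE = """
<SequenceCollection>
  <PeptideEvidence id="pe1" dBSequence_ref="seq_t" peptide_ref="p1" isDecoy="false"/>
  <PeptideEvidence id="pe2" dBSequence_ref="seq_d" peptide_ref="p2" isDecoy="true"/>
  <PeptideEvidence id="pe3" dBSequence_ref="seq_c" peptide_ref="p3" isDecoy="false"/>
  <PeptideEvidence id="pe4" dBSequence_ref="seq_c" peptide_ref="p4" isDecoy="true"/>
</SequenceCollection>
""".strip()


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_bool(value):
    """xsd:boolean accepts 'true'/'1' and 'false'/'0'."""
    return value in ("true", "1")


def decoy_status(root):
    """Map dBSequence_ref -> 'target', 'decoy', or 'mixed' from PeptideEvidence."""
    seen = {}
    for el in root.iter():
        if local_name(el.tag) != "PeptideEvidence":
            continue
        flag = parse_bool(el.get("isDecoy", "false"))  # absent treated as false
        seen.setdefault(el.get("dBSequence_ref"), set()).add(flag)
    status = {}
    for ref, flags in seen.items():
        if flags == {True}:
            status[ref] = "decoy"
        elif flags == {False}:
            status[ref] = "target"
        else:
            status[ref] = "mixed"  # both target and decoy evidence point here
    return status


status = decoy_status(ET.fromstring(SAMPLE))
```

A DBSequence with no PeptideEvidence pointing at it does not appear in the result at all, which is one reason this route is only "reliable" for proteins that actually have evidence.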
For the case of distinct decoy proteins, actually the current spec document, at least implicitly, by example, suggests different accessions:
<SearchDatabase location="/localdirectory/18.E_coli_K12_edit.fasta" id="K12_nosignal" name="K12"
                numDatabaseSequences="9376" releaseDate="01-2008-08-2008" version="1.0" >
  <FileFormat>
    <cvParam accession="MS:1001348" name="FASTA format" cvRef="PSI-MS"/>
  </FileFormat>
  <DatabaseName>
    <userParam name="18.E_coli_K12_edit.fasta" />
  </DatabaseName>
  <cvParam accession="MS:1001197" name="DB composition target+decoy" cvRef="PSI-MS"/>
  <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
  <cvParam accession="MS:1001195" name="decoy DB type reverse" cvRef="PSI-MS"/>
</SearchDatabase>
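For what it's worth, here is a sketch of how a reader could apply that regexp, assuming a cut-down, namespace-free fragment: the SearchDatabase and its "Rnd" value come from the example above, while the two DBSequence entries and the wrapping root element are invented for illustration. Using re.search rather than a full match is also an assumption, since the example does not say how the pattern should be anchored.

```python
import re
import xml.etree.ElementTree as ET

DECOY_REGEXP_ACCESSION = "MS:1001283"  # decoy DB accession regexp

# Hypothetical fragment; the DBSequence accessions are made up.
SAMPLE = """
<root>
  <SearchDatabase id="K12_nosignal">
    <cvParam accession="MS:1001283" name="decoy DB accession regexp" value="Rnd" cvRef="PSI-MS"/>
  </SearchDatabase>
  <DBSequence id="seq1" accession="P0A7G6" searchDatabase_ref="K12_nosignal"/>
  <DBSequence id="seq2" accession="Rnd1P0A7G6" searchDatabase_ref="K12_nosignal"/>
</root>
""".strip()


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def decoy_patterns(root):
    """Map SearchDatabase id -> compiled decoy regexp, where one is declared."""
    patterns = {}
    for db in root.iter():
        if local_name(db.tag) != "SearchDatabase":
            continue
        for child in db:
            if (local_name(child.tag) == "cvParam"
                    and child.get("accession") == DECOY_REGEXP_ACCESSION):
                patterns[db.get("id")] = re.compile(child.get("value"))
    return patterns


def flag_decoys(root):
    """Map DBSequence id -> True if its accession matches its database's regexp."""
    patterns = decoy_patterns(root)
    flags = {}
    for seq in root.iter():
        if local_name(seq.tag) != "DBSequence":
            continue
        pattern = patterns.get(seq.get("searchDatabase_ref"))
        accession = seq.get("accession") or ""
        flags[seq.get("id")] = bool(pattern and pattern.search(accession))
    return flags


flags = flag_decoys(ET.fromstring(SAMPLE))
```

Note that this only works when decoy accessions actually carry the marker; entries that share an accession with their target, as in the cross-linking example file, would all come out as non-decoy here.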
When parsing example file https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/crosslinking/xiFDR-CrossLinkExample.mzid, I find these 2 protein entries as DBSequence elements:
Although the protein entries are different (one is the decoy entry of the other), the accession attribute is the same. My question is: should the accession attribute be unique? The specification document says this about the accession:
This caused me a problem because I am collecting all proteins in a map in which the key is the accession.
What do you think?