Closed andrewrobertjones closed 8 years ago
Further note on this comment. In our software (ProteoAnnotator), we search a database concatenated from several sources. Placing a single cvParam for genome reference version on SearchDatabase does not allow us to tell which genome (version) each PeptideEvidence should be mapped against. Some options for how to solve this:
Other options?
@andrewrobertjones I would go for the first option if possible, because we can use the information for visualisation and it will help to reproduce the results. The second option would be difficult because users would need to add/submit/provide the corresponding external file, which would make this implementation hard to use in practice.
@ypriverol (comment from Andy): I agree that unambiguous encoding is definitely a good thing to aim for. However, the example we currently have in our software:
Doesn't actually contain enough information to know that this is Ensembl version 77, so a bespoke parser would be needed to export to proBED or proBAM anyway. In practice, we would need some extra info not currently in the file to know to map this specifically onto Ensembl.
I have written in the new specs: "These seven cv terms MUST be present on every PeptideEvidence, unless isDecoy=“true”, in which case they are optional."
@germa The validator would need updating to check this if everyone agrees?
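A minimal sketch of the kind of check the validator would need, written in Python rather than in the validator's own rule format. It assumes a namespace-free mzIdentML tree for brevity (real files declare a default namespace), and the list of required accessions is passed in by the caller, because the seven terms are not enumerated in this comment. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple

PROTEOGENOMICS_SEARCH = "MS:1002635"  # "proteogenomics search", as quoted in this thread


def is_proteogenomics_search(root: ET.Element) -> bool:
    """True if any SpectrumIdentificationProtocol carries the proteogenomics term."""
    return any(
        cv.get("accession") == PROTEOGENOMICS_SEARCH
        for protocol in root.iter("SpectrumIdentificationProtocol")
        for cv in protocol.iter("cvParam")
    )


def missing_peptide_evidence_terms(
    root: ET.Element, required: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Return (PeptideEvidence id, missing accessions) for each non-decoy entry.

    `required` is supplied by the caller; the real accessions come from the
    specification document and are NOT hard-coded here.
    """
    if not is_proteogenomics_search(root):
        return []  # the rule only applies to proteogenomics searches
    required_set = set(required)
    problems = []
    for pe in root.iter("PeptideEvidence"):
        # Terms are optional when isDecoy="true"
        if pe.get("isDecoy", "false").lower() == "true":
            continue
        present = {cv.get("accession") for cv in pe.findall("cvParam")}
        missing = sorted(required_set - present)
        if missing:
            problems.append((pe.get("id"), missing))
    return problems
```

Each returned pair names one non-decoy PeptideEvidence and the required accessions it lacks; an empty list means the rule is satisfied or does not apply.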
Further comment on CV terms:
Hi @andrewrobertjones, a quick question to help me understand: why do you have multiple values in the same CV term?
Hi @ypriverol, which CV term are you referring to?
Seems this issue was closed - I presume that was a mistake
@andrewrobertjones
...
Hi @ypriverol These are the start positions within exons. This peptide is mapped across a splice junction, so there is the start point within the first exon, then start position within exon 2. I have added the explanation of the encoding to the spec doc.
About the concatenated searches, this is not a case unique to proteogenomics. People can do the same elsewhere (e.g. for organisms that don't have a complete genome sequence, where sequences are merged together from different sources). I don't know if this case was explained in detail before in the specification document, but whatever is decided for proteogenomics should be applicable to these cases as well.
Do we need to consider at all that a zero-based coordinate system is used in BED and BAM files, but not in GFF3/GTF files? I guess it is good to assume that we will use a zero-based coordinate system by default. This should be written in the specification document.
@javizca I would vote for the coordinate system used in mzIdentML always being zero-based. If you are writing back to GFF afterwards, you would need to do a conversion. This seems safer than having two ways of doing it and then assuming that reading software will check this parameter.
These points were provisionally agreed on the last call. @germa would you mind updating the mapping and validator? Please email me if anything is unclear. @fghali please can you update our example file before the call tomorrow if possible.
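For illustration, the conversion mentioned above can be sketched as follows. This assumes the zero-based coordinates follow the BED convention (zero-based start, half-open end) and that the target is the GFF3/GTF convention (one-based, inclusive end). The thread does not state how end positions are handled in mzIdentML, so the half-open end is an assumption, and the function names are hypothetical.

```python
from typing import Tuple


def zero_based_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval (BED style) to 1-based, inclusive (GFF3/GTF style)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # Only the start shifts: the half-open end already equals the inclusive 1-based end.
    return start + 1, end


def one_based_to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive interval (GFF3/GTF style) to 0-based, half-open (BED style)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end
```

Under these assumptions the first ten bases of a chromosome are `(0, 10)` in BED-style coordinates and `(1, 10)` in GFF3-style coordinates.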
@fghali can you let us know when this has been checked versus example files, then close this issue
I'm getting these errors when trying to validate the file using: mzIdentMLValidator_GUI_v1.4.16-SNAPSHOT
Message 1: Rule ID: PeptideLevelStatsObjectRule Level: ERROR Context(/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem ) --> The SpectrumIdentificationItem (id='SIR_3007_SII_2') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem doesn't contain the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) required in case of peptide-level scoring Tip: Add the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) to each SpectrumIdentificationItem
Message 2: Rule ID: SpectrumIdentificationList_must_rule Level: ERROR Context(/cvParam/@accession ) in 2 locations --> None of the given CvTerms were found at '/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/cvParam/@accession' because no values were found:
Message 3: Rule ID: ProteinDetectionList_must_rule Level: ERROR Context(/cvParam/@accession ) in 2 locations --> None of the given CvTerms were found at '/MzIdentML/DataCollection/AnalysisData/ProteinDetectionList/cvParam/@accession' because no values were found:
@germa Can you investigate Message 1? I can't see anything obviously wrong with our file https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/ProteoAnnotator/ProteoAnnotator_1_2.mzid.gz
In terms of Message 2, we decided to roll back on this rule, so it is no longer required and can be safely deleted from the mapping file:
Rule ID: SpectrumIdentificationList_must_rule
@fghali Message 3 is something for you to fix. MS:1002404 needs to be on the ProteinDetectionList, with a value of the number of PAGs with passThreshold=true. This should be inserted by ProteoGrouper. If you need help with this part of (my) code, let me know
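A rough sketch of the fix for Message 3, in Python rather than in ProteoGrouper's own code. It assumes a namespace-free tree, and it assumes that "PAGs with passThreshold=true" means ProteinAmbiguityGroup elements containing at least one ProteinDetectionHypothesis with `passThreshold="true"`; that interpretation, the function names, and the `name` attribute written on the cvParam are assumptions that should be checked against the spec and the PSI-MS CV.

```python
import xml.etree.ElementTree as ET

PROTEIN_COUNT = "MS:1002404"  # accession named in this thread


def count_passing_pags(pdl: ET.Element) -> int:
    """Count ProteinAmbiguityGroups with at least one passing hypothesis (assumed meaning)."""
    return sum(
        1
        for pag in pdl.findall("ProteinAmbiguityGroup")
        if any(
            pdh.get("passThreshold", "false").lower() == "true"
            for pdh in pag.findall("ProteinDetectionHypothesis")
        )
    )


def set_protein_count(pdl: ET.Element) -> int:
    """Write the count onto the ProteinDetectionList, updating the cvParam if it exists."""
    count = count_passing_pags(pdl)
    for cv in pdl.findall("cvParam"):
        if cv.get("accession") == PROTEIN_COUNT:
            cv.set("value", str(count))
            return count
    # Appended after the ProteinAmbiguityGroup children of the list.
    ET.SubElement(
        pdl,
        "cvParam",
        {
            "cvRef": "PSI-MS",
            "accession": PROTEIN_COUNT,
            "name": "count of identified proteins",  # assumed term name; verify in the CV
            "value": str(count),
        },
    )
    return count
```

Calling `set_protein_count` on the parsed ProteinDetectionList element before serialising the file would add or refresh the term.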
Message 1: PeptideLevelStatsObjectRule. The reason is that the EBI OLS changed. I think the validation framework from the EBI, from which our validator inherits, has not yet been adapted to the new Ontology Lookup Service.
Thanks @germa, I had not realised about this. In the last 2 weeks we had to update some of the OLS related libraries that are used in our tools because of this. We need to investigate this in detail.
@germa we have implemented the new ols-client and ols-dialog, which you can use. They are here:
OLS-CLIENT: https://github.com/PRIDE-Utilities/ols-client
OLS-DIALOG: https://github.com/PRIDE-Toolsuite/ols-dialog
Hi,
This is also related to OLS. Max from PSI-MI is updating the ontology-manager to the new version of OLS because it is needed for IntAct. I think it is a dependency of the validator framework too. Previously it was hosted in the SourceForge SVN with other tools: https://sourceforge.net/p/psidev/svn/HEAD/tree/psi/tools/ontology-manager/ If you agree, we are thinking of moving it to GitHub under HUPO-PSI and sharing our changes there, because these are common libraries and tools for PSI. However, until now only the specification documents have been hosted in the organisation. What do you think?
I think it is perfectly fine
I am adding the proteogenomics encoding to the spec doc. I am opening this issue to confirm that the validator (@germa) checks that this term is present in SIProtocol:
<cvParam cvRef="PSI-MS" accession="MS:1002635" name="proteogenomics search"></cvParam>
The validator should then expect ALL of the following elements to be present on every PeptideEvidence:
SearchDatabase MUST have the genome reference version: