Proteogenomics encoding

andrewrobertjones commented 8 years ago

I am adding the proteogenomics encoding to the spec doc. I am opening this issue so that @germa can make sure the validator checks that this term is present in SIProtocol:

<cvParam cvRef="PSI-MS" accession="MS:1002635" name="proteogenomics search"></cvParam>

And then expects ALL the following elements to be present on every PeptideEvidence:

    <cvParam cvRef="PSI-MS" accession="MS:1002637" name="chromosome name" value="4"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002638" name="chromosome strand" value="+"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002639" name="peptide start on chromosome" value="73417647"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002640" name="peptide end on chromosome" value="73418129"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002641" name="peptide exon count" value="2"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002642" name="peptide exon nucleotide sizes" value="24,42"></cvParam>
    <cvParam cvRef="PSI-MS" accession="MS:1002643" name="peptide start positions on chromosome" value="73417647,73418087"></cvParam>

SearchDatabase MUST have the genome reference version:

  <SearchDatabase numDatabaseSequences="299106" location="PXD000764_34939_combined_concatenated_target_decoy.fasta" id="SearchDB_1">
    <FileFormat>
      <cvParam cvRef="PSI-MS" accession="MS:1001348" name="FASTA format"></cvParam>
    </FileFormat>
    <DatabaseName>
      <userParam name="PXD000764_34939_combined_concatenated_target_decoy.fasta"></userParam>
    </DatabaseName>
    <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
  </SearchDatabase>

andrewrobertjones commented 8 years ago

Further note on this comment. In our software (ProteoAnnotator), we search a database concatenated from several sources. Placing a single cvParam for genome reference version on SearchDatabase does not allow us to tell which genome (version) each PeptideEvidence should be mapped against. Some options for how to solve this:

1. Put genome reference version on every PeptideEvidence. Pros: unambiguous; Cons: verbose.
2. Put multiple genome versions in SearchDatabase, and allow human logic to figure out the mapping when running downstream software. Pros: concise; Cons: ambiguous.
3. Enforce that multiple SearchDatabase elements are present in the file, and that DBSequence elements must reference the correct source. Pros: concise, unambiguous; Cons: software may not easily be able to do this without adapting the file format converter.
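To make option 3 concrete, here is a minimal Python sketch of how a reader could resolve the genome reference version for each DBSequence through its searchDatabase_ref. The fragment is hypothetical: the ids, accessions and the second genome version are invented for illustration, and the namespace that a real mzIdentML file declares is left out for brevity.

```python
import xml.etree.ElementTree as ET

GENOME_REF = "MS:1002644"  # genome reference version

# Hypothetical fragment with two source databases (invented ids and values,
# no XML namespace, unlike a real mzIdentML file).
DOC = """
<MzIdentML>
  <SequenceCollection>
    <DBSequence id="dbseq_1" accession="prot_A" searchDatabase_ref="SearchDB_1"/>
    <DBSequence id="dbseq_2" accession="prot_B" searchDatabase_ref="SearchDB_2"/>
  </SequenceCollection>
  <DataCollection>
    <Inputs>
      <SearchDatabase id="SearchDB_1" location="a.fasta">
        <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="Homo_sapiens.GRCh38.77.gff3"/>
      </SearchDatabase>
      <SearchDatabase id="SearchDB_2" location="b.fasta">
        <cvParam cvRef="PSI-MS" accession="MS:1002644" name="genome reference version" value="example_annotation_v2.gff3"/>
      </SearchDatabase>
    </Inputs>
  </DataCollection>
</MzIdentML>
"""


def genome_version_by_dbsequence(root):
    """Map each DBSequence id to the genome version of its SearchDatabase."""
    versions = {}
    for db in root.iter("SearchDatabase"):
        for cv in db.findall("cvParam"):
            if cv.get("accession") == GENOME_REF:
                versions[db.get("id")] = cv.get("value")
    # A DBSequence whose database carries no genome version maps to None.
    return {
        seq.get("id"): versions.get(seq.get("searchDatabase_ref"))
        for seq in root.iter("DBSequence")
    }


mapping = genome_version_by_dbsequence(ET.fromstring(DOC))
print(mapping)
```

The lookup is only unambiguous if the writing software splits the concatenated database into one SearchDatabase per source, which is the cost noted for this option.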

Other options?

ypriverol commented 8 years ago

@andrewrobertjones I would go for the first option if possible, because we can use the information for visualisation and it will help to reproduce the results. The second option would be difficult because users would need to add/submit/provide the corresponding external file, which makes this implementation hard to use in practice.

@ypriverol (comment from Andy): I agree that unambiguous encoding is definitely a good thing to aim for. However, the example we currently have in our software:

Doesn't actually contain enough information to know that this is Ensembl version 77, so a bespoke parser would be needed to export to proBED or proBAM anyway. In practice, we would need some extra info not currently in the file to know to map this specifically onto Ensembl.

andrewrobertjones commented 8 years ago

I have written in the new specs: "These seven cv terms MUST be present on every PeptideEvidence, unless isDecoy=“true”, in which case they are optional."
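As a rough illustration of the rule quoted above, the following Python sketch reports which of the seven terms are missing from a PeptideEvidence and skips decoys. It is not the validator's implementation; the helper `make_evidence` and the element ids are hypothetical, and the elements are built without the mzIdentML namespace.

```python
import xml.etree.ElementTree as ET

# The seven terms listed at the top of this issue.
REQUIRED = {
    "MS:1002637": "chromosome name",
    "MS:1002638": "chromosome strand",
    "MS:1002639": "peptide start on chromosome",
    "MS:1002640": "peptide end on chromosome",
    "MS:1002641": "peptide exon count",
    "MS:1002642": "peptide exon nucleotide sizes",
    "MS:1002643": "peptide start positions on chromosome",
}


def missing_terms(peptide_evidence):
    """Return the required accessions absent from one PeptideEvidence."""
    # Decoys are exempt; xsd:boolean allows both "true" and "1".
    if peptide_evidence.get("isDecoy") in ("true", "1"):
        return []
    present = {cv.get("accession") for cv in peptide_evidence.findall("cvParam")}
    return sorted(set(REQUIRED) - present)


def make_evidence(pe_id, is_decoy, accessions):
    """Hypothetical helper that builds a test PeptideEvidence element."""
    pe = ET.Element("PeptideEvidence", id=pe_id, isDecoy=is_decoy)
    for acc in accessions:
        ET.SubElement(pe, "cvParam", cvRef="PSI-MS", accession=acc, name=REQUIRED[acc])
    return pe


complete = make_evidence("PE_1", "false", list(REQUIRED))
decoy = make_evidence("PE_2", "true", [])
partial = make_evidence("PE_3", "false", ["MS:1002637", "MS:1002638"])

for pe in (complete, decoy, partial):
    print(pe.get("id"), missing_terms(pe))
```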

@germa The validator would need updating to check this if everyone agrees?

andrewrobertjones commented 8 years ago

Further comment on CV terms:

...

If I understand our method correctly, the first value of "peptide start positions on chromosome" MUST be the same as "peptide start on chromosome", which is done for BED compatibility. However, given that this is completely redundant, I suggest we remove "peptide start on chromosome" and just rely on the first value of "peptide start positions on chromosome". Comments?
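The redundancy can be checked mechanically. This small Python sketch uses the two values from the example at the top of this issue; the function name is hypothetical.

```python
def first_start_matches(peptide_start, start_positions):
    """True if the first exon start equals "peptide start on chromosome"."""
    first = int(start_positions.split(",")[0])
    return first == int(peptide_start)


# Values of MS:1002639 and MS:1002643 from the example cvParams.
redundant = first_start_matches("73417647", "73417647,73418087")
print(redundant)
```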

ypriverol commented 8 years ago

Hi @andrewrobertjones, a quick question to help me understand: why do you have multiple values in the same CV term?

Hi @ypriverol, which CV term are you referring to?

andrewrobertjones commented 8 years ago

Seems this issue was closed - I presume that was a mistake

ypriverol commented 8 years ago

@andrewrobertjones

...

andrewrobertjones commented 8 years ago

Hi @ypriverol These are the start positions within exons. This peptide is mapped across a splice junction, so there is the start point within the first exon, then start position within exon 2. I have added the explanation of the encoding to the spec doc.
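To show how the per-exon values fit together, here is a short Python sketch that turns the exon count, sizes and start positions into one interval per exon. It assumes each interval is half-open, so that an exon ends at start + size; that convention is my assumption, although it is consistent with the example values, where the last exon ends at the stated "peptide end on chromosome".

```python
def exon_blocks(exon_count, sizes_csv, starts_csv):
    """Return one (start, end) pair per exon that the peptide maps to."""
    sizes = [int(s) for s in sizes_csv.split(",")]
    starts = [int(s) for s in starts_csv.split(",")]
    if not (len(sizes) == len(starts) == exon_count):
        raise ValueError("exon count does not match the number of sizes/starts")
    # Assumption: half-open intervals, end = start + size.
    return [(start, start + size) for start, size in zip(starts, sizes)]


# Values of MS:1002641, MS:1002642 and MS:1002643 from the example cvParams.
blocks = exon_blocks(2, "24,42", "73417647,73418087")
print(blocks)
```

With these values the two blocks cover 24 + 42 = 66 nucleotides, i.e. 22 codons split across the splice junction.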

javizca commented 8 years ago

About the concatenated searches, this is not a case unique to proteogenomics. People can do the same in other situations (e.g. for organisms that don't have a complete genome sequence, where sequences are merged together from different sources). I don't know if this case was explained in detail before in the specification document, but whatever is decided for proteogenomics should be applicable to these cases as well.

javizca commented 8 years ago

Do we need to consider at all that a zero-based coordinate system is used in BED and BAM files, but not in GFF3/GTF files? I guess it is good to assume that we will use a zero-based coordinate system by default. This should be written in the specification document.

andrewrobertjones commented 8 years ago

@javizca I would vote for the coordinate system used in mzIdentML always being zero-based. If you are writing back to GFF afterwards, you would need to do a conversion. This seems safer than having two ways of doing it and then assuming that reading software will check this parameter.
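The conversion mentioned here is small. A minimal Python sketch, assuming the zero-based coordinates are half-open as in BED (the thread only fixes "zero-based", so the half-open part is my assumption) and that GFF3 uses one-based, inclusive coordinates:

```python
def zero_based_to_gff(start, end):
    """Zero-based half-open [start, end) -> one-based inclusive [start, end]."""
    return start + 1, end


def gff_to_zero_based(start, end):
    """One-based inclusive [start, end] -> zero-based half-open [start, end)."""
    return start - 1, end


# First exon of the example peptide, under the half-open assumption.
gff_start, gff_end = zero_based_to_gff(73417647, 73417671)
print(gff_start, gff_end)
```

Only the start shifts by one; the end keeps the same number because the exclusive zero-based end and the inclusive one-based end coincide.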

andrewrobertjones commented 8 years ago

These points were provisionally agreed on last call. @germa would you mind updating the mapping and validator. Please email me if anything is unclear. @fghali please can you update our example file before the call tomorrow if possible.

Proteogenomics – Agreed to move chromosome name and strand to DBSequence (protein-level), and also to move genome reference version to DBSequence; remove “peptide start on chromosome” as redundant info, since it is not needed for the encoding.

andrewrobertjones commented 8 years ago

@fghali can you let us know when this has been checked against the example files, and then close this issue?

fawazghali commented 8 years ago

I'm getting these errors when trying to validate the file using: mzIdentMLValidator_GUI_v1.4.16-SNAPSHOT

` Message 1:
Rule ID: PeptideLevelStatsObjectRule
Level: ERROR
Context(/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem ) --> The SpectrumIdentificationItem (id='SIR_3007_SII_2') element at /MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem doesn't contain the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) required in case of peptide-level scoring
Tip: Add the triplet of terms MS:1002520 (peptide group ID), MS:1002500 (peptide passes threshold) and a child of MS:1002358 (search engine specific score for distinct peptides) to each SpectrumIdentificationItem

Message 2:
Rule ID: SpectrumIdentificationList_must_rule
Level: ERROR
Context(/cvParam/@accession ) in 2 locations --> None of the given CvTerms were found at '/MzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/cvParam/@accession' because no values were found:
Any children term of MS:1002438 (spectrum identification list result details). A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name.

Message 3:
Rule ID: ProteinDetectionList_must_rule
Level: ERROR
Context(/cvParam/@accession ) in 2 locations --> None of the given CvTerms were found at '/MzIdentML/DataCollection/AnalysisData/ProteinDetectionList/cvParam/@accession' because no values were found:
The sole term MS:1002404 (count of identified proteins) or any of its children. A single instance of this term can be specified. The matching value has to be the identifier of the term, not its name. `

andrewrobertjones commented 8 years ago

@germa Can you investigate Message 1? I can't spot anything obviously wrong with our file: https://github.com/HUPO-PSI/mzIdentML/blob/master/examples/1_2examples/ProteoAnnotator/ProteoAnnotator_1_2.mzid.gz

In terms of Message 2, we decided to roll back on this rule, so it is no longer required and can be safely deleted from the mapping file:

Rule ID: SpectrumIdentificationList_must_rule

@fghali Message 3 is something for you to fix. MS:1002404 needs to be on the ProteinDetectionList, with a value equal to the number of PAGs with passThreshold=true. This should be inserted by ProteoGrouper. If you need help with this part of (my) code, let me know.
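As a rough sketch of the fix for Message 3 (not ProteoGrouper's actual code), the Python below counts the passing PAGs and attaches MS:1002404 to the ProteinDetectionList. It rests on one reading of "PAGs with passThreshold=true": a ProteinAmbiguityGroup counts if at least one of its ProteinDetectionHypothesis children has passThreshold="true". That reading, the element ids and the omission of the mzIdentML namespace are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical list with three groups, two of which pass the threshold.
DOC = """
<ProteinDetectionList id="PDL_1">
  <ProteinAmbiguityGroup id="PAG_1">
    <ProteinDetectionHypothesis id="PDH_1" passThreshold="true"/>
    <ProteinDetectionHypothesis id="PDH_2" passThreshold="false"/>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG_2">
    <ProteinDetectionHypothesis id="PDH_3" passThreshold="false"/>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG_3">
    <ProteinDetectionHypothesis id="PDH_4" passThreshold="true"/>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
"""


def add_protein_count(pdl):
    """Append MS:1002404 with the number of passing groups; return the count."""
    count = sum(
        1
        for pag in pdl.findall("ProteinAmbiguityGroup")
        # Assumption: a group passes if any of its hypotheses passes.
        if any(
            pdh.get("passThreshold") == "true"
            for pdh in pag.findall("ProteinDetectionHypothesis")
        )
    )
    ET.SubElement(
        pdl,
        "cvParam",
        cvRef="PSI-MS",
        accession="MS:1002404",
        name="count of identified proteins",
        value=str(count),
    )
    return count


pdl = ET.fromstring(DOC)
count = add_protein_count(pdl)
print(count)
```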

germa commented 8 years ago

Message 1: PeptideLevelStatsObjectRule. The reason is that the EBI-OLS changed. I think the validation framework from the EBI, from which our validator inherits, has not yet been adapted to the new Ontology Lookup Service.

javizca commented 8 years ago

Thanks @germa, I had not realised this. In the last 2 weeks we had to update some of the OLS-related libraries that are used in our tools because of this. We need to investigate this in detail.

ypriverol commented 8 years ago

@germa we have implemented the new ols-client and ols-dialog, which you can use. They are here:

OLS-CLIENT https://github.com/PRIDE-Utilities/ols-client
OLS-DIALOG https://github.com/PRIDE-Toolsuite/ols-dialog

noedelta commented 8 years ago

Hi,

This is related to OLS too. Max from PSI-MI is updating the ontology-manager to the new version of OLS because it is needed for IntAct. I think it is a dependency of the validator framework too. Previously it was hosted in the SourceForge SVN with other tools: https://sourceforge.net/p/psidev/svn/HEAD/tree/psi/tools/ontology-manager/ We are thinking that, if you agree, it could be moved to GitHub under HUPO-PSI so that we can share our changes there, because these are common libraries and tools for PSI. However, until now only the specification documents have been hosted in the organisation. What do you think?

javizca commented 8 years ago

I think it is perfectly fine

HUPO-PSI / mzIdentML

Proteogenomics encoding #4