bigbio / proteomics-sample-metadata

The Proteomics Experimental Design file format: Standard for experimental design annotation
GNU General Public License v2.0

Encoding PTMs parameters into one-line Experimental Design #13

Closed ypriverol closed 4 years ago

ypriverol commented 5 years ago

@hbarsnes @mvaudel @StSchulze

We have continued working with the metadata experimental design.

See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format

However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet, and MaxQuant encode PTMs as single-line strings, which is great: we can encode variable PTMs as a string, and it will be easy to translate them into each engine's search strings.

MSGF+ :

StaticMod=C2H3N1O1,     C,  fix, any,       Carbamidomethyl       # Fixed Carbamidomethyl C (alkylation)
StaticMod=229.1629,     *,  fix, N-term,    TMT6plex
StaticMod=229.1629,     K,  fix, any,       TMT6plex

Comet:

variable_mod1 = 15.9949 M 0 3
variable_mod2 = 0.0 X 0 3
variable_mod3 = 0.0 X 0 3
variable_mod4 = 0.0 X 0 3
variable_mod5 = 0.0 X 0 3
variable_mod6 = 0.0 X 0 3

CRUX:

C+57.02146,2M+15.9949,1STY+79.966331
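To illustrate how a structured list of modifications could be translated into one of these search strings, here is a minimal Python sketch that rebuilds the CRUX-style line above. The function name `crux_style_mods` and the tuple layout are hypothetical, and the reading of the leading digit as a per-peptide maximum is an assumption inferred from the example, not taken from this thread.

```python
def crux_style_mods(mods):
    """Join (residues, delta_mass, max_per_peptide) tuples into one string.

    Assumption: an entry with max_per_peptide=None is written without a
    leading count, as in the "C+57.02146" item of the example line.
    """
    parts = []
    for residues, delta_mass, max_per_peptide in mods:
        prefix = "" if max_per_peptide is None else str(max_per_peptide)
        # The "+" format flag always writes the sign of the mass shift.
        parts.append(f"{prefix}{residues}{delta_mass:+}")
    return ",".join(parts)


example = [
    ("C", 57.02146, None),
    ("M", 15.9949, 2),
    ("STY", 79.966331, 1),
]
print(crux_style_mods(example))
```

The same list of tuples could feed writers for the other engines, which is the main appeal of keeping the modifications in one neutral representation.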

I think we can propose a way to encode these PTMs as a string within the metadata files.

Name; aminoacid; type; position; UnimodAccession

Where:

Name: name of the modification.
aminoacid: the amino acid.
Type: Fixed, Variable, or Custom.
Position: Any, N-term, or Protein N-term.
UnimodAccession: the Unimod accession.

The Unimod accession can be replaced with delta mass.
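A minimal Python sketch of how such a line could be parsed, assuming the five fields are separated by semicolons in the order given above. The `Modification` class, the `parse_modification` function, and the `UNIMOD:` prefix used to tell an accession apart from a delta mass are all assumptions made for illustration; the proposal itself does not fix them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Modification:
    name: str
    amino_acid: str
    mod_type: str          # Fixed, Variable, or Custom
    position: str          # Any, N-term, or Protein N-term
    unimod_accession: Optional[str] = None
    delta_mass: Optional[float] = None


def parse_modification(line: str) -> Modification:
    """Parse 'Name; aminoacid; type; position; UnimodAccession-or-mass'."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, amino_acid, mod_type, position, last = fields
    # Assumption: accessions are written with a "UNIMOD:" prefix;
    # anything else in the last field is read as a delta mass.
    if last.upper().startswith("UNIMOD:"):
        return Modification(name, amino_acid, mod_type, position,
                            unimod_accession=last)
    return Modification(name, amino_acid, mod_type, position,
                        delta_mass=float(last))


by_accession = parse_modification("Carbamidomethyl; C; Fixed; Any; UNIMOD:4")
by_mass = parse_modification("TMT6plex; K; Fixed; Any; 229.1629")
print(by_accession)
print(by_mass)
```

Keeping the accession and the delta mass in separate fields avoids guessing later whether a bare number was meant as a mass or as an identifier.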

trishorts commented 5 years ago

woops

RalfG commented 5 years ago

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name "Unknown modification" together with the mass shift and all possible amino acids.

For open modification search engines that search for a (very large) fixed list of modifications, this would work. But some open modification search engines do not have an a priori list of modifications to search for. For those search engines, it would be good to include an "any mass shift" or "open search" tag in the data analysis protocol.

mobiusklein commented 5 years ago

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We have already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

Glycoproteomics search engines do not use "site specific" databases, though should the repositories become complete enough, that'd be desirable. Most of them simply put every single glycan of the appropriate type at each site just like any other variable modification. PEFF has not yet standardized how to communicate the range of glycoforms expected at a specific site, simply that a site is glycosylated.

If including just a very long list of modifications is sufficient, then this should work for glycoproteomics too, provided we have an acceptable way to encode our glycans. If that defeats the purpose of this format, then glycoproteomics and those open modification search engines with a large database of modifications might both lack an appropriate way to be described.

jpfeuffer commented 5 years ago

Hi @ypriverol and others:

I was wondering how one would represent mutually exclusive modifications like SILAC modifications. Some search engines like Comet allow for a simultaneous search of such modifications (encoded in the "binary group" column of its parameters at the end of the page here). With other search engines you might need to search multiple times, each time with the same non-quantification modifications plus one of the quantification modifications in the group, and merge the results afterwards. I could imagine either introducing another key/value pair for such a "binary group" and/or allowing multiple rows for the same Run to represent different Samples.

Anyone thought about that already?

ypriverol commented 5 years ago

@jpfeuffer Can you propose how to encode that into a key=value representation?

jpfeuffer commented 5 years ago

Maybe an optional key "BG" for every modification with integer values representing the group of modifications that should be/were searched together in a binary (all-or-none) way. If this optional key is missing the modification is handled as usual (and considered on its own). You could adapt the description from the Comet page in your documentation.

If the searches were performed separately e.g. with another search engine, the user can still go for multiple rows I think, so no loss of generality here.
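As a rough Python sketch of the optional "BG" key, assuming each modification is written as semicolon-separated key=value pairs: the key names `name`, `aminoacid`, and `type`, and the function names, are hypothetical and chosen only for this example; only "BG" comes from the suggestion above.

```python
from collections import defaultdict


def parse_key_values(entry):
    """Turn 'key=value; key=value' into a dict, ignoring empty items."""
    items = (item.split("=", 1) for item in entry.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in items}


def group_by_binary_group(entries):
    """Split modifications into binary groups and independent ones.

    Modifications sharing a BG value are meant to be searched together
    (all-or-none); a modification without BG is handled on its own.
    """
    groups = defaultdict(list)
    independent = []
    for entry in entries:
        mod = parse_key_values(entry)
        if "BG" in mod:
            groups[int(mod["BG"])].append(mod)
        else:
            independent.append(mod)
    return dict(groups), independent


entries = [
    "name=Label:13C(6); aminoacid=K; type=Variable; BG=1",
    "name=Label:13C(6); aminoacid=R; type=Variable; BG=1",
    "name=Oxidation; aminoacid=M; type=Variable",
]
groups, independent = group_by_binary_group(entries)
print(groups)
print(independent)
```

Because the key is optional, files that never use "BG" would parse exactly as before, which matches the "no loss of generality" point.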

ypriverol commented 5 years ago

@jpfeuffer I was thinking that most search engines treat SILAC and multiplex modifications as variable modifications, and that this solves the binary group problem.

ypriverol commented 4 years ago

Thanks to all for your comments. I will close this issue because we now have a proposal: https://github.com/bigbio/proteomics-metadata-standard/tree/master/experimental-design#encoding-protein-modifications