bigbio / proteomics-sample-metadata

The Proteomics Experimental Design file format: Standard for experimental design annotation
GNU General Public License v2.0

Encoding PTMs parameters into one-line Experimental Design #13

Closed ypriverol closed 4 years ago

ypriverol commented 4 years ago

@hbarsnes @mvaudel @StSchulze

We have continued working with the metadata experimental design.

See example, https://github.com/PRIDE-Archive/pride-metadata-standard/tree/master/experimental-design#2-the-sample-and-data-relationship-format

However, if we want to encode search parameters, it would be great to encode PTMs and other search parameters as key-value pairs. I have seen that MSGF+, Comet, and MaxQuant encode PTMs as string lines, which is great, because we can encode the PTM variables as a string that will be easy to translate into the search strings.

MSGF+ :

StaticMod=C2H3N1O1,     C,  fix, any,       Carbamidomethyl       # Fixed Carbamidomethyl C (alkylation)
StaticMod=229.1629,     *,  fix, N-term,    TMT6plex
StaticMod=229.1629,     K,  fix, any,       TMT6plex

Comet:

variable_mod1 = 15.9949 M 0 3
variable_mod2 = 0.0 X 0 3
variable_mod3 = 0.0 X 0 3
variable_mod4 = 0.0 X 0 3
variable_mod5 = 0.0 X 0 3
variable_mod6 = 0.0 X 0 3

CRUX:

C+57.02146,2M+15.9949,1STY+79.966331

I think we can propose a way to encode these PTMs as a string within the metadata files.

Name ; aminoacid; type; position; UnimodAccession

Where:

  • Name: name of the modification.
  • aminoacid: the target amino acid.
  • type: Fixed, Variable, or Custom.
  • position: Any, N-term, or Protein N-term.
  • UnimodAccession: Unimod accession.

The Unimod accession can be replaced with delta mass.
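A minimal sketch of how a consumer might read the proposed one-line format and turn it into a search string. The `ModParam` class and both helper names are hypothetical, and the CRUX-style rendering is only modelled on the example string quoted above, not on the CRUX documentation.

```python
from dataclasses import dataclass


@dataclass
class ModParam:
    name: str
    amino_acid: str
    mod_type: str            # "fixed", "variable" or "custom"
    position: str            # "any", "n-term" or "protein n-term"
    accession_or_mass: str   # e.g. "UNIMOD:27", or a delta mass such as "15.9949"


def parse_mod_line(line: str) -> ModParam:
    """Parse 'Name; aminoacid; type; position; UnimodAccession' (hypothetical helper)."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name, amino_acid, mod_type, position, accession = fields
    return ModParam(name, amino_acid, mod_type.lower(), position.lower(), accession)


def to_crux_style(mod: ModParam, delta_mass: float, max_per_peptide: int = 1) -> str:
    """Render in the style of the CRUX example, e.g. 'C+57.02146' or '2M+15.9949'.

    Assumption: fixed modifications carry no count prefix, and variable ones
    are prefixed with the maximum number allowed per peptide.
    """
    prefix = "" if mod.mod_type == "fixed" else str(max_per_peptide)
    return f"{prefix}{mod.amino_acid}{delta_mass:+}"
```

The delta mass is passed in explicitly here because looking it up from the Unimod accession is outside the scope of the sketch.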

prvst commented 4 years ago

So you are basically talking about how the modifications are declared inside the parameter files, not about how they are represented inside the search results, right? In this case I think that the proposal should be more human-readable than machine-friendly. Parameter files are often shared with people who are not entirely familiar with proteomics, or are sometimes even used as documentation of an analysis. A format like the one used by Comet, variable_mod1 = 15.9949 M 0 3, might be easy for software to consume, but is nearly impossible to interpret for a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal; it makes it easier to spot errors and to differentiate isobaric PTMs.

mlocardpaulet commented 4 years ago

Veit (and others) worked on this at proteoform level. But I think you could find their strategy interesting: LeDuc, R. D., Schwämmle, V., Shortreed, M. R., Cesnik, A. J., Solntsev, S. K., Shaw, J. B., … Tsybin, Y. O. (2018). ProForma: A Standard Proteoform Notation. Journal of Proteome Research, 17(3), 1321–1325. https://doi.org/10.1021/acs.jproteome.7b00851

ypriverol commented 4 years ago

Thanks, @mlocardpaulet, for the reference. We are talking more about PTMs as search parameters. To represent PTMs in results, we have good references (as you said) with ProForma, mzTab, and others.

The problem we want to solve is that if we annotate a PRIDE or ProteomeXchange experiment, we should annotate some parameters from the search in order to allow external tools like SearchGUI and others to reanalyze the data. This issue is about how to encode PTMs as search parameters.

RalfG commented 4 years ago

We often do something like this, in a JSON structure:

    "modifications":[
        {"name":"Glu->pyro-Glu", "unimod_accession":27, "mass_shift":-18.0153, "amino_acid":"E", "n_term":true, "fixed":false},
        {"name":"Gln->pyro-Glu", "unimod_accession":28, "mass_shift":-17.0305, "amino_acid":"Q", "n_term":true, "fixed":false},
        {"name":"Acetyl", "unimod_accession":1, "mass_shift":42.0367, "amino_acid":null, "n_term":true, "fixed":false},
        {"name":"Oxidation", "unimod_accession":35, "mass_shift":15.9994, "amino_acid":"M", "n_term":false, "fixed":false},
        {"name":"Carbamidomethyl", "unimod_accession":4, "mass_shift":57.0513, "amino_acid":"C", "n_term":false, "fixed":true}
    ],
ypriverol commented 4 years ago

Thanks @RalfG In the current proposal

Name ; aminoacid; type; position; UnimodAccession

Would it be feasible to represent your JSON? We first want to have a tab-delimited representation to align more closely with the experimental design, but in the future, YES, we will also serialize to JSON.

In my proposal your first modification will be like:

Glu->pyro-Glu; E; fixed;  N-term; UNIMOD:27

The only thing missing is the mass shift, I didn't include it because it can be retrieved from the UNIMOD accession. However, I agree we can be more explicit using the mass shift.
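As a rough illustration, a record from the JSON structure above could be flattened into the one-line form like this. The function name is hypothetical, and using `X` for a null `amino_acid` is an assumption made for the sketch, not something agreed in this thread.

```python
import json


def json_mod_to_line(mod: dict) -> str:
    """Flatten one JSON modification record into 'Name; aminoacid; type; position; accession'."""
    # Assumption: a null amino acid means "any residue", written here as X.
    target = mod["amino_acid"] or "X"
    mod_type = "fixed" if mod["fixed"] else "variable"
    position = "N-term" if mod["n_term"] else "Any"
    accession = f"UNIMOD:{mod['unimod_accession']}"
    return "; ".join([mod["name"], target, mod_type, position, accession])


record = json.loads(
    '{"name":"Glu->pyro-Glu", "unimod_accession":27, "mass_shift":-18.0153,'
    ' "amino_acid":"E", "n_term":true, "fixed":false}'
)
line = json_mod_to_line(record)
```

Because the JSON record carries `"fixed": false`, this sketch emits `variable` in the type field for that record.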

mvaudel commented 4 years ago

A few points you might want to consider:

ypriverol commented 4 years ago

@mvaudel:

Here my comments.

A few points you might want to consider:

  • The target can be a single amino acid or an amino acid pattern (like in glyco). This can be encoded as a simple regular expression.

I like this idea. We should accept the pattern. However, what is the best way to encode a pattern in a standardized way? I can see a lot of software and users writing their own pattern rules that are difficult to translate into a specific language. I found a link about how to standardize regular expressions: https://www.regular-expressions.info/refflavors.html. Probably a good place to start.

  • The terminus can be peptide or protein.

I think this is really common now, if we use unimod definitions will be:

  • I strongly recommend not to use the rounded mass, and rather stick to the atomic composition. I would make the atomic composition mandatory.

We can explicitly ask for the atomic composition; however, MOST of the search engines and tools currently use the mass_shift. In addition, if we go for the tab-delimited, user-friendly option, the mass shift is easier to get than the atomic composition. I really think we should not add a lot of details if the UNIMOD accession is known. If the Unimod accession is not known, then the composition could be the name of the modification?

  • If you are aiming for a format like mzIdentML, generated by software for software, user friendliness is not that much of an issue, we should rather focus on ease and speed of parsing?

We are aiming for a tab-delimited format that is easy to produce by software but also easy for submitters to produce and read manually, and that can be enriched by our submission tools. For example, a user should be able to specify a fixed modification like this:

Glu->pyro-Glu; E; fixed;  N-term; UNIMOD:27

You should be able with searchGUI to pick from there and go on with the reanalysis.

ypriverol commented 4 years ago

Hi @prvst

So you are basically talking about how the modifications are declared inside the parameter files, not about how they are represented inside the search results, right?

Yes, you should be able to go from here to an MSFragger parameter file and perform a reanalysis of your dataset.

In this case, I think that the proposal should be more human-readable than machine-friendly.

Agreed, but we should require enough information to enable the machines to enrich the files and perform the reanalysis. For example, if the UNIMOD id is provided, we don't need to add some of the fields, like composition; see my point with @mvaudel.

Parameter files are often shared with people who are not entirely familiar with proteomics, or are sometimes even used as documentation of an analysis. A format like the one used by Comet, variable_mod1 = 15.9949 M 0 3, might be easy for software to consume, but is nearly impossible to interpret for a person who doesn't know the documentation. I also think that the name of the modification should be included in the proposal; it makes it easier to spot errors and to differentiate isobaric PTMs.

Agreed, @prvst, this is why we are adding some words rather than binary 0/1 values.

ypriverol commented 4 years ago

@mvaudel about the regular expression, probably this is the standard:

http://pubs.opengroup.org/onlinepubs/9699919799/

trishorts commented 4 years ago

For MetaMorpheus, we got our start using the UniProt ptmlist and just retained that format. There is a key-value pair system that is pretty easy to interpret. There are mandatory fields and bonus fields. For example, we add diagnostic ions and neutral losses (dependent on fragmentation type). We also have a field that carries equivalent accession numbers for the same mod in different database systems.

Here is an example:

ID   Phosphorylation
TG   S or T
PP   Anywhere.
NL   HCD:H0 or HCD:H3 O4 P1
MT   Common Biological
CF   H1 O3 P1
DR   Unimod; 21.
//

However, we use a .toml file for search settings, which is probably where the data you're mining would come from. In that file for mods, we have only PTM name and target motif. That combination is required to be unique for us.

ypriverol commented 4 years ago

@trishorts, can you provide me with a .toml file?

trishorts commented 4 years ago

sure thing. Let me create one with interesting PTMs.

ypriverol commented 4 years ago

Following @trishorts idea of key=value pairs for each property, we can update my first proposal:

Name ; aminoacid; type; position; UnimodAccession

example:

Glu->pyro-Glu; E; fixed; N-term; UNIMOD:27

We can improve it using the key=value structure:

ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)

With this approach, we can control the key names (ID, TG, TY, TP, PP, UA, CF, ...) and extend them in the specification. This will cover the use case from @mvaudel of adding the composition (CF). Also, the order of the properties does not matter because it is controlled by the key.

The downside of this approach is that it is less human-readable.
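A small sketch of how such a key=value cell could be parsed, assuming `;` separates pairs and `=` separates key from value as in the example above. The function name is hypothetical, and rejecting duplicate keys is an assumption of the sketch rather than part of the proposal.

```python
def parse_kv_mod(cell: str) -> dict:
    """Parse 'ID=...; TG=...; TP=...' into a dict keyed by the upper-cased key."""
    result = {}
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing separator
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {part!r}")
        key = key.strip().upper()
        if key in result:
            raise ValueError(f"duplicate key {key!r}")
        result[key] = value.strip()
    return result


mod = parse_kv_mod(
    "ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)"
)
```

Since the result is a dictionary, the order of the pairs in the cell does not affect what the consumer sees.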

@mvaudel @trishorts @prvst @RalfG opinions welcome.

BTW @trishorts, what does TG mean?

mvaudel commented 4 years ago

Sounds really nice, having explicit labels increases readability and flexibility. Indeed neutral losses and reporter ions are needed, thanks for putting this up. Here again, we use atomic composition and never rounded mass ;)

ypriverol commented 4 years ago

@mwalzer actually highlighted that we need to define what is optional and mandatory to be able to define a modification parameter.

I think the only mandatory value would be the name (ID), because with the name phosphorylation we can guess most of the other fields.

mwalzer commented 4 years ago

I like the key=value idea. So would it be that a key can occur multiple times and be interpreted as a virtual list? I don't much like the use of different separator characters.

Two potential issues that I see in general are:

mvaudel commented 4 years ago

My personal experience is - don't rely on Unimod.

trishorts commented 4 years ago

Here are the key value pairs that MetaMorpheus uses with brief explanation

An accession number is frequently supplied by the primary databases (e.g. UniProt and Unimod).

This is the chemical formula of the added or removed atoms. This is required but the mass shift used is specified by MM. The particular isotope of the element can be specified in curly braces following the element name. For example, carbon-13 is written as C{13} in the chemical formula. The number of atoms is specified after the closing brace. Five carbon-13 atoms is written as C{13}5.

Certain PTMs (e.g. acetylation or glycosylation) produce small diagnostic fragment ions that can be detected in MS/MS spectra. These ions can serve as useful indicators of the presence of the corresponding PTM. This feature is currently disabled.

Used in the UniProt ptmlist but not needed for custom mods in MetaMorpheus

This is the text used to describe the modification in the output.

The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This will override the monoisotopic mass described in the chemical formula because there are cases where the mass of the mod and the mass shift from the mod are different (e.g. trimethylation has mass of 43 but mass shift from trimethylation is 42).

This specifies which modification group the modification should be included with. Existing modification types are described here. The user is free to designate their own type, which creates a separate list.

Certain PTMs (e.g. phosphorylation) have labile modifications that can be lost during ionization. The peptide parent mass in MS1 may be seen with or without the modification. Specifying neutral loss tells MetaMorpheus to take this phenomenon into account.

Choose from the following options: Anywhere.; Peptide N-terminal.; N-terminal.; Peptide C-terminal. DON'T FORGET THE '.'

Amino acid letter code capitalized or written out. Multiple targets separated by " or ". The capital letter 'X' may be used to mean any amino acid.
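The isotope notation described for the chemical formula above (for example C{13}5) could be read along these lines. This is only a sketch: the function names and the small mass table are illustrative assumptions, not MetaMorpheus code, and Unimod-style compositions such as H(-2)O(-1) are not handled.

```python
import re

# One token: element symbol, optional isotope in curly braces, optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(?:\{(\d+)\})?(-?\d+)?")

# Monoisotopic masses for a few atoms, keyed by (element, isotope or None).
MASSES = {
    ("H", None): 1.00782503207,
    ("C", None): 12.0,
    ("C", 13): 13.0033548378,
    ("N", None): 14.0030740048,
    ("O", None): 15.99491461956,
    ("P", None): 30.97376163,
}


def parse_formula(formula: str) -> dict:
    """Return {(element, isotope): count} for strings like 'H1 O3 P1' or 'C{13}5'."""
    compact = "".join(formula.split())
    counts = {}
    pos = 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        element, isotope, count = match.groups()
        key = (element, int(isotope) if isotope else None)
        counts[key] = counts.get(key, 0) + (int(count) if count else 1)
        pos = match.end()
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum the atom masses; raises KeyError for atoms missing from MASSES."""
    return sum(MASSES[atom] * n for atom, n in parse_formula(formula).items())
```

With this table, `monoisotopic_mass("H1 O3 P1")` comes out at about 79.96633, in line with the phosphorylation delta mass quoted in the CRUX example earlier in the thread.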

ypriverol commented 4 years ago

@mwalzer some comments here:

I like the key=value idea. So would it be that a key can occur multiple times and be interpreted as a virtual list? I don't much like the use of different separator characters.

I don't see how, in this particular case, we can have more than one value for one particular key. That would be a different modification.

The idea would be:

	comment [modification parameters]	comment [modification parameters]
sample 1	ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)	ID=Oxidation; TG=M
sample 2	ID=Glu->pyro-Glu; TG=E; TP=fixed; PP=Anywhere; UA=Unimod:27; CF=H(-2)O(-1)	ID=Oxidation; TG=M

Two potential issues that I see in general are:

  • how should a consumer interpret a metadata file with such PTM encoding when some keys are (because optional) missing

Actually, this is a great point. The consumers of the metadata can make decisions depending on which data is missing. For example, in PRIDE we will implement a system that annotates as many of these values as possible; but if the user submits only the name, we can suggest the possible Unimod modifications to the user.

  • and how to cope with conflicting information, say for example the unimod has different positions in store as given via the encoding

This is up to the system or software consumer to decide. For example, we have a library that, if a delta mass plus the name of the modification is provided and it matches exactly one UNIMOD modification, can suggest that modification.
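The matching idea could look roughly like the following. This is not the library mentioned above. The four-entry table stands in for a real Unimod lookup, and the function name and the tolerance are assumptions made for the sketch.

```python
# Tiny stand-in for a Unimod lookup table (accession, name, monoisotopic mass).
UNIMOD_SUBSET = [
    {"accession": "UNIMOD:1", "name": "Acetyl", "mono_mass": 42.010565},
    {"accession": "UNIMOD:4", "name": "Carbamidomethyl", "mono_mass": 57.021464},
    {"accession": "UNIMOD:21", "name": "Phospho", "mono_mass": 79.966331},
    {"accession": "UNIMOD:35", "name": "Oxidation", "mono_mass": 15.994915},
]


def suggest_accession(name=None, delta_mass=None, tolerance=0.001):
    """Return an accession only when the given hints match exactly one entry."""
    candidates = UNIMOD_SUBSET
    if name is not None:
        candidates = [c for c in candidates if c["name"].lower() == name.lower()]
    if delta_mass is not None:
        candidates = [
            c for c in candidates if abs(c["mono_mass"] - delta_mass) <= tolerance
        ]
    if len(candidates) == 1:
        return candidates[0]["accession"]
    return None  # ambiguous or conflicting: leave the decision to the consumer
```

Returning `None` for conflicting input (for example the name Oxidation with a delta mass of 57.02) keeps the decision with the consuming software, as described above.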

trishorts commented 4 years ago

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

ypriverol commented 4 years ago

Agree.

We've found that you have to be very careful with "separators". Places like Unimod can be very sloppy. So you end up with a modification name that contains a comma or a semicolon and your whole reader goes splat.

I checked before making the proposal, and ; is not included in any Interim Name in Unimod, so we are probably fine. But if the user uses the description, then we can have some conflicts (e.g. Loss of O; nitro photochemical decomposition).

mvaudel commented 4 years ago

Can we not just use quotes for all values?

ypriverol commented 4 years ago

@mvaudel the problem is that users then need to know that they have to type quotes. I don't think researchers are used to writing Excel/tab-delimited files with quoted values. We can specify that if ; is present, then the value should be quoted.

But I really don't think ; would be common for any of the values.
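A sketch of the "quote only when needed" rule: values may optionally be wrapped in double quotes, and a quoted value may then contain `;`. The pattern and the function name are illustrative assumptions. Escaped quotes inside a value are not supported.

```python
import re

# key = "quoted value"  or  key = bare value, followed by ';' or end of string
PAIR = re.compile(r'\s*(\w+)\s*=\s*(?:"([^"]*)"|([^;]*?))\s*(?:;|$)')


def parse_quoted_kv(cell: str) -> dict:
    """Parse key=value pairs where a value containing ';' is double-quoted."""
    cell = cell.strip()
    result = {}
    pos = 0
    while pos < len(cell):
        match = PAIR.match(cell, pos)
        if match is None:
            raise ValueError(f"cannot parse {cell!r} at position {pos}")
        key, quoted, bare = match.groups()
        result[key] = quoted if quoted is not None else bare
        pos = match.end()
    return result


mod = parse_quoted_kv('ID="Loss of O; nitro photochemical decomposition"; TG=Y')
```

Unquoted cells such as `ID=Oxidation; TG=M` parse the same way, so submitters would only need quotes in the rare case discussed above.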

lgatto commented 4 years ago

Some thoughts we had as part of our R/Bioconductor projects to represent PTMs in general (so not specifically for parameter files). See this issue for details, but in brief, a PTM can be defined by what modification is described, how many there are (n), and its position:

Some examples:

> Modification("Acetyl:N-term")
Acetyl:N-term is positional.
> Modification("Acetyl:C-term", n = 2L)
Acetyl:C-term is fixed, positional.
> Modification("Acetyl:K", n = 0:1)
Acetyl:K is variable.
> Modification("Acetyl:C-term", n = 0:1)
Acetyl:C-term is variable, positional.
> Modification("Methyl:S", pos = 2L, n = 0:3)
Methyl:S is positional.
> Modification("Methyl:S", pos = 2L, n = 1L)
Methyl:S is fixed, positional.
> Modification("Methyl:S", pos = 2L, n = 0:1)
Methyl:S is variable, positional.

Not sure this fits what you are looking for, but I imagine that on our side we would be keen to have readers/writers from our object model to your format.

ypriverol commented 4 years ago

Thanks, @lgatto for your comments.

trishorts commented 4 years ago

It did just occur to me that we search PTMs encoded at the specific protein level as well as the general level. As @lgatto just says, we have a "variable" set, which is usually methionine oxidation; we have a "fixed" set, which is usually carbamidomethylation or maybe a label (e.g. TMT); and we have the specific protein level. What do I mean by that? Well, UniProt has annotated PTMs on many of its proteins. We read in all of those annotations from a version of their database that is formatted as .xml. Then we look for all those specific PTMs only on those peptides where they are annotated. So, our searches can have many dozens of different PTMs. But they are not looked for everywhere. So, we don't specify them in the search parameters. In fact, it might be confusing to list them all in the search parameters because they are limited to specific proteins.

trishorts commented 4 years ago

I did a quick search on a couple files and here is an example of the variety of mods seen:

Localized mods seen below q-value 0.01:

Carbamidomethyl on C 1304
Oxidation on M 872
Deamidation on N 191
Water Loss on E 121
Hydroxylation on P 118
Hydroxylation on K 100
Hydroxylation on N 80
Acetylation on X 54
Ammonia loss on C 51
Ammonia loss on N 49
Deamidation on Q 34
Phosphorylation on S 31
Carbamyl on X 29
Nitrosylation on Y 27
Methylation on K 16
Formylation on K 10
Carboxylation on E 8
Carbamyl on M 6
Acetylation on K 6
Citrullination on R 5
Methylation on R 4
Carbamyl on K 3
Dimethylation on R 3
Phosphorylation on T 3
Carbamyl on C 3
Phosphorylation on Y 3
Carboxylation on D 2
Pyridoxal phosphate on K 2
Butyrylation on K 2
Hydroxybutyrylation on K 1
Glutarylation on K 1
Nitrosylation on C 1
Trimethylation on K 1
Carboxylation on K 1
Glu to PyroGlu on Q 1
ADP-ribosylation on S 1
Dimethylation on K 1

trishorts commented 4 years ago

This is a list of what was included in the original search:

All mods in database limited to peptides observed in the results:

Hydroxylation on K 495
Deamidation on N 372
Hydroxylation on P 320
Hydroxylation on N 241
Deamidation on Q 239
Citrullination on R 136
Ammonia loss on N 132
Water Loss on E 126
Acetylation on K 124
Methylation on K 95
Phosphorylation on S 91
Acetylation on X 65
Formylation on K 61
Ammonia loss on C 53
Carbamyl on K 48
Carbamyl on X 47
Methylation on R 43
Phosphorylation on T 37
Carboxylation on E 37
Nitrosylation on Y 33
Phosphorylation on Y 18
Dimethylation on K 16
Carbamyl on R 15
Carboxylation on D 15
Sulfonation on Y 14
Dimethylation on R 13
Carboxylation on K 13
Carbamyl on C 11
Carbamyl on M 10
Trimethylation on K 6
Pyridoxal phosphate on K 6
Glutarylation on K 5
Butyrylation on K 5
Hydroxybutyrylation on K 4
Nitrosylation on C 4
ADP-ribosylation on S 3
HexNAc on T 2
HexNAc on S 2
Malonylation on K 2
Glu to PyroGlu on Q 2

trishorts commented 4 years ago

And the .toml that didn't have any of them.

TaskType = "Search"

[SearchParameters]
DisposeOfFileWhenDone = true
DoParsimony = false
ModPeptidesAreDifferent = false
NoOneHitWonders = false
MatchBetweenRuns = false
Normalize = false
QuantifyPpmTol = 5.0
DoHistogramAnalysis = false
SearchTarget = true
DecoyType = "Reverse"
MassDiffAcceptorType = "OneMM"
WritePrunedDatabase = false
KeepAllUniprotMods = true
DoLocalizationAnalysis = true
DoQuantification = false
SearchType = "Classic"
LocalFdrCategories = ["FullySpecific"]
MaxFragmentSize = 30000.0
HistogramBinTolInDaltons = 0.003
MaximumMassThatFragmentIonScoreIsDoubled = 0.0
WriteMzId = true
WritePepXml = false
WriteDecoys = true
WriteContaminants = true

[SearchParameters.ModsToWriteSelection]
'N-linked glycosylation' = 3
'O-linked glycosylation' = 3
'Other glycosylation' = 3
'Common Biological' = 3
'Less Common' = 3
Metal = 3
'2+ nucleotide substitution' = 3
'1 nucleotide substitution' = 3
UniProt = 2

[CommonParameters]
TaskDescriptor = "SearchTask"
MaxThreadsToUsePerFile = 27
ListOfModsFixed = "Common Fixed\tCarbamidomethyl on C\t\tCommon Fixed\tCarbamidomethyl on U"
ListOfModsVariable = "Common Variable\tOxidation on M"
DoPrecursorDeconvolution = true
UseProvidedPrecursorInfo = true
DeconvolutionIntensityRatio = 3.0
DeconvolutionMaxAssumedChargeState = 12
DeconvolutionMassTolerance = "±4.0000 PPM"
TotalPartitions = 1
ProductMassTolerance = "±20.0000 PPM"
PrecursorMassTolerance = "±5.0000 PPM"
AddCompIons = false
ScoreCutoff = 5.0
ReportAllAmbiguity = true
NumberOfPeaksToKeepPerWindow = 200
MinimumAllowedIntensityRatioToBasePeak = 0.01
NormalizePeaksAccrossAllWindows = false
TrimMs1Peaks = false
TrimMsMsPeaks = true
UseDeltaScore = false
CalculateEValue = false
QValueOutputFilter = 1.0
CustomIons = []
AssumeOrphanPeaksAreZ1Fragments = true
MaxHeterozygousVariants = 4
MinVariantDepth = 1
DissociationType = "HCD"
ChildScanDissociationType = "Unknown"

[CommonParameters.DigestionParams]
MaxMissedCleavages = 2
InitiatorMethionineBehavior = "Variable"
MinPeptideLength = 7
MaxPeptideLength = 2147483647
MaxModificationIsoforms = 1024
MaxModsForPeptide = 2
Protease = "trypsin"
SearchModeType = "Full"
FragmentationTerminus = "Both"
SpecificProtease = "trypsin"
GeneratehUnlabeledProteinsForSilac = true
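For mining the modifications out of such a file, the ListOfModsFixed and ListOfModsVariable strings could be split as sketched below. The layout (entries separated by a double tab, each entry being "group, tab, name on target") is inferred only from the two values shown above, so treat it as an assumption. The function name and the output keys are likewise illustrative.

```python
def parse_mm_mod_list(raw: str, mod_kind: str) -> list:
    """Split a MetaMorpheus-style mod list string into key=value style records."""
    mods = []
    for entry in raw.split("\t\t"):
        if not entry:
            continue
        group, label = entry.split("\t")
        name, sep, target = label.rpartition(" on ")
        if not sep:
            raise ValueError(f"expected '<name> on <target>', got {label!r}")
        mods.append({"ID": name, "TG": target, "TP": mod_kind, "MT": group})
    return mods


fixed = parse_mm_mod_list(
    "Common Fixed\tCarbamidomethyl on C\t\tCommon Fixed\tCarbamidomethyl on U",
    "fixed",
)
variable = parse_mm_mod_list("Common Variable\tOxidation on M", "variable")
```

This matches the earlier remark that the .toml carries only the PTM name and the target motif, so every other field would have to be enriched from a modification database.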

ypriverol commented 4 years ago

We can't capture this complexity, and this is mainly what PEFF and UniProt annotations will be providing for future algorithms. I don't think we need to capture that part of the modifications. In the same way, we will be capturing fragment/precursor tolerances, but we know that most of the search engines do two-step searches where they can refine the parameters.

It did just occur to me that we search PTMs encoded at the specific protein level as well as the general level. As @lgatto just says, we have a "variable" set, which is usually methionine oxidation; we have a "fixed" set, which is usually carbamidomethylation or maybe a label (e.g. TMT); and we have the specific protein level. What do I mean by that? Well, UniProt has annotated PTMs on many of its proteins. We read in all of those annotations from a version of their database that is formatted as .xml. Then we look for all those specific PTMs only on those peptides where they are annotated. So, our searches can have many dozens of different PTMs. But they are not looked for everywhere. So, we don't specify them in the search parameters. In fact, it might be confusing to list them all in the search parameters because they are limited to specific proteins.

ypriverol commented 4 years ago

Hi all, taking into account the comments from @prvst @RalfG @mlocardpaulet @trishorts @mwalzer @lgatto and all the comments in this issue, I have made a first proposal, https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15, to encode PTM parameters into the tab-delimited file (Experimental Design). Have a look at the PR and comment there on the first prototype. We can elaborate on more issues and discuss more complex topics after we get the first schema moving...

Thanks a lot again !!!!

StSchulze commented 4 years ago

I know I'm coming in late to the discussion, but I agree that the key=value pair is a good solution. And I just wanted to add that, in my opinion and as @mvaudel said, chemical compositions should be required (or at least favored) instead of the monoisotopic mass, especially for mods that are defined by the user and not included in Unimod. Otherwise it would be impossible to get the composition (which is required for some engines).

Also, chemical compositions should be in a format that allows to define the isotope, since some modifications include specific isotopes.

trishorts commented 4 years ago

Yes. I think that chemical compositions can be very helpful. I prefer them myself. But there are tricky cases. Either trimethyl or acetyl (can't remember which) where the monoisotopic mass change does not match the chemical formula.

Currently MetaMorpheus requires that either the mass or the chemical formula be provided. One is given precedence over the other if both are supplied.

TMT is a weird case where labels are "nominally" isobaric. There, mass is a good way to split the difference because there is no one absolute chemical formula that covers all cases.

RonBeavis commented 4 years ago

Regarding PTM specification in particular, I would like to support the idea that some type of motif/regular expression specification is necessary to properly instruct a search algorithm.

A very simple protein N-terminal processing can result in the loss of the initiator methionine, exposing the next residue as the mature N-terminus, which may in turn be acetylated. But only some residues can be acetylated (e.g., M, S, T, G, ...) while many others (I, L, F, Y, W) cannot. Therefore, a peptide thought to contain a protein's N-terminal residue should only be tested for acetylation if ^[GASTMCDN] is true for the peptide sequence.

Another very common case is for tryptic peptides modified at R and K residues. For example, in an experiment meant to detect lysine acetylation, testing for acetylation in the peptide

DVTTGYDSSQPNK

is wrong, as blocking the lysine group with acetyl means it cannot be cleaved by trypsin. Similarly, the peptide

DVTTGYDSSQPNKK

should only be tested for acetylation at the second last residue. The same holds true for many lysine modifications, e.g. methylation, ubiquitination, sumoylation, etc. Arginine has the same issue: most R modifications block the R's sidechain in such a way that it cannot be present at the C-terminus of a tryptic peptide. Failure to test for this condition has resulted in the incorrect assignment of many PTM sites in the literature.

A very common modification that is often missed in proteomics results is the presence of hydroxyproline and hydroxylysine in collagens. Because collagens are the most abundant proteins in many tissues, this can mean leaving 5-10% of possible PSMs off the list and severely underestimating the collagen content of a sample. The most efficient way to solve this problem is to specify P+oxide as a variable modification when 'G.PG' is true, and similarly K+oxide when 'G.KG' is true. Using the general variable modifications (P+oxide, K+oxide) isn't practical (it makes a mess of the results). Similarly, it is also helpful (and efficient) when detecting the N deamidation caused by removal of an N-linked glycosylation to only check for deamidations at asparagines that satisfy N[^P][ST].

I also use the approach of using a list of possible modifications for each specific protein, both as variable modifications and as site specific modifications. Additionally, I normally test for a list of SAVs, specified in protein coordinates, whenever it is available for a species.
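The motif conditions described in this comment can be written as small checks. The sketch below is illustrative only: the function names are hypothetical, positions are zero-based, and the lysine rule assumes the peptide's C-terminal K comes from tryptic cleavage rather than from the protein C-terminus.

```python
import re


def n_term_acetylation_allowed(peptide: str) -> bool:
    """True when the first residue satisfies ^[GASTMCDN]."""
    return re.match(r"^[GASTMCDN]", peptide) is not None


def lysine_mod_sites(peptide: str) -> list:
    """Candidate K sites, excluding a K at the tryptic C-terminus."""
    last = len(peptide) - 1
    return [i for i, aa in enumerate(peptide) if aa == "K" and i != last]


def deamidation_sites(peptide: str) -> list:
    """N residues that satisfy N[^P][ST]; lookahead allows overlapping motifs."""
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", peptide)]


def hydroxyproline_sites(peptide: str) -> list:
    """P residues inside G.PG; the P sits two residues after the leading G."""
    return [m.start() + 2 for m in re.finditer(r"(?=G.PG)", peptide)]
```

For the two peptides above, `lysine_mod_sites("DVTTGYDSSQPNK")` gives no candidate site, and `lysine_mod_sites("DVTTGYDSSQPNKK")` gives only the second-to-last residue.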

mvaudel commented 4 years ago

In addition, it should be possible to specify where the modification is attached on the motif. The format needs to specify that it is zero-based and what the default is. e.g. motif="[ST]" target=-2 would search modifications two amino acids before any S or T, which would be equivalent to motif="XX[ST]" with a default target of 0. motif="[ST]" target=1 would look for a modification after any S or T.
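A minimal sketch of the motif plus zero-based target idea, assuming the target is an offset relative to the start of each motif match. The function name is hypothetical. In a regular expression the "any residue" placeholder X would be written as a dot.

```python
import re


def find_target_sites(sequence: str, motif: str, target: int = 0) -> list:
    """Return zero-based positions at offset `target` from each motif match start."""
    sites = []
    # The lookahead lets overlapping motif occurrences all be reported.
    for match in re.finditer(f"(?=(?:{motif}))", sequence):
        site = match.start() + target
        if 0 <= site < len(sequence):  # drop targets that fall off the sequence
            sites.append(site)
    return sites


before = find_target_sites("AKGSA", "[ST]", target=-2)
same = find_target_sites("AKGSA", "..[ST]", target=0)
after = find_target_sites("AKGSA", "[ST]", target=1)
```

In this toy sequence, the first two calls point at the same residue, which mirrors the equivalence described above.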

ypriverol commented 4 years ago

PTM site position ongoing discussion:

I will try to summarize the discussion about the PTM site parameter, which is blocking the first PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15 .

1- Target Amino acid (TA) (Proposed by @ypriverol)

TA=M

Target amino acid letter. If the modification targets multiple sites, it should be provided as a Target Regular Expression (TR).

Pros:

Cons:

2- Target Amino Acid as Regular expression (proposed by @RonBeavis @mvaudel ):

TA=N[^P][ST]

This proposal aims to represent all sites into a regular expression including motifs, etc.

Pros:

Cons:

Comments are needed here to agree on one of the options: @mvaudel @mwalzer @RalfG @RonBeavis @prvst @trishorts .

RalfG commented 4 years ago

I tend to prefer option 2, as it is more comprehensive and correct. I agree that this option is more difficult for human submitters and human readers, but a well-designed submission form should be able to take these issues away for the common modifications.

I suspect that regex validators already exist for most programming languages?

mobiusklein commented 4 years ago

Option 2 still lacks a way to express which amino acid is the actual target. In this case, the N-glycosylation motif modifies the first amino acid (N), but this isn't guaranteed to be the case. The bacterial N-glycosylation motif has a prefix as well as a suffix around the modification site: [DE][^P]N[^P][ST].

To be able to use a regular expression, we would need to either A) specify capture group index, B) use named capture groups, or C) add a marker to the regular expression to indicate that an amino acid is the target.

The glycosaminoglycan linker glycosylation process preferentially targets S[GA]X[GA] where both S and X may be modified, but X should not be modified if S is not. There's plenty of poorly understood biology here, so we don't know the constraints on X.

If we have to use a capture group, then validation is more than just compiling the regular expression; it also means testing that the pattern contains a capture group. If we want trivial cases not to require a capture group, do we check that the pattern cannot produce matches of length > 1?
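One way such a validation could look, as a sketch only. The rule used here for "trivial" patterns (a single letter or a single character class) is a simplification chosen for the example and does not prove that matches have length 1 in general. The function name and return values are hypothetical.

```python
import re

# A single residue letter, or one character class such as [ST] or [^P].
SINGLE_RESIDUE = re.compile(r"[A-Z]|\[\^?[A-Z]+\]")


def validate_motif(pattern: str) -> str:
    """Return 'single residue' or 'capture group'; raise ValueError otherwise."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"not a valid regular expression: {exc}") from exc
    if compiled.groups == 1:
        return "capture group"  # the group marks the modified residue
    if compiled.groups == 0 and SINGLE_RESIDUE.fullmatch(pattern):
        return "single residue"  # target is unambiguous without a group
    raise ValueError(
        f"{pattern!r} needs exactly one capture group to mark the target"
    )
```

Under this rule, `N[^P][ST]` would be rejected until the target is marked, for example as `(N)[^P][ST]`.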

ypriverol commented 4 years ago

Can we list a set of examples with the name of modifications and possible Regular expressions? @mvaudel @RonBeavis @mobiusklein @trishorts . I think it will help us to define more clearly option 2.

mobiusklein commented 4 years ago

Beyond glycosylation motifs, I do not know many that are "hard rules", and we stray into a gray area between blind combinatorial expansion rules vs. prescribed target sites from a database.

You can draw a few from PROSITE:

Phosphorylation
https://prosite.expasy.org/PDOC00004 [RK]{2}.([ST])
https://prosite.expasy.org/PDOC00005 ([ST]).[RK]
https://prosite.expasy.org/PDOC00006 ([ST])..[DE]
https://prosite.expasy.org/PDOC00007 [RK].{2,3}[DE].{2,3}(Y)

N-myristoylation
https://prosite.expasy.org/PDOC00008 (G)[^EDRKHPFYW]..[STAGCN][^P]

Amidation
https://prosite.expasy.org/PDOC00009 (.)G[RK]{2}
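With the target marked by a capture group, as in the patterns listed above, a consumer could locate the modified residue as sketched here. The function name is hypothetical and the test sequences are made up for illustration.

```python
import re


def modified_sites(sequence: str, pattern: str) -> list:
    """Zero-based positions of the residue captured by group 1 of the motif."""
    # Wrapping the motif in a lookahead reports overlapping occurrences too;
    # groups captured inside the lookahead remain available on the match.
    wrapped = re.compile(f"(?=(?:{pattern}))")
    return sorted({m.start(1) for m in wrapped.finditer(sequence)})


sites_a = modified_sites("AARKASGG", r"[RK]{2}.([ST])")
sites_b = modified_sites("GSARTAK", r"([ST]).[RK]")
```

Here the capture group, rather than the start of the match, decides which residue is reported.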

ypriverol commented 4 years ago

@mobiusklein :

This representation is more complex than what I was thinking of representing, because it also encodes information about the enzyme. What do you think, @mvaudel @trishorts @RonBeavis?

trishorts commented 4 years ago

I don't really have any comments about how you represent motifs. I like having motifs where they are appropriate. We don't use regex unless it can't be avoided.

trishorts commented 4 years ago

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched. And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations like those that Ron has mentioned earlier that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced. So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

ypriverol commented 4 years ago

Thanks for this comment, @trishorts. I think the document makes clear what the original intention of these efforts is.

New topic. I'm no longer certain just what you are trying to capture here. I see two competing themes. One is an attempt to capture how a submitted data set WAS searched.

1.- THIS IS THE MAIN INTENTION. The current metadata about experimental design in public databases, including PRIDE, is really poor. This makes data reuse and reproducibility really difficult. We want to provide a tab-delimited format that enriches the data submission process in two directions:

1.1- The file format should be able to provide information about the experimental design and sample metadata, including taxonomy, tissues, etc. We are proposing SDRF because RNA-Seq has been using the format for more than 10 years and we have thousands and thousands of well-annotated projects with no problems (including single-cell experiments). Using SDRF will enable us and the proteomics community to move towards multiomics, annotating proteomics and transcriptomics experiments in the same way.

1.2- We need to provide sufficient information about the data analysis protocol to describe how the data was processed. This "protocol" description within the SDRF is specific to each field, in our case proteomics, and we need to define some rules about how to capture it, including how to encode PTM search parameters (this issue). The next discussion should be about enzyme, fragment tolerances, TMT fragment ion masses, etc.

And the other is to capture how a submitted data set SHOULD HAVE BEEN searched. I think there are some important considerations like those that Ron has mentioned earlier that will eliminate lots of false positives. But I see that as the job of the search engine and the original searchers. If someone does something "wrong" and submits those search results, I think it's good to know how those wrong answers were produced.

Agree.

So, if someone searches for lysine acetylation everywhere (which is not correct), then I want to know that they did that so that I can question the results. If "we" require that acetylation be not allowed at tryptic peptide termini in the recording of the entry but the user had mistakenly allowed it, then there is a problem. I don't have a recommendation but I see a collision.

Most of the search engines and resources I looked at (MSGF+, Comet, UNIMOD) expose the following properties of a modification parameter to users: Accession or Name, Position [anywhere, C and N-term, Protein C and N-term], Composition, and Mass shift or Monoisotopic mass.

The current PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15 aims to define those properties first, since they are the easiest to define. In my opinion, the current definition of the amino acid target (AT) should cover only which amino acids will be modified.

AT = S,T,Y  

Then, what I now call TR (target regular expression) should be used to define more complex structures. I see now that SearchGUI (@mvaudel) uses a Pattern Design defined as Target AA and Excluded AA.

If we accept the current proposal in PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15 , then we can clearly discuss how to encode the full information of PTM parameters as regular expressions.
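
To make the key-value idea concrete, here is a minimal sketch of reading one such line. Only AT, TR and the `;` separator come from this thread; the keys NT (name), MT (modification type) and PP as used in the example string, the `KEY=value` syntax, and the function name `parse_modification` are assumptions for illustration, not the definition in the PR.

```python
def parse_modification(line):
    """Parse 'KEY=value;KEY=value' into a dict (hypothetical syntax).

    AT is split on commas into a list of residues; other values stay strings.
    """
    fields = {}
    for part in line.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only, so a value such as a TR regex
        # may itself contain '='.
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=value, got {part!r}")
        fields[key.strip()] = value.strip()
    if "AT" in fields:
        fields["AT"] = [aa.strip() for aa in fields["AT"].split(",") if aa.strip()]
    return fields


# NT, MT and PP values here are made up for the example.
example = "NT=Phospho;AT=S,T,Y;MT=Variable;PP=Anywhere"
parsed = parse_modification(example)
print(parsed["AT"])  # ['S', 'T', 'Y']
```

One caveat with this sketch: a TR pattern containing `;` would break the split, so the real format would need an escaping rule or a different separator.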

trishorts commented 4 years ago

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches variable phosphorylation on, say, alanine, then AT = A, and that's what we want to know. I think this clarifies everything for me. Thanks. BTW, I couldn't begin to construct such a REGEX.

mobiusklein commented 4 years ago

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable. Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.
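
As a rough sketch of how that split could work in practice: the target list picks candidate residues and the optional constraint pattern filters them. The function name `candidate_sites` and the convention that the constraint is matched starting at the candidate residue are assumptions for this example; the thread has not settled either. The example constraint is the usual N-glycosylation sequon, N followed by any residue except P, then S or T.

```python
import re


def candidate_sites(sequence, targets, constraint=None):
    """Return 1-based positions of residues listed in `targets`.

    If `constraint` is given, it is a regex that must match starting at the
    candidate residue itself (an assumed convention for this sketch).
    """
    anchored = re.compile(constraint) if constraint else None
    sites = []
    for index, residue in enumerate(sequence):
        if residue not in targets:
            continue
        # Pattern.match(string, pos) only succeeds if the match begins at pos.
        if anchored is not None and anchored.match(sequence, index) is None:
            continue
        sites.append(index + 1)
    return sites


print(candidate_sites("ANGTNPSN", "N"))                # [2, 5, 8]
print(candidate_sites("ANGTNPSN", "N", r"N[^P][ST]"))  # [2]
```

Keeping the target residues separate means a reader who ignores the constraint still sees which amino acids were searched.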

Is the intent of this experimental design section to capture all modifications, or only variable modifications? Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

Repeat above for cross-linked peptide experiments?

ypriverol commented 4 years ago

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches variable phosphorylation on, say, alanine, then AT = A, and that's what we want to know. I think this clarifies everything for me.

Can you review the following PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15 ? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definition N-term and C-term.

RalfG commented 4 years ago

@ypriverol:

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definition N-term and C-term.

If we are talking about modifications targeting the N-term NH2- or the C-term -COOH, I think N-term and C-term would be good ways to describe them. If we are talking about PTMs specifically targeting the side-chain of an N-term/C-term amino acid, I would go for ".", "*" or "any" in combination with the PP (polypeptide position) key.

Mass shift-wise, this does not really matter. But I guess for "blocking" the sites in the search space, it could, in theory, make a difference.

ypriverol commented 4 years ago

Splitting modification specification into "amino acid target" TA and a "constraint pattern" TR where appropriate seems reasonable.

OK

Specifying everything as a regex would be difficult, especially since there are so many ways to write the same pattern.

I will open a new issue about that, to discuss possible implementations. In the current PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15 that definition is pending until we have a decision.

Is the intent of this experimental design section to capture all modifications, or only variable modifications?

Variable and fixed modifications defined as parameters in the search. See the definition in the PR https://github.com/PRIDE-Archive/pride-metadata-standard/pull/15

Should open search engines and multi-round search engines include all the modifications they could consider or in some way communicate the range of "dark mass" they allow?

For "dark modifications" we can use the name "Unknown modification" together with the mass shift and all possible amino acids.
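
A minimal sketch of what such an entry might look like when written out. Everything about the syntax here is an assumption for illustration: the keys NT (name) and MS (mass shift), the `min..max` window notation, and the helper name `unknown_modification` are not defined in the PR. Only the name "Unknown modification", the idea of a mass shift, and the use of all amino acids come from the sentence above.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unknown_modification(min_shift, max_shift):
    """Build a one-line entry for an open-search mass window (hypothetical keys)."""
    if min_shift > max_shift:
        raise ValueError("min_shift must not exceed max_shift")
    fields = [
        ("NT", "Unknown modification"),
        ("AT", ",".join(STANDARD_AMINO_ACIDS)),
        # Assumed notation: lower and upper mass shift in Da, joined by '..'.
        ("MS", f"{min_shift:g}..{max_shift:g}"),
    ]
    return ";".join(f"{key}={value}" for key, value in fields)


print(unknown_modification(-150, 500))
```

Whether a single window is enough for multi-round search engines is a separate question that this sketch does not address.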

We've talked about glycosylation site motifs, how about glycans themselves? When you look at PRIDE's glycoproteomics entries, they do not explicitly specify that the study looked at glycopeptides, and what the glycan database was. Depending upon what you're looking for, that can be anywhere from five to over nine thousand different glycans, represented at different levels of specificity. Is this something to capture in this one-line description scheme?

We already clarified that for large-scale annotation of PTM searches we should use database annotations like PEFF.

trishorts commented 4 years ago

@trishorts:

As I understand it, then, we need a target REGEX that will capture what was searched, including motifs, and "we" shouldn't block any motif/PTM combos. So, if someone searches variable phosphorylation on, say, alanine, then AT = A, and that's what we want to know. I think this clarifies everything for me.

Can you review the following PR #15 ? I made minor changes to reflect the latest discussion.

The only thing pending is how to define modifications that affect the N- and C-terminal positions rather than amino acids. I like the UNIMOD definition N-term and C-term.

I'm on board with this