Peak interpretation format

HUPO-PSI / mzSpecLib

mzSpecLib: A standard format to exchange/distribute spectral libraries

https://hupo-psi.github.io/mzSpecLib/

Apache License 2.0

Peak interpretation format #23

Closed RalfG closed 1 year ago

RalfG commented 4 years ago

As part of the new PSI spectral library format, it will be possible to annotate the interpretations of individual peaks, as is already done in NIST, SpectraST, and PeptideAtlas libraries. However, there have been several different styles of interpretations in the past (even from a single provider). This document therefore describes a single common peak interpretation format for peptides that is recommended for all peptide libraries and related applications for which peak interpretations are desirable.

This format, as currently described, is designed for unbranched peptides with simple PTMs and for fragmentation methods commonly used in proteomics such as CID, HCD and ETD. Although there are some provisions for annotating small molecules (e.g., contaminants in a predominantly peptide spectrum), as well as unusual fragments, it is expected that for other major classes of analytes (metabolites, glycans, glycopeptides, cross-linked peptides...), alternative peak interpretation formats should be defined.

See working document for ongoing discussion.

meowcat commented 4 years ago

Are interested parties supposed to make suggestions here or in the working document?

edeutsch commented 4 years ago

Either one is fine. If your comment is easy to express as an issue in this venue, then that's great. If your comments are better expressed in the context of the document, it is fine to make comments directly in the document.

meowcat commented 4 years ago

OK, so:

Currently

peptides are y5/0.002 or y2+CO-H2O/1.1ppm
formulas are f{C6H12O6}/0.002
there is no suggestion yet for substructural annotation with SMILES (or did I miss it?); I would suggest e.g. s{c1ccc(cc1)O+}/0.001
Impurities start with 0@
Reporter ions are denoted r[TMT121] with square brackets
but as neutral loss they are written without the r[] e.g. p-TMT121
immonium ions start with an uppercase I e.g. IC
Square brackets are also used for combining multiple peaks from the same fragment ion

I guess this all works, but it is becoming quite the decision tree.
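To make the "decision tree" concrete, here is a minimal sketch of how a reader might have to branch on the leading characters of a compact annotation. The function name and the returned labels are made up for illustration; this only covers the cases listed above and is not taken from any specification or existing parser.

```python
import re

def classify_compact(annotation: str) -> str:
    """Guess the interpretation type of one compact-style annotation."""
    text = annotation.strip()
    # Square brackets around the whole entry mark a side peak of another ion.
    if text.startswith("[") and text.endswith("]"):
        return "side peak: " + classify_compact(text[1:-1])
    body = text.split("/", 1)[0]  # drop the mass-error suffix
    if body.startswith("0@"):
        return "impurity"
    if body.startswith("f{"):
        return "formula"
    if body.startswith("r["):
        return "reporter"
    if re.match(r"I[A-Z]", body):
        return "immonium"
    if body.startswith("p"):
        return "precursor"
    if re.match(r"[abcxyz]\d+", body):
        return "peptide"
    return "unknown"
```

Every new ion class adds another special-cased prefix to this chain, which is the maintenance problem the unified notation below tries to avoid.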

Could this not be unified a little bit? The f{} notation already suggests a way to do that: to use annotationtype{annotation} for everything, both in the ion type and in the neutral loss. E.g.

Immonium ions are i{} e.g. i{L}, i{C[Carbamidomethyl]}
Impurities are o{} e.g. o{Adenosine} (I am not a fan of names from the point of view of machine readability, but I see the need)
Reporters are r{} e.g. r{TMT121}
internal ions are m{3:6}

For the peptide fragment ion types

This is certainly the most relevant for most users here... I can think of three variants off the top of my head

either those are left as is, i.e. y13, b3 etc; probably the most convenient because people are used to it (which is a shame from the parsing point of view, but OK, I am the dwindling minority here as a metabolomics person)
or they are individual ion types, i.e. y{13}, b{3}
or there is a peptide fragment ion type, I'm running out of letters here, but say n{y13}, this could also encompass the m ions (n{m3:6})

etc, and the same rules are used for the ion type and the neutral loss. Certainly the letters and precise choices would need some discussion. Note that this makes everything more extendable since we are not limited to using single letters for the annotationtype. Especially for less important cases one could consider e.g. impurity{Adenosine}

Examples for combinations

a loss of TMT121 is p-r{TMT121}
a fragment specified by formula is f{C6H12O6}
a neutral loss specified by formula is p-f{CO2}-f{H2O}
For the peptide ions with losses, any of the notations e.g. y13+f{CO} or n{y13}+f{CO} or y{13}+f{CO}
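As a rough illustration of why the uniform annotationtype{annotation} form is easy to tokenize, the sketch below splits an expression into signed terms with a single regular expression. The names `TERM` and `split_terms` are hypothetical, and the sketch deliberately handles only the braced variants (plus bare letters such as p); the traditional y13 form would need an extra rule.

```python
import re

# One term: optional sign, a type name, and an optional {value} part.
TERM = re.compile(r"([+-]?)([A-Za-z]+)(?:\{([^}]*)\})?")

def split_terms(expression: str) -> list:
    """Split e.g. 'p-f{CO2}-f{H2O}' into (sign, type, value) tuples."""
    terms, pos = [], 0
    while pos < len(expression):
        match = TERM.match(expression, pos)
        if match is None:
            raise ValueError(f"cannot parse {expression!r} at position {pos}")
        sign, kind, value = match.groups()
        terms.append((sign or "+", kind, value))
        pos = match.end()
    return terms

print(split_terms("p-f{CO2}-f{H2O}"))
print(split_terms("n{y13}+f{CO}"))
```

Because the type name is not limited to one letter, a longer form such as impurity{Adenosine} goes through the same rule unchanged.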

Extra suggestions

This could also approach the issue of the sidepeaks currently denoted with square brackets: instead, the side peak could be mapped to the main peak with map{} or peak{} or such:

Original example:

677.299    572         [y7/-0.001]
677.300    5681        y7/0.000
677.301    1320        [y7/0.001]

New suggestion

677.299    572         peak{677.300}/-0.001
677.300    5681        y7/0.000
677.301    1320        peak{677.300}/0.001

Note that this allows the apex of the peak to be marked with the correct interpretation, rather than the peak entry closest to the mass. So this could be y7/0.002 if the true mass of y7 is 677.298.
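A small sketch of how a reader could resolve such peak{} references back to the main peak's interpretation. It assumes the reference repeats the m/z text of the main peak exactly, so the m/z is kept as a string key to avoid floating-point comparison; `resolve_side_peaks` is a hypothetical helper name.

```python
import re

PEAK_REF = re.compile(r"^peak\{([^}]+)\}")

def resolve_side_peaks(peaks):
    """peaks: list of (mz_text, intensity, annotation); returns (mz_text, ion) pairs."""
    # Index the interpretation (without mass error) of every non-reference peak.
    main = {mz: ann.split("/", 1)[0]
            for mz, _, ann in peaks if not PEAK_REF.match(ann)}
    resolved = []
    for mz, _, ann in peaks:
        ref = PEAK_REF.match(ann)
        # A missing target raises KeyError, which flags a dangling reference.
        ion = main[ref.group(1)] if ref else ann.split("/", 1)[0]
        resolved.append((mz, ion))
    return resolved

peaks = [
    ("677.299", 572, "peak{677.300}/-0.001"),
    ("677.300", 5681, "y7/0.000"),
    ("677.301", 1320, "peak{677.300}/0.001"),
]
print(resolve_side_peaks(peaks))
```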

meowcat commented 4 years ago

This could go even further. Currently the / separates precision from interpretation, then there is * to indicate confidence, and , separates multiple interpretations of a peak. Precision is parsed ad-hoc as m/z value or ppm value. There is also no way to add any extra comment to a single interpretation, or to easily extend the specification. This could be remedied by using

, to separate fields within an interpretation
; to separate alternative interpretations
delta{0.2ppm}, delta{0.001} or delta{0.001mz} for precision
confidence{0.88} for confidence
note{Arbitrary note} for arbitrary notes
the impurity flag can be separated from the interpretation of the impurity, e.g. impurity{}. Instead of the abovementioned o{}, we would then introduce a "trivial name" ion type called name{}.
charge{2} for charge could likewise be discussed.

To give examples:

page 17 original:

677.302    240       [y7/0.002]
677.303    34        b6-H2O/-0.005,[y7/0.003]

new:

677.302    240       y7,delta{0.002mz}
677.303    34        b6-f{H2O},delta{-0.005mz}; peak{677.300},delta{0.003mz}

page 18 orig:

y12/3.4ppm*0.85,b9-NH3/5.2ppm*0.05

new:

y12,delta{3.4ppm},confidence{0.85}; b9-f{NH3},delta{5.2ppm},confidence{0.05}

(or correspondingly n{y12} etc.)

page 12 orig:

0@_Adenosine

new:

name{Adenosine},impurity{}

Overall the goal of my proposed modifications is to make complex annotations more easily readable, both to humans and machines. The drawback, if you want to call it one, is increased verbosity. But computers don't care about verbosity, as long as it's well specified. Fewer suffixes and prefixes. Space to integrate e.g. lipids easily: goslin{} https://apps.lifs.isas.de/goslin/
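To show how little parsing logic the proposed separators need, here is a minimal sketch that splits on ; and , and picks out the metadata fields. The set `META` and the function names are assumptions for illustration, and the naive split would break on notes that themselves contain a comma or semicolon, so a real parser would have to respect the braces.

```python
import re

FIELD = re.compile(r"^([a-z_]+)\{([^{}]*)\}$")
DELTA = re.compile(r"^(-?\d+(?:\.\d+)?)(ppm|mz)?$")
# Fields treated as metadata; anything else is taken as the ion expression.
META = {"delta", "confidence", "note", "impurity", "charge"}

def parse_delta(text):
    """Return (value, unit); a bare number is read as an m/z difference."""
    match = DELTA.match(text)
    if match is None:
        raise ValueError(f"bad delta: {text!r}")
    return float(match.group(1)), match.group(2) or "mz"

def parse_annotation(text):
    """Split alternatives on ';' and fields on ',' (naive, see caveat above)."""
    alternatives = []
    for alternative in text.split(";"):
        interpretation = {"ion": None}
        for field in alternative.split(","):
            field = field.strip()
            match = FIELD.match(field)
            if match and match.group(1) in META:
                interpretation[match.group(1)] = match.group(2)
            else:
                interpretation["ion"] = field
        if "delta" in interpretation:
            interpretation["delta"] = parse_delta(interpretation["delta"])
        alternatives.append(interpretation)
    return alternatives

print(parse_annotation("y12,delta{3.4ppm},confidence{0.85}; b9-f{NH3},delta{5.2ppm},confidence{0.05}"))
```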

edeutsch commented 4 years ago

Hi @meowcat thanks for this well-reasoned alternative. I will summarize here in a table what I see as a translation table between the current proposal and your proposed alternative:

Current proposed spec                 More verbose alternative
y2/4.3ppm                             peptide{y2},delta{4.3ppm}
y4^2/4.3ppm                           peptide{y4},delta{4.3ppm},charge{2}
z4+i^3/3.3ppm                         peptide{z4},isotope{1},charge{3},delta{3.3ppm}
b3-H2O/0.002                          peptide{b3},formula{-H2O},delta{0.002mz}
2@p-NH3/1.4ppm                        precursor{},f{-NH3},delta{1.4ppm},analyte{2}
IH+CO/0.008                           immonium{H},f{CO},delta{0.008mz}
IC[Carbamidomethyl]/1.8ppm            immonium{C[Carbamidomethyl]},delta{1.8ppm}
0@_Adenosine/0.6ppm                   name{Adenosine},analyte{0},delta{0.6ppm}
m3:6-CO/3.2ppm                        internal{3:6},formula{-CO},delta{3.2ppm}
?                                     unknown{},comment{Probably contamination}
r[TMT127N]/0.0007                     reporter{TMT127N},delta{0.0007mz}
p-[iTRAQ114]-CO/8.4ppm                precursor{},reporter{-iTRAQ114},formula{-CO},delta{8.4ppm}
y12/3.4ppm*0.85,b9-NH3/5.2ppm*0.05    peptide{y12},delta{3.4ppm},confidence{0.85};peptide{b9},formula{-NH3},delta{5.2ppm},confidence{0.05}
[y7/-0.001]                           peptide{y7},delta{-0.001mz},primary_peak{677.300}
G????                                 glycan{????}
L????                                 lipid{????}
X????                                 xlink{????}
S????/0.002                           smiles{c1ccc(cc1)O},delta{0.002mz}
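For the simple peptide rows, the translation is mechanical enough to sketch in a few lines. The regular expression and function names below are illustrative assumptions; the sketch covers only a/b/c/x/y/z ions with formula losses, an optional charge, a mass error, and an optional confidence. Isotopes, analyte prefixes, immonium, reporter, and internal ions are rejected. The field order (peptide, formula, delta, charge, confidence) is one arbitrary choice.

```python
import re

COMPACT = re.compile(
    r"^(?P<ion>[abcxyz]\d+)"
    r"(?P<losses>(?:[+-][A-Z][A-Za-z0-9]*)*)"
    r"(?:\^(?P<charge>\d+))?"
    r"/(?P<delta>-?\d+(?:\.\d+)?)(?P<unit>ppm)?"
    r"(?:\*(?P<conf>\d+(?:\.\d+)?))?$"
)

def to_verbose(interpretation: str) -> str:
    """Translate one compact peptide interpretation to the predicate form."""
    m = COMPACT.match(interpretation)
    if m is None:
        raise ValueError(f"unsupported annotation: {interpretation!r}")
    parts = [f"peptide{{{m['ion']}}}"]
    for sign, formula in re.findall(r"([+-])([A-Z][A-Za-z0-9]*)", m["losses"]):
        # Losses keep their minus sign; gains are written without a sign.
        parts.append("formula{" + ("-" if sign == "-" else "") + formula + "}")
    parts.append(f"delta{{{m['delta']}{m['unit'] or 'mz'}}}")
    if m["charge"]:
        parts.append(f"charge{{{m['charge']}}}")
    if m["conf"]:
        parts.append(f"confidence{{{m['conf']}}}")
    return ",".join(parts)

def translate(annotation: str) -> str:
    """Alternatives are ','-separated in the compact form, ';' in the verbose one."""
    return ";".join(to_verbose(part) for part in annotation.split(","))

print(translate("y12/3.4ppm*0.85,b9-NH3/5.2ppm*0.05"))
```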

What do others think?

I think we will likely discuss this in depth in the call this coming Friday. Would you join us, @meowcat ?

meowcat commented 4 years ago

Hi, sorry I wasn't replying, I was gone last Friday. If you are still discussing this, I would participate.

Note that since this suggested format is vaguely approaching JSON in appearance and scope, another idea would be to make it JSON entirely: {peptide: b3, delta: 4.3ppm, etc}. But then, strictly, keys and strings need to go in quotes: {"peptide": "b3", "delta": "4.3ppm", etc}. Perhaps a step too much. As an alternative, YAML is less strict there: [peptide: b3, delta: 4.3ppm, etc]. But proprietary is fine too, as long as we don't limit ourselves inadvertently.
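A quick check of the JSON idea with Python's standard json module, as a sketch only. The field names peptide, delta, and charge are just the ones used in this thread, not a defined schema. The point is that a fully quoted form round-trips, while the unquoted shorthand is rejected by a strict parser.

```python
import json

interpretation = {"peptide": "b3", "delta": "4.3ppm", "charge": 2}
encoded = json.dumps(interpretation)   # every key and string value gets quoted
decoded = json.loads(encoded)          # and parses back to the same dict

def is_valid_json(text: str) -> bool:
    """True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(encoded)
print(is_valid_json('{peptide: b3, delta: 4.3ppm}'))
```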

Some notes to your enhanced suggestion:

you do away with the addition and subtraction as discrete operators, which is probably even cleaner; there are pros and cons to this. A pro is that there are fewer parser tokens. A possible con is that I can't specify two descriptions that specify the same structure for one peak in one annotation: E.g. Phenol [M+H]+ minus H2O with 2 ppm shift is precursor{},f{-H2O},delta{2ppm} but also smiles{c1ccccc1.[H+]},delta{2ppm}. (Note: it would have to be specified how the charged forms of SMILES should be handled). With operators this could be precursor{}-f{H2O},smiles{c1ccccc1.[H+]},delta{2ppm}
For the case [y7/-0.001] this can be confusing, since it is not clear whether delta{-0.001mz} relates to the annotation y7 or to the mass of the primary_peak. The point marked as primary peak (say, the apex or centroid of a profile mode peak) may have a mass shift to the correct annotation (say 1 ppm), but the profile points have a shift to the main peak. I personally would just specify what main peak the peak belongs to, since this is sufficient to categorize it.

mobiusklein commented 4 years ago

The idea of using these predicates/operators to describe the annotations looks like a good way to break out of the conflicting "annotation style" issue between domains, and it does go some ways towards improving machine parse-ability while retaining human readability. On the other hand, it makes the common use-cases use a lot more space as we now have to explicitly tag every attribute.

I don't think it's reasonable to do away with arbitrary arithmetic expressions. While you could argue that precursor{}-f{H2O}-f{H2O} is better written precursor{}-f{H4O2}, it fails to capture scenarios where the loss is not only a formula, such as precursor{}-[Phospho] or precursor{}-f{H2O}-[Phospho]. While those complex loss scenarios aren't the majority, they aren't uncommon either.
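The formula-only case really can be folded mechanically, which is what makes the mixed case the interesting one. A minimal sketch, assuming simple formulas without parentheses, isotopes, or charges; `parse_formula` and `combine_losses` are hypothetical names, and the output is ordered alphabetically by element rather than in any standard chemical order.

```python
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'H2O' or 'C6H12O6'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def combine_losses(*formulas: str) -> str:
    """Sum several formula losses into one formula string."""
    total = Counter()
    for formula in formulas:
        total += parse_formula(formula)
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in sorted(total.items()))

print(combine_losses("H2O", "H2O"))
```

A named modification such as [Phospho] has no place in this arithmetic unless it is first looked up and expanded to a formula, which loses the information that the loss was a modification.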

If we do use predicates, we would need to specify what each predicate meant, and whether certain predicates can go "together", for instance if you use peptide can you also use smiles in the same annotation. Further, how would implementers be expected to cope with new predicates being introduced? One approach might be to define each predicate externally, and then each predicate would need to be looked up, but expressing that relational concept in an ontological format might be difficult without a well defined schema, or need the parser to be "intelligent" in a way that makes adding new predicates difficult.

One compromise would be to keep using "annotation styles" but just text-encode annotations using the predicate format instead of the compact notation, but this doesn't mandate anything for binary formats where what would be saved could be the annotation data, not necessarily the text-encoding of the annotation itself.

I do think that the extensibility idea is a good one though. I apologize for the overly negative tone of this post as it is written in haste.

meowcat commented 4 years ago

I mean, in principle the predicate{} suggestion is just another annotation style. I just think it makes sense to make an annotation style that is useful for a broad range of purposes, and I think it is feasible to formulate a set of predicates that covers a lot of ground. Then it is however still extensible (and implementations can possibly still read and roundtrip extra tags that they don't understand, just not interpret them appropriately).

Say we gather a solid base set of what we currently think is needed and call this annotation-style:core-1.0, whoever feels they need additional fields (say, for some internal fields calculated by a software suite) could make an annotation-style:extension-0.2 which inherits from annotation-style:core-1.0? If important new features emerge, they can later be incorporated into core-2.0...
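A sketch of what reading and round-tripping tags that an implementation does not understand could look like. The contents of `CORE_1_0` and the tag vendor_score are invented for the example; nothing here is a defined core set.

```python
import re

# Hypothetical core predicate set that this reader knows how to interpret.
CORE_1_0 = {"peptide", "formula", "delta", "charge", "confidence"}
PREDICATE = re.compile(r"([A-Za-z_]+)\{([^{}]*)\}")

def read_predicates(annotation: str) -> list:
    """Return (name, value, understood) for every predicate, in order."""
    return [(name, value, name in CORE_1_0)
            for name, value in PREDICATE.findall(annotation)]

def write_predicates(predicates: list) -> str:
    """Write all predicates back, including those not understood."""
    return ",".join(f"{name}{{{value}}}" for name, value, _ in predicates)

text = "peptide{y2},delta{4.3ppm},vendor_score{17}"
parsed = read_predicates(text)
print(parsed)
print(write_predicates(parsed) == text)
```

The unknown tag is carried through untouched, so a tool limited to the core set does not destroy extension data written by another tool.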

(But I'm not good at the technical part of ontologies, so others might disagree on how this should be done.)

> and whether certain predicates can go "together", for instance if you use peptide can you also use smiles in the same annotation.

My feeling here is that overspecifying things will not help. In the end, what is the purpose of these annotations? 1) an interpretation aid for the reader, 2) an interpretation aid for software, 1.5) an interpretation aid for the reader that is visualized by software, 3) something else? (In my opinion I wouldn't see a reason why I can't have peptide{y2}-smiles{c1ccccc1} for some benzene loss off some peptide.)

> but this doesn't mandate anything for binary formats where what would be saved could be the annotation data, not necessarily the text-encoding of the annotation itself.

The same goes for any other annotation (like the compact format) though; I actually see advantages for more streamlined binary serialization with the predicate format over the compact format.

> On the other hand, it makes the common use-cases use a lot more space as we now have to explicitly tag every attribute.

Yes, that's certainly true. For visualization in software this can be circumvented, but in text-format-serialized records it will stay bulky. A shorthand like p{} or even possibly no-prefix {} (e.g. {y1}) for at least the simplest peptide case might be useful.

meowcat commented 1 year ago

Hi all, I see f{C6H12O6} made it into the specification - any chance we can see s{SMILES} for known substructures? Would greatly enhance the generality of the format. Otherwise we have a big gap between "peptide" and "formula" that could IMO be avoided.

mobiusklein commented 1 year ago

@meowcat I'm not familiar enough with SMILES to say this with certainty, but I think it uses curly braces to denote charge, which may or may not make s{SMILES} too irregular for our existing regex parser. I might be able to bend the pattern around this problem, but is the charge feature used much or would the global charge of the peak annotation be enough information?

meowcat commented 1 year ago

https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system

I am not aware of curly braces in SMILES. However, brackets are used frequently. Not sure if that would be a problem. Off the top of my head, the characters are A-Za-z0-9+-=#()[]@\/. (some of them only relevant for stereochemistry). Ah yes, % also for ring-closure numbers of 10 and above.

Charges are expressed like [Na+][Cl-].

I'm finding this regex which takes into account that J doesn't appear in the periodic table, it looks right but I don't have an authoritative answer. Google tells me that $ is for the quadruple bond. I don't remember ever seeing this in the wild. /^([^J][A-Za-z0-9@+\-\[\]\\\/%=#$]+)$/ https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb#file-_smiles_inchi_annotated-js-L12

There is an extension called "ChemAxon Extended SMILES" (CxSMILES) where curly brackets are used for R-group description, but this is far outside the proposed scope.
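As a rough sanity check, the character list quoted above (plus % and $) can be turned into a simple membership test. This is only a character-level filter under that assumed alphabet, which is probably not complete, and it says nothing about whether the string is chemically valid SMILES.

```python
import string

# Characters mentioned in this comment; treated here as the allowed alphabet.
SMILES_CHARS = set(string.ascii_letters + string.digits + "+-=#()[]@\\/.%$")

def uses_only_smiles_characters(text: str) -> bool:
    """True if the text is non-empty and stays within the assumed alphabet."""
    return bool(text) and set(text) <= SMILES_CHARS

print(uses_only_smiles_characters("[Na+].[Cl-]"))
print(uses_only_smiles_characters("C{R1}"))
```

Curly braces fall outside the set, so a CxSMILES-style R-group description would be flagged rather than silently accepted.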

mobiusklein commented 1 year ago

We discussed this at the last weekly call, and agreed to add s{SMILES} support to the annotation format specification. The only special character that we have to worry about is "}" because of the regex we're using to portion up the annotation string. I had been misled by reading an old/non-authoritative reference. @hechth set the record straight.

Returning to the SMILES charge specification, we concluded that the expected net charge of the ion would be written as part of the peak annotation format, but the writer is free to specify any local charges though not all readers will know what to do with them.
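To illustrate why "}" is the only character that matters for splitting, here is a minimal sketch that pulls the SMILES body out of an s{...} term by reading up to the first closing brace. The pattern is an assumption made for this example and is not the regex used by the reference parser.

```python
import re

# Everything up to the first "}" is the SMILES string; the rest is left unparsed.
SMILES_TERM = re.compile(r"^s\{([^}]*)\}(.*)$")

def split_smiles_annotation(annotation: str) -> tuple:
    """Return (smiles, remainder) for an annotation starting with s{...}."""
    match = SMILES_TERM.match(annotation)
    if match is None:
        raise ValueError(f"not an s{{...}} annotation: {annotation!r}")
    return match.group(1), match.group(2)

print(split_smiles_annotation("s{c1ccc(cc1)O}/0.001"))
print(split_smiles_annotation("s{C/C=C/C}/0.002"))
```

Slashes and square brackets inside the braces, including local charges such as [Na+], pass through untouched because only the closing brace ends the term.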

edeutsch commented 1 year ago

This has been included in the mzPAF specification, which is currently under community review. If there are further concerns, please report them against the current version:

https://psidev.info/mzPAF