Closed RalfG closed 1 year ago
Are interested parties supposed to make suggestions here or in the working document?
Either one is fine. If your comment is easy to express as an issue in this venue, then that's great. If your comments are better expressed in the context of the document, it is fine to make comments directly in the document.
OK, so:
Currently:
y5/0.002 or y2+CO-H2O/1.1ppm
f{C6H12O6}/0.002
s{c1ccc(cc1)O+}/0.001
0@
r[TMT121] with square brackets
p-TMT121
I guess this all works, but it is becoming quite the decision tree.
Could this not be unified a little bit? The f{} notation already suggests a way to do that: use annotationtype{annotation} for everything, both in the ion type and in the neutral loss. E.g.
i{}, e.g. i{L}, i{C[Carbamidomethyl]}
o{}, e.g. o{Adenosine} (I am not a fan of names from the point of view of machine readability, but I see the need)
r{}, e.g. r{TMT121}
m{3:6}
For the peptide fragment ion types
This is certainly the most relevant for most users here... I can think of three variants off the top of my head:
y13, b3 etc.; probably the most convenient because people are used to it (which is a shame from the parsing point of view, but OK, I am the dwindling minority here as a metabolomics person)
y{13}, b{3}
n{y13}, which could also encompass the m ions (n{m3:6})
etc., and the same rules are used for the ion type and the neutral loss. Certainly the letters and precise choices would need some discussion. Note that this makes everything more extensible since we are not limited to using single letters for the annotationtype. Especially for less important cases one could consider e.g. impurity{Adenosine}
Examples for combinations
p-r{TMT121}
f{C6H12O6}
p-f{CO2}-f{H2O}
y13+f{CO} or n{y13}+f{CO} or y{13}+f{CO}
Extra suggestions
This could also address the issue of the side peaks currently denoted with square brackets: instead, the side peak could be mapped to the main peak with map{} or peak{} or such:
Original example:
677.299 572 [y7/-0.001]
677.300 5681 y7/0.000
677.301 1320 [y7/0.001]
New suggestion
677.299 572 peak{677.300}/-0.001
677.300 5681 y7/0.000
677.301 1320 peak{677.300}/0.001
Note that this allows the apex of the peak to be marked with the correct interpretation, rather than the peak entry closest to the mass. So this could be y7/0.002 if the true mass of y7 is 677.298.
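A minimal Python sketch of how a reader might resolve such references back to the primary peak. The peak{} predicate is only a suggestion in this thread, and the function name, regex, and tolerance below are hypothetical choices made for illustration.

```python
import re

# Peak list from the example above: (m/z, intensity, annotation).
peaks = [
    (677.299, 572, "peak{677.300}/-0.001"),
    (677.300, 5681, "y7/0.000"),
    (677.301, 1320, "peak{677.300}/0.001"),
]

# Hypothetical pattern for the suggested peak{...} reference.
PEAK_REF = re.compile(r"^peak\{([0-9.]+)\}")

def resolve_side_peaks(peaks, tol=1e-6):
    """Map each m/z to the interpretation of the primary peak it points to."""
    resolved = {}
    for mz, _intensity, annotation in peaks:
        match = PEAK_REF.match(annotation)
        if match is None:
            # A primary peak carries its own interpretation (text before "/").
            resolved[mz] = annotation.split("/")[0]
            continue
        target = float(match.group(1))
        # Look up the referenced primary peak by m/z within a small tolerance.
        for other_mz, _i, other_annotation in peaks:
            if abs(other_mz - target) <= tol and not PEAK_REF.match(other_annotation):
                resolved[mz] = other_annotation.split("/")[0]
                break
    return resolved

print(resolve_side_peaks(peaks))
```

With this layout all three entries resolve to y7, and only the apex entry needs to carry the actual interpretation.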
This could go even further. Currently the / separates precision from interpretation, then there is * to indicate confidence, and , separates multiple interpretations of a peak. Precision is parsed ad-hoc as m/z value or ppm value. There is also no way to add any extra comment to a single interpretation, or to easily extend the specification. This could be remedied by using
, to separate fields within an interpretation
; to separate alternative interpretations
delta{0.2ppm}, delta{0.001} or delta{0.001mz} for precision
confidence{0.88} for confidence
note{Arbitrary note} for arbitrary notes
impurity{}. Instead of the abovementioned o{}, we would then introduce a "trivial name" ion type called name{}.
charge{2} could likewise be discussed.
To give examples:
page 17 original:
677.302 240 [y7/0.002]
677.303 34 b6-H2O/-0.005,[y7/0.003]
new:
677.302 240 y7,delta{0.002mz}
677.303 34 b6-f{H2O},delta{-0.005mz}; peak{677.300},delta{0.003mz}
page 18 orig:
y12/3.4ppm*0.85,b9-NH3/5.2ppm*0.05
new:
y12,delta{3.4ppm},confidence{0.85}; b9-f{NH3},delta{5.2ppm},confidence{0.05}
(or correspondingly n{y12} etc.)
page 12 orig:
0@_Adenosine
new:
name{Adenosine},impurity{}
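To illustrate that this layout stays easy to parse, here is a short Python sketch that splits an annotation on ; and , outside of braces and reads each name{value} field. It assumes the syntax exactly as written in the examples above; the function names and the "ion" key are hypothetical, not part of any specification.

```python
import re

# A field looks like name{value}; the value may be empty, e.g. impurity{}.
FIELD = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\{(.*)\}$")

def split_outside_braces(text, sep):
    """Split text on sep, ignoring separators nested inside {...}."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts]

def parse_annotation(text):
    """Return one dict per alternative interpretation."""
    interpretations = []
    for alternative in split_outside_braces(text, ";"):
        fields = split_outside_braces(alternative, ",")
        # The first field is the ion description, e.g. "y12" or "b9-f{NH3}".
        result = {"ion": fields[0]}
        for field in fields[1:]:
            match = FIELD.match(field)
            if match is None:
                raise ValueError(f"not a name{{value}} field: {field!r}")
            result[match.group(1)] = match.group(2)
        interpretations.append(result)
    return interpretations

example = "y12,delta{3.4ppm},confidence{0.85}; b9-f{NH3},delta{5.2ppm},confidence{0.05}"
for interpretation in parse_annotation(example):
    print(interpretation)
```

Tracking brace depth rather than splitting blindly keeps a comma inside something like note{a, b} from being mistaken for a field separator.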
Overall the goal of my proposed modifications is to make complex annotations more easily readable, both to humans and machines. The drawback, if you want to call it one, is increased verbosity. But computers don't care about verbosity, as long as it's well specified. Fewer suffixes and prefixes. Space to integrate e.g. lipids easily: goslin{} (https://apps.lifs.isas.de/goslin/)
Hi @meowcat thanks for this well-reasoned alternative. I will summarize here in a table what I see as a translation table between the current proposal and your proposed alternative:
Current proposed spec               More verbose alternative
y2/4.3ppm                           peptide{y2},delta{4.3ppm}
y4^2/4.3ppm                         peptide{y4},delta{4.3ppm},charge{2}
z4+i^3/3.3ppm                       peptide{z4},isotope{1},charge{3},delta{3.3ppm}
b3-H2O/0.002                        peptide{b3},formula{-H2O},delta{0.002mz}
2@p-NH3/1.4ppm                      precursor{},f{-NH3},delta{1.4ppm},analyte{2}
IH+CO/0.008                         immonium{H},f{CO},delta{0.008mz}
IC[Carbamidomethyl]/1.8ppm          immonium{C[Carbamidomethyl]},delta{1.8ppm}
0@_Adenosine/0.6ppm                 name{Adenosine},analyte{0},delta{0.6ppm}
m3:6-CO/3.2ppm                      internal{3:6},formula{-CO},delta{3.2ppm}
?                                   unknown{},comment{Probably contamination}
r[TMT127N]/0.0007                   reporter{TMT127N},delta{0.0007mz}
p-[iTRAQ114]-CO/8.4ppm              precursor{},reporter{-iTRAQ114},formula{-CO},delta{8.4ppm}
y12/3.4ppm*0.85,b9-NH3/5.2ppm*0.05  peptide{y12},delta{3.4ppm},confidence{0.85};peptide{b9},formula{-NH3},delta{5.2ppm},confidence{0.05}
[y7/-0.001]                         peptide{y7},delta{-0.001mz},primary_peak{677.300}
G????                               glycan{????}
L????                               lipid{????}
X????                               xlink{????}
S????/0.002                         smiles{c1ccc(cc1)O},delta{0.002mz}
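As a rough illustration of how mechanical this translation is for the simplest peptide cases, here is a Python sketch covering only a series letter plus ordinal, an optional formula loss or gain, an optional charge, and a mass delta. The regex and the function name are hypothetical, and anything outside that subset (isotopes, analyte prefixes, reporters, and so on) is deliberately rejected.

```python
import re

# Hypothetical pattern for a small subset of the compact notation.
COMPACT = re.compile(
    r"^(?P<ion>[abcxyz]\d+)"               # e.g. y4, b3
    r"(?P<loss>[+-][A-Z][A-Za-z0-9]*)?"    # e.g. -H2O (formula only)
    r"(?:\^(?P<charge>\d+))?"              # e.g. ^2
    r"/(?P<delta>-?\d+(?:\.\d+)?)(?P<unit>ppm)?$"
)

def compact_to_verbose(annotation):
    """Translate a simple compact peptide annotation to the verbose style."""
    m = COMPACT.match(annotation)
    if m is None:
        raise ValueError(f"outside the supported subset: {annotation!r}")
    fields = [f"peptide{{{m['ion']}}}"]
    if m["loss"]:
        fields.append(f"formula{{{m['loss']}}}")
    # A delta without a unit is treated as an m/z difference.
    unit = m["unit"] or "mz"
    fields.append(f"delta{{{m['delta']}{unit}}}")
    if m["charge"]:
        fields.append(f"charge{{{m['charge']}}}")
    return ",".join(fields)

for compact in ("y2/4.3ppm", "y4^2/4.3ppm", "b3-H2O/0.002"):
    print(compact, "->", compact_to_verbose(compact))
```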
What do others think?
I think we will likely discuss this in depth in the call this coming Friday. Would you join us, @meowcat ?
Hi, sorry I wasn't replying, I was gone last Friday. If you are still discussing this, I would participate.
Note that since this suggested format is vaguely approaching JSON in appearance and scope, another idea would be to make it JSON entirely: {peptide: b3, delta: 4.3ppm, etc}, but then strictly strings need to go in quotes: {"peptide": "b3", "delta": "4.3ppm", etc}. Perhaps a step too much. As an alternative, YAML is less strict there: [peptide: b3, delta: 4.3ppm, etc]. But proprietary is fine too, as long as we don't limit ourselves inadvertently.
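The quoting cost of strict JSON is easy to check with Python's standard json module. This is only a small sketch; the key names are taken from the example above and are not a proposed schema.

```python
import json

def is_valid_json(text):
    """Return True if text parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

# Strict JSON needs double quotes around keys and around string values.
quoted = '{"peptide": "b3", "delta": "4.3ppm"}'
unquoted = "{peptide: b3, delta: 4.3ppm}"

print(is_valid_json(quoted))    # True
print(is_valid_json(unquoted))  # False
print(json.loads(quoted)["delta"])
```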
Some notes to your enhanced suggestion:
precursor{},f{-H2O},delta{2ppm} but also smiles{c1ccccc1.[H+]},delta{2ppm}. (Note: it would have to be specified how the charged forms of SMILES should be handled.) With operators this could be precursor{}-f{H2O},smiles{c1ccccc1.[H+]},delta{2ppm}
delta{-0.001mz} relates to the annotation y7 or to the mass of the primary_peak. The point marked as primary peak (say, the apex or centroid of a profile mode peak) may have a mass shift to the correct annotation (say 1 ppm), but the profile points have a shift to the main peak. I personally would just specify what main peak the peak belongs to, since this is sufficient to categorize it.
The idea of using these predicates/operators to describe the annotations looks like a good way to break out of the conflicting "annotation style" issue between domains, and it does go some ways towards improving machine parse-ability while retaining human readability. On the other hand, it makes the common use-cases use a lot more space as we now have to explicitly tag every attribute.
I don't think it's reasonable to do away with arbitrary arithmetic expressions. While you could argue that precursor{}-f{H2O}-f{H2O} is better written precursor{}-f{H4O2}, it fails to capture scenarios where the loss is not only a formula, such as precursor{}-[Phospho] or precursor{}-f{H2O}-[Phospho]. While those complex loss scenarios aren't the majority, they aren't uncommon either.
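For what it's worth, keeping the arithmetic does not have to complicate parsing much. Here is a small Python sketch, with a hypothetical function name, that splits such an expression into signed terms at top-level + and - only, so signs inside {...} or [...] (for instance SMILES charges) are left alone.

```python
def split_terms(expression):
    """Split an ion expression into (sign, term) pairs at top-level + and -."""
    terms, depth, sign, current = [], 0, "+", []
    for ch in expression:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch in "+-" and depth == 0 and current:
            # An operator between two terms closes the current term.
            terms.append((sign, "".join(current)))
            sign, current = ch, []
        elif ch in "+-" and depth == 0:
            sign = ch  # leading sign before the first term
        else:
            current.append(ch)
    if current:
        terms.append((sign, "".join(current)))
    return terms

print(split_terms("precursor{}-f{H2O}-[Phospho]"))
print(split_terms("peptide{y2}-smiles{c1ccccc1.[H+]}"))
```

Each term can then be handed to whatever handles formulas, named modifications, or SMILES, without the splitter needing to know which is which.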
If we do use predicates, we would need to specify what each predicate means, and whether certain predicates can go "together", for instance if you use peptide can you also use smiles in the same annotation. Further, how would implementers be expected to cope with new predicates being introduced? One approach might be to define each predicate externally, and then each predicate would need to be looked up, but expressing that relational concept in an ontological format might be difficult without a well defined schema, or need the parser to be "intelligent" in a way that makes adding new predicates difficult.
One compromise would be to keep using "annotation styles" but just text-encode annotations using the predicate format instead of the compact notation, but this doesn't mandate anything for binary formats where what would be saved could be the annotation data, not necessarily the text-encoding of the annotation itself.
I do think that the extensibility idea is a good one though. I apologize for the overly negative tone of this post as it is written in haste.
I mean, in principle the predicate{} suggestion is just another annotation style. I just think it makes sense to make an annotation style that is useful for a broad range of purposes, and I think it is feasible to formulate a set of predicates that covers a lot of ground. It is then still extensible (and implementations can possibly still read and round-trip extra tags that they don't understand, just not interpret them appropriately).
Say we gather a solid base set of what we currently think is needed and call this annotation-style:core-1.0; whoever feels they need additional fields (say, for some internal fields calculated by a software suite) could make an annotation-style:extension-0.2 which inherits from annotation-style:core-1.0? If important new features emerge, they can later be incorporated into core-2.0...
(But I'm not good at the technical part of ontologies, so others might disagree on how this should be done.)
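A rough Python sketch of the round-trip idea: a reader that only knows a core set interprets those fields and carries everything else through verbatim. The contents of both sets, the vendor_score predicate, and the function names are invented for illustration; nothing here is defined by the thread or by any specification.

```python
import re

# Hypothetical predicate sets; the thread does not define their contents.
CORE_1_0 = frozenset({"peptide", "formula", "delta", "confidence", "charge"})
EXTENSION_0_2 = CORE_1_0 | {"vendor_score"}  # "inherits" everything from core

FIELD = re.compile(r"([A-Za-z_]+)\{([^}]*)\}")

def read_fields(annotation, known):
    """Split fields into interpreted ones and opaque ones kept verbatim."""
    interpreted, opaque = {}, []
    for name, value in FIELD.findall(annotation):
        if name in known:
            interpreted[name] = value
        else:
            opaque.append(f"{name}{{{value}}}")
    return interpreted, opaque

def write_fields(interpreted, opaque):
    """Write known fields first, then the untouched unknown ones."""
    parts = [f"{name}{{{value}}}" for name, value in interpreted.items()]
    return ",".join(parts + opaque)

text = "peptide{y7},delta{0.001mz},vendor_score{12.5}"
fields, extras = read_fields(text, CORE_1_0)
print(fields, extras)
print(write_fields(fields, extras) == text)
```

A core-only reader keeps vendor_score{12.5} as an opaque string, while a reader that knows EXTENSION_0_2 would interpret it. Note that this writer moves unknown fields to the end, so the original field order is only preserved when they were already last.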
> and whether certain predicates can go "together", for instance if you use peptide can you also use smiles in the same annotation.
My feeling here is that overspecifying things will not help. In the end, what is the purpose of these annotations? 1) an interpretation aid for the reader, 2) an interpretation aid for software, 1.5) an interpretation aid for the reader that is visualized by software, 3) something else? (I don't see a reason why I couldn't have peptide{y2}-smiles{c1ccccc1} for some benzene loss off some peptide.)
> but this doesn't mandate anything for binary formats where what would be saved could be the annotation data, not necessarily the text-encoding of the annotation itself.
The same goes for any other annotation (like the compact format) though; I actually see advantages for more streamlined binary serialization with the predicate format over the compact format.
> On the other hand, it makes the common use-cases use a lot more space as we now have to explicitly tag every attribute.
Yes, that's certainly true. For visualization in software this can be circumvented, but in text-format-serialized records it will stay bulky. A shorthand like p{} or even possibly no-prefix {} (e.g. {y1}) for at least the simplest peptide case might be useful.
Hi all, I see f{C6H12O6} made it into the specification - any chance we can see s{SMILES} for known substructures? Would greatly enhance the generality of the format. Otherwise we have a big gap between "peptide" and "formula" that could IMO be avoided.
@meowcat I'm not familiar enough with SMILES to say this with certainty, but I think it uses curly braces to denote charge, which may or may not make s{SMILES} too irregular for our existing regex parser. I might be able to bend the pattern around this problem, but is the charge feature used much, or would the global charge of the peak annotation be enough information?
https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
I am not aware of curly braces in SMILES. However, brackets are used frequently. Not sure if that would be a problem. Off the top of my head, the characters are A-Za-z0-9+-=#()[]@\/. (some of them only relevant for stereochemistry). Ah yes, % also for ring-closure numbers of 10 and above.
Charges are expressed like [Na+][Cl-].
I'm finding this regex, which takes into account that J doesn't appear in the periodic table. It looks right, but I don't have an authoritative answer. Google tells me that $ is for the quadruple bond. I don't remember ever seeing this in the wild.
/^([^J][A-Za-z0-9@+\-\[\]\(\)\\\/%=#$]+)$/
https://gist.github.com/lsauer/1312860/264ae813c2bd2c27a769d261c8c6b38da34e22fb#file-_smiles_inchi_annotated-js-L12
There is an extension called "ChemAxon Extended SMILES" (CxSMILES) where curly brackets are used for R-group description, but this is far outside the proposed scope.
We discussed this at the last weekly call, and agreed to add s{SMILES} support to the annotation format specification. The only special character that we have to worry about is "}" because of the regex we're using to portion up the annotation string. I had been misled by reading an old/non-authoritative reference. @hechth set the record straight.
Returning to the SMILES charge specification, we concluded that the expected net charge of the ion would be written as part of the peak annotation format, but the writer is free to specify any local charges though not all readers will know what to do with them.
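To make the "}" concern concrete, here is a minimal Python sketch. The pattern below is a hypothetical stand-in written for this example, not the regex actually used by the parser; it only shows that brackets, parentheses, and local charges pass through unharmed, while a closing brace inside the payload breaks the match.

```python
import re

# Hypothetical pattern: s{...} with an optional /delta suffix.
# The payload may contain anything except a closing brace.
SMILES_ANNOTATION = re.compile(r"^s\{([^}]+)\}(?:/(.+))?$")

def extract_smiles(annotation):
    """Return the SMILES payload of an s{...} annotation, or None."""
    match = SMILES_ANNOTATION.match(annotation)
    return match.group(1) if match else None

print(extract_smiles("s{c1ccc(cc1)O}/0.002"))  # c1ccc(cc1)O
print(extract_smiles("s{[Na+].[Cl-]}"))        # [Na+].[Cl-]
print(extract_smiles("s{C{x}C}"))              # None: "}" inside the payload
```

Local charges such as [Na+] are returned as part of the string; what a reader does with them is left open, in line with the conclusion above.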
This has been included in the mzPAF specification, currently under community review. If there are further concerns, report them based on the current version:
See working document for ongoing discussion.