Closed lauperbe closed 8 months ago
This looks like an issue on the CDK side, where it explicitly ignores mass numbers when generating the string formula.
If you update to the latest GitHub versions of rcdklibs and rcdk, the resultant formula strings are now labeled with mass numbers.
Also, generate.formula.iter is updated to take the same elements argument as generate.formula. So your updated code looks like
library(rJava)
library(reshape2)
library(stringi)
library(rcdk)
library(RMassBank)
#parent substance (known)
target_name <- "3,5-dibromo-4-hydroxybenzoic_acid_804"
target_formula <- "C7H4Br2O3"
#masses of peaks to analyze (first is monoisotopic parent, next two are isotopologues and last is a possible in-source-fragment)
target_peaks <- c(292.8454296, 294.8432039, 296.8411527, 135.0452735)
subformula <- c()
elements <- lapply(formulastring.to.list(target_formula), range, 0) # per-element c(0, count), used to limit the formula generation
for (i in names(elements)) {
  elements[[i]] <- c(i, elements[[i]]) # prepend the element symbol
}
elements[[5]] <- c("Br", 0, 2, 81) # additional entry for the 81Br isotope
results <- lapply(target_peaks, function(tp) {
  # tp is the peak mass itself, so pass it directly instead of indexing target_peaks with it
  mit <- generate.formula.iter(tp, window = 0.05, elements, charge = 1, as.string = FALSE)
  hit <- itertools::ihasNext(mit)
  as.list(hit)
})
result <- c()
for (j in seq_along(target_peaks)) {
  result <- c()
  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string = TRUE)
  hit <- itertools::ihasNext(mit)
  while (itertools::hasNext(hit)) {
    result <- iterators::nextElem(hit) # keeps only the last formula found
  }
  if (!is.null(result)) { # writes the found formula into the vector
    subformula[j] <- result
  } else {
    subformula[j] <- NA
  }
}
subformula
Thank you for your answer. This was exactly what I was looking for. Unfortunately, the output of the function is no longer compatible with further rcdk analysis. If I run the updated script, the output reads
> subformula
[1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none"
But if I now try to generate an rcdk formula object from the output via get.formula(subformula[1],1), I get the error: Error in .jcall(manipulator, "Lorg/openscience/cdk/interfaces/IMolecularFormula;", : java.lang.NullPointerException
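As a stopgap for the error above, one option might be to collapse the mass-number annotated string back into a plain formula before handing it to get.formula. The sketch below is base R only; collapse_isotopes is a hypothetical helper name, not part of rcdk, and it discards the isotope information, so anything computed from the result refers to the unannotated formula.

```r
# Hypothetical helper (not part of rcdk): collapse a string such as
# "[12C]7[1H]3[79Br][81Br][16O]3" into "C7H3Br2O3" by dropping the
# mass numbers and summing the counts per element.
collapse_isotopes <- function(formula) {
  # Each token is "[<mass number><Symbol>]" followed by an optional count.
  pattern <- "\\[([0-9]+)([A-Z][a-z]?)\\]([0-9]*)"
  tokens <- regmatches(formula, gregexpr(pattern, formula))[[1]]
  if (length(tokens) == 0) return(NA_character_) # e.g. the "none" placeholder
  symbols <- sub(pattern, "\\2", tokens)
  counts <- sub(pattern, "\\3", tokens)
  counts[counts == ""] <- "1" # a missing count means one atom
  counts <- as.integer(counts)
  # Sum counts per element, keeping the order of first appearance.
  totals <- tapply(counts, factor(symbols, levels = unique(symbols)), sum)
  # Omit the count when it is 1, as in conventional formula strings.
  paste0(names(totals), ifelse(totals == 1, "", totals), collapse = "")
}

collapse_isotopes("[12C]7[1H]3[79Br][81Br][16O]3")
# "C7H3Br2O3"
```

The collapsed string can then be passed to get.formula as before, at the cost of no longer distinguishing the three isotopologue peaks.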
Ah yes - the rest of the CDK code doesn't recognize the mass number annotated formulae. One workaround for now is to tell the generator to return formula objects rather than strings.
mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE)
So, doing this gives you something like
> subformula <- list()
> for (j in 1:length(target_peaks)){
+ result<-c()
+ mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE)
+ hit <- itertools::ihasNext(mit)
+ while (itertools::hasNext(hit))
+ result <- iterators::nextElem(hit)
+ if(is.null(result)==F){ # writes found formulae into vector
+ subformula[[j]] <- result
+ }else{
+ subformula[[j]]<- "none"
+ }
+ }
> subformula
[[1]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@67f89fa3}"
[[2]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@4ac68d3e}"
[[3]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@277c0f21}"
[[4]]
[1] "none"
You could then manipulate the formula objects using CDK classes/methods via .jcall. It's a bit clunky, but until we update the CDK side of things, this would be the best way.
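A rough sketch of what such a .jcall route could look like is given here. It assumes that library(rcdk) has already been loaded (so the JVM is running with the CDK jars) and that the static methods getString and getTotalExactMass of CDK's MolecularFormulaManipulator are available in the installed CDK version; the wrapper names cdk_formula_call, cdk_formula_string and cdk_formula_mass are made up for illustration and are not rcdk functions.

```r
# Sketch: call a static method of CDK's MolecularFormulaManipulator on a
# formula object via rJava. `f` is assumed to be one of the Java objects
# returned by generate.formula.iter(..., as.string = FALSE).
cdk_formula_call <- function(f, method, return_sig) {
  rJava::.jcall(
    "org/openscience/cdk/tools/manipulator/MolecularFormulaManipulator",
    return_sig, method,
    # Cast to the interface type that the Java method signature declares.
    rJava::.jcast(f, "org/openscience/cdk/interfaces/IMolecularFormula")
  )
}

# Formula string and total exact mass of a formula object
# (method names assumed from CDK 2.x; check against your CDK version).
cdk_formula_string <- function(f) cdk_formula_call(f, "getString", "S")
cdk_formula_mass <- function(f) cdk_formula_call(f, "getTotalExactMass", "D")

# Usage, given the `subformula` list from the loop above:
# hits <- Filter(function(x) inherits(x, "jobjRef"), subformula) # skip "none"
# sapply(hits, cdk_formula_mass)
```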
Also, on a somewhat unrelated note, the elements list looks like
> elements
$C
[1] "C" "0" "7"
$H
[1] "H" "0" "4"
$Br
[1] "Br" "0" "2"
$O
[1] "O" "0" "3"
[[5]]
[1] "Br" "0" "2" "81"
For the entries where mass number is not specified, is it expected or assumed that the major isotope is to be used?
For the not specified mass numbers, the major isotope is assumed and cdk also uses it like this.
Thanks for the answers
How is the major isotope defined? The isotope of maximum intensity, or a la the InChI definition (the rounded average atomic mass of the element) – which is not the same and leads to confusion for elements such as Se and Sn? For Sn this would mean either 120Sn (max abundance of 32%) or 119Sn (rounded from average mass of 118.71) is the reference. Is there a way to make the isotope explicit to avoid confusion/assumptions?
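To make the difference between the two definitions concrete, here is a small base R sketch for Se and Sn. The abundances are approximate natural abundances typed in by hand, and mass numbers are used in place of exact isotopic masses when averaging, which is a simplification that does not change the rounded result for these two elements.

```r
# Approximate natural abundances (percent), keyed by mass number.
abundance <- list(
  Se = c("74" = 0.89, "76" = 9.37, "77" = 7.63, "78" = 23.77,
         "80" = 49.61, "82" = 8.73),
  Sn = c("112" = 0.97, "114" = 0.66, "115" = 0.34, "116" = 14.54,
         "117" = 7.68, "118" = 24.22, "119" = 8.59, "120" = 32.58,
         "122" = 4.63, "124" = 5.79)
)

compare_definitions <- function(ab) {
  mass_number <- as.numeric(names(ab))
  c(
    # Definition 1: the isotope with the highest abundance.
    most_abundant = mass_number[which.max(ab)],
    # Definition 2: the rounded abundance-weighted average
    # (mass numbers stand in for exact isotopic masses here).
    rounded_average = round(sum(mass_number * ab) / sum(ab))
  )
}

definitions <- sapply(abundance, compare_definitions)
definitions
```

For both elements the two rows differ by one mass unit (80 vs 79 for Se, 120 vs 119 for Sn), which is exactly the ambiguity raised above.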
@lauperbe you may also want to check out how compatible this all is with enviPat. This surely handles isotopes because they have to; we should try and get rcdk and enviPat annotating formulas in consistent ways that follow clearly defined chemical conventions if possible (a lot of my/our code uses both packages)
(> subformula [1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )
If I can jump in here, I also thought the standard notation for identifying less common isotopes is an integer value wrapped in square brackets prefixing the elemental symbol, e.g. in the example
[12C]7[1H]3[79Br]2[16O]3
as 79Br is the only element with a less common isotope, this should be implicitly written as
C7H3[79]Br2O3
enviPat (and also commercial MS software packages with formula generators / interpreters) recognizes this notation. They do not recognize C7H3[79Br]2O3
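Converting between the two notations is mostly a matter of moving the element symbol out of the brackets. The following base R sketch does that with a regular expression; to_prefix_notation and its implicit argument are hypothetical names, and which isotopes are left unannotated is passed in by the caller rather than looked up anywhere.

```r
# Rewrite "[79Br]2" (mass number and symbol inside the brackets) as
# "[79]Br2" (mass number only inside the brackets). Isotope labels listed
# in `implicit` (e.g. "12C") lose their annotation entirely.
to_prefix_notation <- function(formula, implicit = character(0)) {
  for (iso in implicit) {
    symbol <- sub("^[0-9]+", "", iso) # "12C" -> "C"
    formula <- gsub(paste0("[", iso, "]"), symbol, formula, fixed = TRUE)
  }
  gsub("\\[([0-9]+)([A-Z][a-z]?)\\]", "[\\1]\\2", formula)
}

to_prefix_notation("[12C]7[1H]3[79Br]2[16O]3")
# "[12]C7[1]H3[79]Br2[16]O3"

to_prefix_notation("[12C]7[1H]3[79Br]2[16O]3", implicit = c("12C", "1H", "16O"))
# "C7H3[79]Br2O3"
```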
thanks Tony
-- Dr. Tony R. Larson Head of Metabolomics & Proteomics, Department of Biology, Area 15 University of York Wentworth Way Heslington York YO10 5DD UK
Tel: +44(0)1904 328 733 (office) Tel: +44(0)7833 471 685 (mobile)
tony.larson@york.ac.uk http://scholar.google.com/citations?user=9hLFka4AAAAJ www.york.ac.uk/biology/technology-facility/proteomics/ www.york.ac.uk/mass-spectrometry
@trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague to ask whether he knows of any (as we just debated this at great length for the InChI specs)
@ChemConnector we need the ACS Style guide open! ... I can't find info in Wikipedia and the InChI specs don't cover this ... This does not cover computational representations for formulae well as far as I can see: https://en.wikipedia.org/wiki/Chemical_formula
No, sorry - I am primarily a small molecule biologist who uses various packages to annotate monoisotopic MS data. All the software I've come across uses the [i]symbol notation for less common isotopes - so this is at least pragmatically the best way!
It would be nice if rcdk recognized these as inputs for mass calculations, but AFAIK the underlying cdk java code assumes natural isotope abundance distributions for input elemental symbols to calculate monoisotopic and average masses. This would need some tweaking to effectively hard-set the isotope distributions. Going the other way (i.e. defining isotope limits to generate a formula from an input mass, with outputs containing annotated isotope symbols) is relatively easier given the underlying cdk java code already does those calculations, and Raj recently very helpfully updated the rcdk code to incorporate this.
Tony
Thanks for all the feedback - especially since the usage you're all discussing is pretty far from my expertise!
If you can point me to docs regarding standardized (or even commonly accepted) format for mass number annotation in a formula, I can look at that on the CDK side. Personally, the current version (adding mass number to every element) is ugly, and it appears to also be incompatible.
@schymane re the definition of major isotope, one way around it is to manually specify the desired mass number in the element list, which forces rcdk to employ that specific isotope, rather than go with a major isotope (however that is defined)
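As an illustration of specifying mass numbers explicitly, the base R sketch below builds the element list from a formula string and then replaces the unannotated entry of one element with one entry per requested mass number. element_limits and pin_isotopes are hypothetical helper names; the list layout (symbol, min, max, optional mass number) is taken from the element list shown in this thread, and how rcdk treats an explicit entry for the default isotope has not been checked here.

```r
# Build the element-limit list from a plain formula string:
# one entry per element of the form c(symbol, min, max).
element_limits <- function(formula) {
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  symbols <- sub("[0-9]+$", "", tokens)
  counts <- sub("^[A-Za-z]+", "", tokens)
  counts[counts == ""] <- "1" # a missing count means one atom
  stats::setNames(Map(function(s, n) c(s, "0", n), symbols, counts), symbols)
}

# Replace the unannotated entry for `symbol` with one entry per mass
# number, so that no isotope is left to the "major isotope" default.
pin_isotopes <- function(elements, symbol, mass_numbers) {
  base <- elements[[symbol]]
  pinned <- lapply(mass_numbers, function(m) c(base, as.character(m)))
  c(elements[names(elements) != symbol], pinned)
}

elements <- pin_isotopes(element_limits("C7H4Br2O3"), "Br", c(79, 81))
elements
```

The resulting list has entries for C, H and O plus two bromine entries carrying "79" and "81" as a fourth field.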
I will post any documentation if/when I find any :-) Re: “however it is defined” … if it is clearly defined/documented somewhere in the CDK how it is defined, that’s fine (I find it very difficult to navigate CDK documentation…). Otherwise when I get a chance I will do some number crunching with the tricky cases (Sn, Se) and see what happens … it would be nice not to have to define things explicitly every time. If you add some simple cases to your documentation I can try out some trickier cases and add them to your rcdk documentation if/once I get the chance so it’s clear.
Btw thanks for getting this moving, we’ve been doing workaround functions for years, but now that we really have to handle isotopes properly and now that CDK2.0 is out I think it’s a great chance to get this fixed!
My bad - the CDK Javadocs do define what the major isotope is. See here
Returns the most abundant (major) isotope with a given atomic number.
The isotope's abundance is for atoms with atomic number 60 and smaller defined
as a number that is proportional to the 100 of the most abundant isotope. For atoms
with higher atomic numbers, the abundance is defined as a percentage.
Just checked with enviPat: it uses square brackets before the element symbol for non-major isotopes and no brackets for major isotopes. It also always needs an atom count, even if it is 1. Ex: [15]N1H3
But in enviPat one can always define new isotopes with whatever nomenclature one wants by simply appending to their isotope list.
Emma asked me if I had any thoughts.
My only recommendation for mass calculations is that "[##]Ee" refer to specific isotope mass and that "Ee" refer to elemental natural abundance mass. If done this way, then a combination of exact isotope mass and/or elemental natural abundance mass can be used to calculate molecular mass.
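A minimal base R sketch of that convention, restricted to bromine: a bracketed term resolves to the mass of the named isotope, an unbracketed term to the abundance-weighted mass. The isotope masses and abundances are approximate literature values typed in by hand, and term_mass is a hypothetical helper, not an rcdk or enviPat function.

```r
# Bromine isotope data: exact masses (u) and approximate natural abundances.
br <- data.frame(
  mass_number = c(79, 81),
  exact_mass = c(78.9183376, 80.9162897),
  abundance = c(0.5069, 0.4931)
)

# "[##]Ee" -> mass of that specific isotope;
# "Ee"     -> natural-abundance (weighted average) mass.
term_mass <- function(term, isotopes = br) {
  m <- regmatches(term, regexec("^\\[([0-9]+)\\]", term))[[1]]
  if (length(m) == 2) {
    isotopes$exact_mass[isotopes$mass_number == as.numeric(m[2])]
  } else {
    sum(isotopes$exact_mass * isotopes$abundance) / sum(isotopes$abundance)
  }
}

term_mass("[79]Br") # specific isotope
term_mass("[81]Br") # specific isotope
term_mass("Br")     # natural-abundance mass, close to 79.904
```

Summing such terms over a formula would then mix exact isotope masses and natural-abundance masses as described.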
Thanks @hunter-moseley - If we know the exact assumption that CDK uses for defining the major isotope, then surely we can do both natural abundance and exact isotope mass implicitly if [##] is missing for the major isotope? It will be rather ugly to have to deal with explicitly-defined numbers in every formula...and this is not something I would like to e.g. see annotated in MassBank records, ideally we'd be able to have a compact and readable molecular formula / fragment annotation (and hide the details behind the scenes) See e.g. PK$ANNOTATION here: https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=AU169406&dsn=UOA
@rajarshi I find the CDK's definition rather strange re 100 vs % above atomic number 60 ... can't quite visualize the consequences but haven't had a chance to crunch the numbers. Is there a reason for such a disjoint definition? Does the >atomic number 60 definition overlap with the way it is defined here? http://www.sisweb.com/referenc/source/exactmas.htm
The original display chosen by @rajarshi is consistent with the SMILES annotation ... but I think we should still aim for consistency between other software approaches? The square brackets in SMILES capture different/additional information in a different way that is not relevant to us. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
Smiles | Name |
---|---|
[12C] | carbon-12 |
[13C] | carbon-13 |
[C] | carbon (unspecified mass) |
[13CH4] | C-13 methane |
@schymane re the CDK definition of major isotope - I unfortunately don't know why it was chosen. The Javadocs indicate it was written by Chris Steinbeck, so I guess he could shed some light.
Re the string representation - yes, the SMILES approach influenced me, but given that molecular formula strings are not SMILES, I don't think we have to stick to that, and would rather go with the more accepted representation used by this community.
Interestingly, looking at the Java sources suggests that the major isotope is simply the most abundant isotope for any element (and no consideration is made for atomic numbers < 60 or > 60).
The problem is that "major" isotope loses some of its meaning when the percentage drops below 50%. Take molybdenum for instance: https://en.wikipedia.org/wiki/Isotopes_of_molybdenum . I suggest that you have two interpretations of mass based on the definitions of "nominal mass" and "most abundant mass".
Definition of "nominal mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Nominal_mass This would be in contrast to the definition of "most abundant mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Most_abundant_mass
So currently, CDK's getMajorIsotope corresponds to the Most abundant mass definition. I guess we'd have to add an annotation for stable isotopes to be able to return the Nominal mass result.
For the case of Mo, it seems that the nominal and most abundant masses correspond to the same isotope?
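For a single atom the two definitions linked above select the same isotope, so the difference only shows up for formulas with several atoms. The base R sketch below illustrates this with the two bromine atoms of the compound in this thread, using an approximate 81Br abundance and nominal (integer) masses only.

```r
# Isotopologue distribution for n bromine atoms:
# binomial in the number of 81Br atoms.
p81 <- 0.4931 # approximate natural abundance of 81Br
n_br <- 2
k <- 0:n_br   # number of 81Br atoms in the isotopologue
probability <- dbinom(k, size = n_br, prob = p81)
nominal <- 79 * (n_br - k) + 81 * k

data.frame(n_81Br = k, nominal_mass_Br2 = nominal,
           probability = round(probability, 4))

nominal[1]                      # nominal mass: every atom as its most abundant isotope
nominal[which.max(probability)] # most abundant isotopologue
```

Here the nominal-mass definition gives 158 (two 79Br), while the most abundant isotopologue is 79Br81Br at 160, in line with the M+2 peak being the most intense of the three parent peaks.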
Hi Rcdk team
I have a small question regarding isotope annotation for generate.formula.iter().
I want to annotate possible formulae to MS peaks, limited by the number of atoms of the parent compound. If I use the attached example without adding to the element list:
[[2]] [1] "H" "0" "4"
[[3]] [1] "Br" "0" "2"
[[4]] [1] "O" "0" "3"
Using this list, the formula of the monoisotopic peak (M-H) at 292.8454296 can easily be annotated as "C7H3Br2O3" by the generate.formula.iter() function. The problem is that the M+2 peak at 294.8432039 is not annotated, as the [81Br] isotope is not in the list.
If I modify the list to 81Br:
[[2]] [1] "H" "0" "4"
[[3]] [1] "Br" "0" "2" "81"
[[4]] [1] "O" "0" "3"
I can only annotate the peak with 2 81Br atoms.
If I add an additional line with the 81Br (as shown in the example):
[[2]] [1] "H" "0" "4"
[[3]] [1] "Br" "0" "2"
[[4]] [1] "O" "0" "3"
[[5]] [1] "Br" "0" "2" "81"
I can annotate all 3 peaks as "C7H3Br2O3". Unfortunately, the annotation makes no difference between 79Br and 81Br with regard to the symbol.
My question is now whether there is a way (or whether a way could be created) to save the isotope entry in the list with a different symbol (like [81Br]), so as to be able to differentiate between the annotated isotopes.
Thank you in advance Benedikt Lauper Eawag Dübendorf Uchem