CDK-R / cdkr

Integrating R and the CDK
https://cdk-r.github.io/cdkr/
42 stars 27 forks source link

Rcdk isotopes in generate.formula.iter() #65

Closed lauperbe closed 3 months ago

lauperbe commented 6 years ago

Hi Rcdk team

I have a small question regarding isotope annotation for generate.formula.iter().

If I want to annotate possible formulae to MS peaks, limited by the number of atoms of the parent compound. If I use the attached example without adding to the element list :

elements [[1]] [1] "C" "0" "7"

[[2]] [1] "H" "0" "4"

[[3]] [1] "Br" "0" "2"

[[4]] [1] "O" "0" "3"

Using this list, the formula of the monoisotopic peak (M-H) at 292.8454296 can easily be annotated as "C7H3Br2O3" by the generate.formula.iter() function. The problem is, that the M+2 peak at 294.8432039 is not annotated as the [81Br] isotope is not in the list.

if i modify the list to Br81:

elements [[1]] [1] "C" "0" "7"

[[2]] [1] "H" "0" "4"

[[3]] [1] "Br" "0" "2" "81"

[[4]] [1] "O" "0" "3"

I can only annotate the peak with 2 81Br atoms.

If I add an additional line with the 81Br (as shown in the example):

elements [[1]] [1] "C" "0" "7"

[[2]] [1] "H" "0" "4"

[[3]] [1] "Br" "0" "2"

[[4]] [1] "O" "0" "3"

[[5]] [1] "Br" "0" "2" "81"

I can annotate all 3 peaks as "C7H3Br2O3". Unfortunately, the annotation makes no difference between 79Br and 81Br in regard to the symbol.

My question is now, if the is a way (or if a way could be created), to safe the isotope entry in the list with a different symbol (like [81Br]) so as to be able to differentiate between the annotated isotopes.

Thank you in advance Benedikt Lauper Eawag Dübendorf Uchem

lauperbe commented 6 years ago

minimal_working_example.txt

rajarshi commented 6 years ago

This looks like an issue on the CDK side, where it explicitly ignores mass numbers when generating the string formula

rajarshi commented 6 years ago

If you update to the latest github versions of rcdklibs and rcdk, the resultant formulae strings are now labeled with mass numbers.

Also generate.formula.iter is updated to take the same elements argument as generate.formula. So your updated code looks like

library(rJava)
library(reshape2)
library(stringi)
library(rcdk)
library(RMassBank)

#parent substance (known)
target_name <- "3,5-dibromo-4-hydroxybenzoic_acid_804"
target_formula <- "C7H4Br2O3"

#masses of peaks to analyze (first is monoisotopic parent, next two are isotopologues and last is a possible in-source-fragment)
target_peaks <- c(292.8454296, 294.8432039, 296.8411527, 135.0452735)

subformula <- c()
elements <- lapply(formulastring.to.list(target_formula), range, 0)   #gives me a list to limit the formula generation

for (i in names(elements)) {
    tmp <- elements[[i]]
    tmp <- c(i, tmp)
    elements[[i]] <- tmp
}

elements[[5]] <- c("Br",0,2,81)

results <- lapply(target_peaks, function(tp) {
    mit <- generate.formula.iter(target_peaks[tp], window = 0.05, elements, charge = 1, as.string=FALSE) 
    hit <- itertools::ihasNext(mit)
    as.list(hit)
})

result <- c()
for (j in 1:length(target_peaks)){
  result<-c()
  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=TRUE) 
  hit <- itertools::ihasNext(mit)
  while (itertools::hasNext(hit))
    result <-  iterators::nextElem(hit) 

  if(!is.null(result)){ # writes found formulae into vector
    subformula[j] <- result
  }else{subformula[j]<-NA}

}
subformula
lauperbe commented 6 years ago

Thank you for your answer. This was exactly what I was looking for. Unfortunately, the output of the function is now not compatible anymore with further rcdk analysis. If I run the updated script, the output reads (> subformula [1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )

But if I now try to generate a Rcdk formula element from the output via get.formula(subformula[1],1) I get the error: Error in .jcall(manipulator, "Lorg/openscience/cdk/interfaces/IMolecularFormula;", : java.lang.NullPointerException

minimal_working_example_updated_isotopes.txt

rajarshi commented 6 years ago

Ah yes - the rest of the CDK code doesn't recognize the mass number annotated formulae. One workaround for now is to tell the generator to return formula objects rather than strings.

  mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE) 

So, doing this gives you something like

> subformula <- list()
> for (j in 1:length(target_peaks)){
+   result<-c()
+   mit <- generate.formula.iter(target_peaks[j], window = 0.01, elements, charge = 1, as.string=FALSE) 
+   hit <- itertools::ihasNext(mit)
+   while (itertools::hasNext(hit))
+     result <-  iterators::nextElem(hit) 
+   if(is.null(result)==F){ # writes found formulae into vector
+       subformula[[j]] <- result
+   }else{
+       subformula[[j]]<- "none"
+   }
+ }
> subformula
[[1]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@67f89fa3}"

[[2]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@4ac68d3e}"

[[3]]
[1] "Java-Object{org.openscience.cdk.formula.MolecularFormula@277c0f21}"

[[4]]
[1] "none"

You could then manipulate the formula objects using CDK classes/methods via .jcall. It's a bit klunky, but until we update the CDK side of things, this would be the best way

rajarshi commented 6 years ago

Also, on a somewhat unrelated note, the elements list looks like

> elements
$C
[1] "C" "0" "7"

$H
[1] "H" "0" "4"

$Br
[1] "Br" "0"  "2" 

$O
[1] "O" "0" "3"

[[5]]
[1] "Br" "0"  "2"  "81"

For the entries where mass number is not specified, is it expected or assumed that the major isotope is to be used?

lauperbe commented 6 years ago

For the not specified mass numbers, the major isotope is assumed and cdk also uses it like this.

Thanks for the answers

schymane commented 6 years ago

How is the major isotope defined? The isotope of maximum intensity, or a la the InChI definition (the rounded average atomic mass of the element) – which is not the same and leads to confusion for elements such as Se and Sn? For Sn this would mean either 120Sn (max abundance of 32%) or 119Sn (rounded from average mass of 118.71) is the reference. Is there a way to make the isotope explicit to avoid confusion/assumptions?

schymane commented 6 years ago

@lauperbe you may also want to check out how compatible this all is with enviPat. This surely handles isotopes because they have to; we should try and get rcdk and enviPat annotating formulas in consistent ways that follow clearly defined chemical conventions if possible (a lot of my/our code uses both packages)

(> subformula [1] "[12C]7[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )

trljcl commented 6 years ago

If I can jump in here, I also thought the standard notation for identifying less common isotopes is an integer value wrapped in square brackets prefixing the elemental symbol, e.g. in the example

[12C]7[1H]3[79Br]2[16O]3

as 79Br is the only element with a less common isotope, this should be implicitly written as

C7H3[79]Br2O3

enviPat and (and also commercial MS software packages with formula generators / interpreters) recognize this. They do not recognize C7H3[79Br]2O3

thanks Tony

On 20 April 2018 at 08:05, Emma Schymanski notifications@github.com wrote:

@lauperbe you may also want to check out how compatible this all is with enviPat. This surely handles isotopes because they have to; we should try and get rcdk and enviPat annotating formulas in consistent ways that follow clearly defined chemical conventions if possible (a lot of my/our code uses both packages)

(> subformula [1] https://maps.google.com/?q=1%5D++%5B12C%5D7&entry=gmail&source=g" [12C]7 https://maps.google.com/?q=1%5D++%5B12C%5D7&entry=gmail&source=g[1H]3[79Br]2[16O]3" "[12C]7[1H]3[79Br][81Br][16O]3" "[12C]7[1H]3[81Br]2[16O]3" "none" )

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rajarshi/cdkr/issues/65#issuecomment-383001848, or mute the thread https://github.com/notifications/unsubscribe-auth/ADeo0l6Kc0RmClUGILEpi7s7zmY_W2Ynks5tqYi5gaJpZM4TWWKm .

-- Dr. Tony R. Larson Head of Metabolomics & Proteomics, Department of Biology, Area 15 University of York Wentworth Way Heslington York YO10 5DD UK

Tel: +44(0)1904 328 733 (office) Tel: +44(0)7833 471 685 (mobile)

tony.larson@york.ac.uk http://scholar.google.com/citations?user=9hLFka4AAAAJ www.york.ac.uk/biology/technology-facility/proteomics/ www.york.ac.uk/mass-spectrometry

schymane commented 6 years ago

@trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague if he knows any (as we just debated this at great length for InChI specs)

schymane commented 6 years ago

@ChemConnector we need the ACS Style guide open! ... I can't find info in Wikipedia and the InChI specs don't cover this ... This does not cover computational representations for formulae well as far as I can see: https://en.wikipedia.org/wiki/Chemical_formula

trljcl commented 6 years ago

No, sorry - I am a primarily a small molecule biologist who uses various packages to annotate monoisotopic MS data. All software I've come across use the [i]symbol notation for less common isotopes - so this is at least pragmatically the best way!

It would be nice if rcdk recognized these as inputs for mass calculations, but AFAIK the underlying cdk java code assumes natural isotope abundance distributions for input elemental symbols to calculate monoisotopic and average masses. This would need some tweaking to effectively hard-set the isotope distributions. Going the other way (i.e. defining isotope limits to generate a formula from an input mass, with outputs containing annotated isotope symbols) is relatively easier given the underlying cdk java code already does those calculations, and Raj recently very helpfully updated the rcdk code to incorporate this.

Tony

On 20 April 2018 at 08:37, Emma Schymanski notifications@github.com wrote:

@trljcl https://github.com/trljcl thanks for jumping in; have you found any links to information defining the actual conventions? I have just emailed a colleague if he knows any (as we just debated this at great length for InChI specs)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rajarshi/cdkr/issues/65#issuecomment-383010136, or mute the thread https://github.com/notifications/unsubscribe-auth/ADeo0oIA4WkEon-d3s0AvT95YMWYT6Miks5tqZBBgaJpZM4TWWKm .

-- Dr. Tony R. Larson Head of Metabolomics & Proteomics, Department of Biology, Area 15 University of York Wentworth Way Heslington York YO10 5DD UK

Tel: +44(0)1904 328 733 (office) Tel: +44(0)7833 471 685 (mobile)

tony.larson@york.ac.uk http://scholar.google.com/citations?user=9hLFka4AAAAJ www.york.ac.uk/biology/technology-facility/proteomics/ www.york.ac.uk/mass-spectrometry

rajarshi commented 6 years ago

Thanks for all the feedback - especially since the usage you're all discussing is pretty far from my expertise!

If you can point me to docs regarding standardized (or even commonly accepted) format for mass number annotation in a formula, I can look at that on the CDK side. Personally, the current version (adding mass number to every element) is ugly, and it appears to also be incompatible.

@schymane re the definition of major isotope, one way around it is to manually specify the desired mass number in the element list, which forces rcdk to employ that specific isotope, rather than go with a major isotope (however that is defined)

schymane commented 6 years ago

I will post any documentation if/when I find any :-) Re: “however it is defined” … if it is clearly defined/documented somewhere in the CDK how it is defined, that’s fine (I find it very difficult to navigate CDK documentation…). Otherwise when I get a chance I will do some number crunching with the tricky cases (Sn, Se) and see what happens … it would be nice not to have to define things explicitly every time. If you add some simple cases to your documentation I can try out some trickier cases and add them to your rcdk documentation if/once I get the chance so it’s clear.

Btw thanks for getting this moving, we’ve been doing workaround functions for years, but now that we really have to handle isotopes properly and now that CDK2.0 is out I think it’s a great chance to get this fixed!

rajarshi commented 6 years ago

My bad - the CDK Javadocs do define what the major isotope is. See here

Returns the most abundant (major) isotope with a given atomic number.
The isotope's abundance is for atoms with atomic number 60 and smaller defined 
as a number that is proportional to the 100 of the most abundant isotope. For atoms 
with higher atomic numbers, the abundance is defined as a percentage.
lauperbe commented 6 years ago

Just checked with enviPat: It uses the notation of []Brackets before the element symbols for none-main Isotopes and no brackets for major isotopes. It also always needs an atom count, even if it is 1. Ex: [15]N1H3

But in enviPat one can always define new isotopes with whatever nomenclature one wants by simply appending to their isotope list.

hunter-moseley commented 6 years ago

Emma asked me if I had any thoughts.

My only recommendation for mass calculations is that "[##]Ee" refer to specific isotope mass and that "Ee" refer to elemental natural abundance mass. If done this way, then a combination of exact isotope mass and/or elemental natural abundance mass can be used to calculate molecular mass.

schymane commented 6 years ago

Thanks @hunter-moseley - If we know the exact assumption that CDK uses for defining the major isotope, then surely we can do both natural abundance and exact isotope mass implicitly if [##] is missing for the major isotope? It will be rather ugly to have to deal with explicitly-defined numbers in every formula...and this is not something I would like to e.g. see annotated in MassBank records, ideally we'd be able to have a compact and readable molecular formula / fragment annotation (and hide the details behind the scenes) See e.g. PK$ANNOTATION here: https://massbank.eu/MassBank/jsp/RecordDisplay.jsp?id=AU169406&dsn=UOA

@rajarshi I find the CDK's definition rather strange re 100 vs % above atomic number 60 ... can't quite visualize the consequences but haven't had a chance to crunch the numbers. Is there a reason for such a disjoint definition? Does the >atomic number 60 definition overlap with the way it is defined here? http://www.sisweb.com/referenc/source/exactmas.htm

schymane commented 6 years ago

The original display chosen by @rajarshi is consistent with the SMILES annotation ... but I think we should still aim for consistency between other software approaches? The square brackets in SMILES capture different/additional information in a different way that is not relevant to us. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Smiles Name
[12C] carbon-12
[13C] carbon-13
[C] carbon (unspecified mass)
[13CH4] C-13 methane
rajarshi commented 6 years ago

@schymane re the CDK definition of major isotope - I unfortunately don't know why it was chosen. The Javadocs indicate it was written by Chris Steinbeck, so I guess he could shed some light.

Re the string representation - yes, the SMILES approach influenced me, but given that molecular formulae strings are not SMILES, I don't think we have to be stick to that, and rather go with the more accepted representation used by this community.

rajarshi commented 6 years ago

Interestingly, looking at the Java sources suggests that the major isotope is simply the most abundant isotope for any element (and no consideration is made for atomic numbers < 60 or > 60).

hunter-moseley commented 6 years ago

The problem is that "major" isotope loses some of its meaning when the percentage drop below 50%. Take molybdenum for instance: https://en.wikipedia.org/wiki/Isotopes_of_molybdenum . I suggest that you have two interpretation of mass based on the definitions of "nominal mass" and "most abundant mass".

Definition of "nominal mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Nominal_mass This would be in contrast to the definition of "most abundant mass": https://en.wikipedia.org/wiki/Mass_(mass_spectrometry)#Most_abundant_mass

rajarshi commented 6 years ago

So currently, CDK's getMajorIsotope corresponds to the Most abundant mass definition. I guess we'd have to add an annotation for stable isotopes to be able to return the Nominal mass result.

For the case of Mo, it seems that the nominal and most abundant masses correspond to the same isotope?