CDK-R / cdkr

Integrating R and the CDK
https://cdk-r.github.io/cdkr/

support for CDK's MolecularFormulaGenerator #55

egonw closed 6 years ago

egonw commented 7 years ago

With an R API something like:

> elementRanges <- matrix(c("C", 1, 10, "H", 0, 22), ncol=3, byrow=TRUE)
> formulas <- rcdk.getFormulas(mass, tolerance, elementRanges)

The CDK API is described in https://github.com/cdk/cdk-paper-3/blob/master/formula_generator_benchmark/CDK/CDKFormulaGeneratorCLI.java

rajarshi commented 7 years ago

A first version is now implemented in master and uses the RoundRobinFormulaGenerator class. To guard against excessively large lists of formulae being generated, the method returns an iterator, so you must iterate to retrieve the formulae. By default the formulae are returned as strings, but IMolecularFormula objects can be returned instead if desired (as.string=FALSE).
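To illustrate why returning an iterator helps here, a minimal base-R sketch of the pattern follows. It is not rcdk code: `make_alkane_iter` and `take` are hypothetical helpers, and the alkane strings are only a stand-in for real generator output. Each string is built only when requested, so asking for three items never materialises the full list.

```r
# Hypothetical stand-in for a lazy formula source (NOT the rcdk implementation).
# Each call to next_elem() builds one string on demand.
make_alkane_iter <- function(max_carbons) {
  n <- 0
  list(
    has_next = function() n < max_carbons,
    next_elem = function() {
      n <<- n + 1
      paste0("C", n, "H", 2 * n + 2)
    }
  )
}

# Pull at most `n` items from an iterator and stop early.
take <- function(it, n) {
  out <- character(0)
  while (length(out) < n && it$has_next()) {
    out <- c(out, it$next_elem())
  }
  out
}

it <- make_alkane_iter(1e6)   # a million candidates, none built yet
first_three <- take(it, 3)
print(first_three)
```

The same early-exit idea should carry over to the real iterator via `ihasNext`/`hasNext`/`nextElem` from itertools.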

The function takes element ranges as a named list of (min, max) counts. It is defined as:

generate.formula2 <- function(mass, window = 0.01,
                              elements = list(
                                C=c(0,50),
                                H=c(0,50),
                                N=c(0,50),
                                O=c(0,50),
                                S=c(0,50)),
                              validation = FALSE,
                              charge = 0.0,
                              as.string=TRUE)

and example usage is

library(rcdk)
library(itertools)
it <- generate.formula2(200, as.string=TRUE)
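## get all formulae in one go (enumerate() pairs each value with its index)
forms <- as.list(enumerate(it))
## or iterate manually; a fresh iterator is needed since the first is exhausted
it <- generate.formula2(200, as.string=TRUE)
hit <- ihasNext(it)
while (hasNext(hit))
  print(nextElem(hit))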

Probably should come up with a better name. Also need to add docs.
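For anyone starting from the matrix layout proposed at the top of this issue, here is a minimal base-R sketch of converting it to the named-list form taken by the `elements` argument. `ranges_to_list` is a hypothetical helper and not part of rcdk. Note that the matrix is a character matrix, since it mixes symbols and numbers, so the bounds need converting back to numeric.

```r
# Element ranges in the matrix layout: one row per element (symbol, min, max).
element_ranges <- matrix(c("C", 1, 10, "H", 0, 22), ncol = 3, byrow = TRUE)

# Hypothetical helper: turn each row into a numeric c(min, max) vector,
# named by the element symbol in column 1.
ranges_to_list <- function(m) {
  bounds <- lapply(seq_len(nrow(m)), function(i) as.numeric(m[i, 2:3]))
  names(bounds) <- m[, 1]
  bounds
}

elements <- ranges_to_list(element_ranges)
str(elements)
```

The resulting list has the same shape as the default `elements` value in the signature above, so it could presumably be passed as `elements = elements`.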

rajarshi commented 7 years ago

As for speed-up, I'm seeing ~30x compared to the current method (generate.formula):

library(rcdk)
library(itertools)
f1 <- function(n) {
  it <- generate.formula2(n, as.string=FALSE)
  as.list(enumerate(it))
}
f2 <- function(n) generate.formula(n)
n <- 500
mean(replicate(10, system.time(f1(n))[3]),trim=0.05)
mean(replicate(10, system.time(f2(n))[3]),trim=0.05)
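The timing pattern above can be wrapped in a small helper. This is a sketch in base R, where `time_elapsed`, `slow` and `fast` are hypothetical names and the two workloads are stand-ins so that the block runs without rcdk. One detail worth knowing: `mean(x, trim = 0.05)` drops `floor(length(x) * 0.05)` values from each end, so with 10 replicates nothing is trimmed. The sketch uses 20 replicates so that one value is dropped from each end.

```r
# Hypothetical helper: trimmed mean of elapsed wall-clock time over `reps` runs.
time_elapsed <- function(f, reps = 20, trim = 0.05) {
  times <- replicate(reps, system.time(f())[["elapsed"]])
  mean(times, trim = trim)
}

# Stand-in workloads; with rcdk loaded these could be replaced by
# function() f1(500) and function() f2(500) from the comment above.
slow <- function() for (i in 1:2e5) sqrt(i)
fast <- function() sqrt(1:2e5)

t_slow <- time_elapsed(slow)
t_fast <- time_elapsed(fast)
print(c(slow = t_slow, fast = t_fast))
```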