Closed by egonw 6 years ago
A first version is now implemented in master and uses the RoundRobinFormulaGenerator class. To guard against excessively large lists of formulae being generated, the method returns an iterator, so you must iterate to retrieve the formulae. By default the formulae are returned as strings, but IMolecularFormula objects can be returned instead if desired (as.string=FALSE).
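To illustrate why returning an iterator helps, here is a minimal base R sketch of lazy enumeration. It is not the rcdk or CDK implementation (that uses RoundRobinFormulaGenerator); `make_formula_iter`, `next_formula` and `max_count` are hypothetical names, and the search is a plain brute force over C, H and O only.

```r
# Hypothetical helper, not part of rcdk: yields matching C/H/O formulae
# one at a time instead of building the full list up front.
make_formula_iter <- function(mass, window = 0.01, max_count = 50) {
  # Monoisotopic masses of the most abundant isotopes
  m <- c(C = 12.0, H = 1.00782503, O = 15.99491462)
  state <- c(C = 0L, H = 0L, O = 0L)  # next combination to test
  done <- FALSE

  advance <- function() {
    # Odometer-style increment: O changes fastest, then H, then C
    for (el in c("O", "H", "C")) {
      if (state[[el]] < max_count) {
        state[[el]] <<- state[[el]] + 1L
        return(invisible(TRUE))
      }
      state[[el]] <<- 0L
    }
    done <<- TRUE
    invisible(FALSE)
  }

  next_formula <- function() {
    while (!done) {
      cur <- state
      advance()
      if (sum(cur) > 0 && abs(sum(cur * m) - mass) <= window) {
        return(paste0(names(cur)[cur > 0], cur[cur > 0], collapse = ""))
      }
    }
    NULL  # search space exhausted
  }

  list(next_formula = next_formula)
}

# Pull results one by one; the caller can stop at any point
it <- make_formula_iter(180.0634, window = 0.01, max_count = 20)
while (!is.null(f <- it$next_formula())) print(f)
```

Only the current element counts are held in memory, so a caller who needs just the first few hits never pays for the rest.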
The function takes element ranges as a list. The function is defined as
generate.formula2 <- function(mass, window = 0.01,
elements = list(
C=c(0,50),
H=c(0,50),
N=c(0,50),
O=c(0,50),
S=c(0,50)),
validation = FALSE,
charge = 0.0,
as.string=TRUE)
and example usage is
library(rcdk)
library(itertools)
it <- generate.formula2(200, as.string=TRUE)
## get all formulae in one go
forms <- as.list(enumerate(it))
## manually iterate
it <- generate.formula2(200, as.string=TRUE)
hit <- ihasNext(it)
while (hasNext(hit))
print(nextElem(hit))
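The element ranges can also be narrowed from the defaults. The sketch below assumes the installed rcdk exports generate.formula2 with the signature given above (the name may change, so the call is guarded); the mass 180.0634 and the specific ranges are illustrative values only.

```r
# Narrower ranges than the defaults; each entry is c(min, max) for one element
elements <- list(C = c(0, 20), H = c(0, 40), N = c(0, 4), O = c(0, 10))

# Sanity check (base R only): every range has two values with min <= max
ranges_ok <- vapply(elements,
                    function(r) length(r) == 2 && r[1] <= r[2],
                    logical(1))
stopifnot(all(ranges_ok))

# Assumption: generate.formula2 is exported by the installed rcdk build
if (requireNamespace("rcdk", quietly = TRUE) &&
    requireNamespace("itertools", quietly = TRUE) &&
    "generate.formula2" %in% getNamespaceExports("rcdk")) {
  it <- rcdk::generate.formula2(180.0634, window = 0.01, elements = elements)
  hit <- itertools::ihasNext(it)
  while (itertools::hasNext(hit)) {
    print(iterators::nextElem(hit))
  }
}
```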
Probably should come up with a better name. Also need to add docs.
As for speed-up, I'm seeing ~30x compared to the current method:
library(rcdk)
library(itertools)
f1 <- function(n) {
it <- generate.formula2(n, as.string=FALSE)
as.list(enumerate(it))
}
f2 <- function(n) generate.formula(n)
n <- 500
mean(replicate(10, system.time(f1(n))[3]),trim=0.05)
mean(replicate(10, system.time(f2(n))[3]),trim=0.05)
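The two timings above can be turned into a single ratio with a small helper. This is a sketch only: `time_fn`, `slow` and `fast` are hypothetical names, and the two stand-in functions merely sleep so that the block runs without rcdk. Note that with 10 replicates `trim=0.05` discards nothing (floor(10 * 0.05) is 0), so the result is a plain mean; trimming only takes effect from 20 replicates upward.

```r
# Trimmed mean of elapsed wall-clock time over `reps` runs of f(n)
time_fn <- function(f, n, reps = 10, trim = 0.05) {
  elapsed <- replicate(reps, system.time(f(n))[["elapsed"]])
  mean(elapsed, trim = trim)
}

# Stand-ins for the two generators (assumptions, for illustration only);
# in practice pass f2 and f1 from the benchmark above instead
slow <- function(n) Sys.sleep(0.02)
fast <- function(n) Sys.sleep(0.002)

# Ratio > 1 means the second function is faster
speedup <- time_fn(slow, 500, reps = 5) / time_fn(fast, 500, reps = 5)
print(speedup)
```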
With an R API something like:
The CDK API is described in https://github.com/cdk/cdk-paper-3/blob/master/formula_generator_benchmark/CDK/CDKFormulaGeneratorCLI.java