datamol-io / splito

Machine Learning dataset splitting for life sciences.
https://splito-docs.datamol.io/
Apache License 2.0

rdkit.Chem.rdFingerprintGenerator.GetMorganGenerator() did not match C++ signature #16

Open tsa87 opened 3 months ago

tsa87 commented 3 months ago

There is an error when using splito with rdkit=2024.3.4. nBits is not an argument for rdFingerprintGenerator.GetMorganGenerator; however, it is passed as a default argument when initializing GetMorganGenerator here: https://github.com/datamol-io/splito/blob/654e4270f54363db894c32a6ab5fca2414738017/splito/_distance_split_base.py#L16

The documentation for RDKit has fpSize instead. This might have changed in the new version - we should probably update nBits to fpSize.

rdkit.Chem.rdFingerprintGenerator.GetMorganGenerator([(int)radius=3[, (bool)countSimulation=False[, (bool)includeChirality=False[, (bool)useBondTypes=True[, (bool)onlyNonzeroInvariants=False[, (bool)includeRingMembership=True[, (AtomPairsParameters)countBounds=None[, (int)fpSize=2048[, (AtomPairsParameters)atomInvariantsGenerator=None[, (AtomPairsParameters)bondInvariantsGenerator=None[, (bool)includeRedundantEnvironments=False]]]]]]]]]]]) 
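For illustration, the rename could be sketched as a small helper that maps the old keyword onto the new one before the arguments reach the generator. The function name `rename_morgan_kwargs` and the sample argument values are hypothetical; this is not how splito or datamol implement it, only a minimal sketch of the nBits to fpSize change.

```python
def rename_morgan_kwargs(fp_args: dict) -> dict:
    """Return a copy of fp_args with the legacy `nBits` key renamed to `fpSize`.

    Hypothetical helper for illustration only; not part of splito or datamol.
    """
    updated = dict(fp_args)  # copy so the caller's dict is left untouched
    if "nBits" in updated:
        n_bits = updated.pop("nBits")
        # Keep an explicit fpSize if the caller already provided one.
        updated.setdefault("fpSize", n_bits)
    return updated


legacy_args = {"radius": 2, "nBits": 2048}  # sample values, assumed
new_args = rename_morgan_kwargs(legacy_args)
print(new_args)  # {'radius': 2, 'fpSize': 2048}
```

The resulting dictionary only contains keyword names that appear in the signature above, so it could be passed on as `GetMorganGenerator(**new_args)`.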

Here is the full stacktrace:

    train_idx, test_idx = next(splitter.split(X=keys))
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1841, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/splito/_distance_split_base.py", line 125, in _iter_indices
    X, self._metric = convert_to_default_feats_if_smiles(X, self._metric, n_jobs=self._n_jobs)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/splito/_distance_split_base.py", line 51, in convert_to_default_feats_if_smiles
    X = dm.utils.parallelized(_to_feats, X, n_jobs=n_jobs)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/utils/jobs.py", line 256, in parallelized
    return runner(fn, inputs_list, arg_type=arg_type)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/utils/jobs.py", line 158, in __call__
    return self.sequential(*args, **kwargs)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/utils/jobs.py", line 113, in sequential
    results = [
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/utils/jobs.py", line 114, in <listcomp>
    JobRunner.wrap_fn(callable_fn, arg_type, **fn_kwargs)(dt)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/utils/jobs.py", line 83, in _run
    return fn(args, **fn_kwargs)
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/splito/_distance_split_base.py", line 45, in _to_feats
    feats = dm.to_fp(
  File "/home/rileyparsons/Function/venv/lib/python3.10/site-packages/datamol/fp.py", line 288, in to_fp
    fp_func = fp_func(**fp_args)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdFingerprintGenerator.GetMorganGenerator()
did not match C++ signature:
    GetMorganGenerator(unsigned int radius=3, bool countSimulation=False, bool includeChirality=False, bool useBondTypes=True, bool onlyNonzeroInvariants=False, bool includeRingMembership=True, boost::python::api::object {lvalue} countBounds=None, unsigned int fpSize=2048, boost::python::api::object {lvalue} atomInvariantsGenerator=None, boost::python::api::object {lvalue} bondInvariantsGenerator=None, bool includeRedundantEnvironments=False

cwognum commented 3 months ago

Hi @tsa87, thanks for reporting!

It seems this was caused by https://github.com/datamol-io/datamol/pull/226, where datamol switched from using GetMorganFingerprintAsBitVect to GetMorganGenerator. We should indeed update nBits to fpSize and set a minimum version for datamol.

Do you want to create a PR?

tsa87 commented 3 months ago

Hey @cwognum, okay that makes sense! I'll create a quick PR to update the argument and set the minimum datamol version to >=0.12.5
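As a rough sketch of what the minimum-version constraint means in practice, the check below compares a version string against 0.12.5. The helper names `parse_version` and `meets_minimum` are hypothetical, and the parser assumes plain numeric versions such as "0.12.5"; pre-release or local version suffixes would need a proper parser such as `packaging.version.Version`.

```python
def parse_version(version: str) -> tuple:
    """Split a plain numeric version like '0.12.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str = "0.12.5") -> bool:
    """Return True if `installed` is at least `minimum` (numeric versions only)."""
    return parse_version(installed) >= parse_version(minimum)


# Sample version strings, chosen for illustration.
for candidate in ("0.12.4", "0.12.5", "0.13.0"):
    print(candidate, meets_minimum(candidate))
```

In a real environment the installed version string could be read with `importlib.metadata.version("datamol")` and passed to `meets_minimum`.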