datamol-io / datamol

Molecular Processing Made Easy.
https://docs.datamol.io
Apache License 2.0

Added function to get the number of stereoisomers #217

Closed zhu0619 closed 9 months ago

zhu0619 commented 9 months ago

Changelogs

In the step Chem.FindPotentialStereoBonds(mol, cleanIt=clean_it), the existing stereo information on bonds is cleared if cleanIt=True. Therefore, cleanIt should be disabled when enumerating or counting only the undefined stereochemistry of molecules that already have defined stereo information on their bonds.

See example below: image

Reproduce the error

import datamol as dm
from rdkit import Chem

from rdkit.Chem.EnumerateStereoisomers import GetStereoisomerCount, StereoEnumerationOptions, EnumerateStereoisomers

n_variants = 20
undefined_only = True  # <- enumerate only undefined stereochemistry
rationalise = True
timeout_seconds = None
clean_it = True  # this is what clears the defined stereo on the double bond

stereo_opts = StereoEnumerationOptions(
    tryEmbedding=rationalise,
    onlyUnassigned=undefined_only,
    unique=True,
)

# Raw string so the backslash in the SMILES is not read as an escape sequence
mol = dm.to_mol(r'Br/C=C\Br')
Chem.AssignStereochemistry(mol, force=False, flagPossibleStereoCenters=True, cleanIt=clean_it)  # type: ignore
Chem.FindPotentialStereoBonds(mol, cleanIt=clean_it)  # type: ignore

# With clean_it=True, the defined Z bond is enumerated as if it were undefined
dm.to_image(list(EnumerateStereoisomers(mol, options=stereo_opts)))

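
The effect described above can also be checked without rendering an image, by reading the stereo tag of the double bond directly. The following is a minimal sketch that uses only RDKit; the helper name `double_bond_stereo` is hypothetical and not part of datamol or RDKit.

```python
from rdkit import Chem

SMILES = r"Br/C=C\Br"  # double bond with fully defined stereo


def double_bond_stereo(mol):
    """Hypothetical helper: return the stereo tag of the first double bond."""
    for bond in mol.GetBonds():
        if bond.GetBondType() == Chem.BondType.DOUBLE:
            return bond.GetStereo()
    raise ValueError("molecule has no double bond")


# Stereo tag as parsed from the SMILES, before any extra perception step
original = double_bond_stereo(Chem.MolFromSmiles(SMILES))

# cleanIt=False: existing bond stereo is left untouched
kept_mol = Chem.MolFromSmiles(SMILES)
Chem.FindPotentialStereoBonds(kept_mol, cleanIt=False)
kept = double_bond_stereo(kept_mol)

# cleanIt=True: existing bond stereo is reset before potential bonds are flagged
cleaned_mol = Chem.MolFromSmiles(SMILES)
Chem.FindPotentialStereoBonds(cleaned_mol, cleanIt=True)
cleaned = double_bond_stereo(cleaned_mol)

print("original:", original)
print("cleanIt=False:", kept)
print("cleanIt=True:", cleaned)
```

If the defined tag survives only in the `cleanIt=False` case, the enumeration with `onlyUnassigned=True` will treat the cleaned bond as unassigned, which matches the behavior reported in this PR.
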
codecov[bot] commented 9 months ago

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (9e94d02) 91.96% compared to head (e812492) 91.93%.

| Files | Patch % | Lines |
|---|---|---|
| datamol/isomers/_enumerate.py | 90.90% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #217      +/-   ##
==========================================
- Coverage   91.96%   91.93%   -0.03%
==========================================
  Files          46       46
  Lines        3832     3843      +11
==========================================
+ Hits         3524     3533       +9
- Misses        308      310       +2
```

| [Flag](https://app.codecov.io/gh/datamol-io/datamol/pull/217/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datamol-io) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/datamol-io/datamol/pull/217/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datamol-io) | `91.93% <91.66%> (-0.03%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=datamol-io#carryforward-flags-in-the-pull-request-comment) to find out more.


zhu0619 commented 9 months ago

> Thanks Lu.
>
> It looks good to me after fixing the docstring.
>
> Question: my understanding is that GetStereoisomerCount will actually do the exact same thing as enumerate_stereoisomers with n_variants=<MAX> and simply call len() on the output. Am I correct here? Maybe check what the rdkit code is doing under the hood. Not really a big deal for me here, but I just wanted to flag it in case you think count() should instead reuse enumerate().

[GetStereoisomerCount](https://github.com/rdkit/rdkit/blob/2a68050ed07a3b27cabf33d535f0c46117135209/rdkit/Chem/EnumerateStereoisomers.py#L136C24-L136C24) computes an estimate (an upper bound) from the potential stereo elements of the molecule, without generating the isomers themselves. So in some cases the count from GetStereoisomerCount is larger than the number of enumerated stereoisomers.

Initially, I was using the output of enumerate_stereoisomers, but the computation time is too long, especially for large datasets, even with parallelization.

I will also add an option to count the isomers using enumerate_stereoisomers if the user needs more accurate counts.
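
The trade-off between the fast estimate and the slower exact count could be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name `count_stereoisomers` and its `exact` flag are hypothetical, and only the RDKit calls (`GetStereoisomerCount`, `EnumerateStereoisomers`, `StereoEnumerationOptions`) are existing API.

```python
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import (
    EnumerateStereoisomers,
    GetStereoisomerCount,
    StereoEnumerationOptions,
)


def count_stereoisomers(mol, undefined_only=False, exact=False):
    """Hypothetical helper: count stereoisomers by estimate or by enumeration."""
    opts = StereoEnumerationOptions(onlyUnassigned=undefined_only, unique=True)
    if exact:
        # Slow path: generate every stereoisomer and count the results.
        # Note that enumeration is capped by opts.maxIsomers (1024 by default).
        return sum(1 for _ in EnumerateStereoisomers(mol, options=opts))
    # Fast path: upper-bound estimate, no isomers are generated.
    return GetStereoisomerCount(mol, options=opts)


# One unassigned stereocenter, so both paths should agree here
mol = Chem.MolFromSmiles("FC(Cl)Br")
estimate = count_stereoisomers(mol)
exact = count_stereoisomers(mol, exact=True)
print("estimate:", estimate, "| exact:", exact)
```

For molecules with internal symmetry or with stereo combinations that are filtered out during enumeration, the exact count can be smaller than the estimate, which is the discrepancy mentioned above.
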

hadim commented 9 months ago

OK, so it seems like GetStereoisomerCount is doing a slightly different thing and also seems faster. All good then, thank you Lu!