GT4SD / gt4sd-core

GT4SD, an open-source library to accelerate hypothesis generation in the scientific discovery process.
https://gt4sd.github.io/gt4sd-core/
MIT License

RT: Improved handling of substructures in generation task #209

Closed jannisborn closed 1 year ago

jannisborn commented 1 year ago

Minor PR. At inference time, a list of substructures can be provided that are excluded from the stochastic masking (substructures_to_mask: List[str]).

OLD: If a substructure could not be found in the seed SMILES (matching occurred on string level), it was ignored for all further steps.

NEW: The substructure is ignored entirely only if it also cannot be identified in the seed molecule with an RDKit substructure match. If the RDKit match succeeds, the substructure is still ignored for the masking & generation part, but it is used in the post-hoc filtering, which essentially ensures that all returned molecules contain the substructure. Note that this might slow down inference in edge cases.
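
For illustration, the NEW behaviour could be sketched roughly as below. This is not the code from this PR: the function names (`rdkit_match`, `triage_substructures`, `keep_valid`) and the three-way split are hypothetical, and the sketch assumes the substructures are given as SMILES. Only `Chem.MolFromSmiles` and `Mol.HasSubstructMatch` are actual RDKit calls.

```python
from typing import Callable, List, Tuple


def rdkit_match(smiles: str, substructure: str) -> bool:
    """Graph-level check: does `substructure` occur in `smiles` according to RDKit?"""
    from rdkit import Chem  # lazy import, so the rest of the sketch runs without RDKit

    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmiles(substructure)
    if mol is None or pattern is None:  # unparsable input counts as "no match"
        return False
    return mol.HasSubstructMatch(pattern)


def triage_substructures(
    seed: str,
    substructures: List[str],
    graph_match: Callable[[str, str], bool] = rdkit_match,
) -> Tuple[List[str], List[str], List[str]]:
    """Split substructures into (maskable, filter_only, dropped) (hypothetical helper)."""
    maskable, filter_only, dropped = [], [], []
    for sub in substructures:
        if sub in seed:
            # String-level hit: can be protected from stochastic masking.
            maskable.append(sub)
        elif graph_match(seed, sub):
            # Only RDKit finds it: skip for masking, keep for post-hoc filtering.
            filter_only.append(sub)
        else:
            # Found neither way: ignored for all further steps.
            dropped.append(sub)
    return maskable, filter_only, dropped


def keep_valid(
    generated: List[str],
    required: List[str],
    graph_match: Callable[[str, str], bool] = rdkit_match,
) -> List[str]:
    """Post-hoc filter: keep only molecules that contain every required substructure."""
    return [
        smi for smi in generated
        if all(graph_match(smi, sub) for sub in required)
    ]
```

With RDKit installed, `triage_substructures("c1ccccc1O", ["Oc1ccccc1"])` would be expected to place the substructure in the filter-only group, since the string `Oc1ccccc1` does not occur in the seed string although both describe a phenol. The post-hoc filter runs one substructure match per generated molecule and required substructure, which is where the slowdown in edge cases would come from.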