MolecularAI / reaction_utils

Utilities for working with datasets of chemical reactions, reaction templates and template extraction.
https://molecularai.github.io/reaction_utils/
Apache License 2.0

USPTO processing pipeline: "remove_unsanitizable" implies "trim_rxn_smiles" called before #13

Open Academich opened 9 months ago

Academich commented 9 months ago

I have a task that requires USPTO with only sanitizable molecules, but with the CXSMILES information retained. However, if I keep the CXSMILES, the "remove_unsanitizable" pipeline step tries to sanitize the products together with the CXSMILES extension and naturally fails, which results in 700k reactions being invalidated. It would be nice if the CXSMILES extension were never passed to RDKit along with the product SMILES, even when the CXSMILES has not been removed from the reaction.
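To illustrate the request, here is a minimal sketch of separating the CXSMILES extension from the reaction SMILES before the products are inspected. The helper names `split_cxsmiles` and `product_smiles` are hypothetical and not part of reaction_utils; the sketch assumes the extension is separated from the reaction SMILES by a single space, as in `A.B>>C |f:0.1|`.

```python
from typing import Tuple


def split_cxsmiles(reaction_smiles: str) -> Tuple[str, str]:
    """Split a reaction string into (plain reaction SMILES, CXSMILES extension).

    Hypothetical helper: assumes the extension, if any, follows the first space.
    The extension is returned unchanged so it can be re-attached later.
    """
    core, _, extension = reaction_smiles.strip().partition(" ")
    return core, extension.strip()


def product_smiles(reaction_smiles: str) -> str:
    """Return only the product part, without any CXSMILES extension."""
    core, _ = split_cxsmiles(reaction_smiles)
    # A reaction SMILES has the form reactants>agents>products
    _reactants, _agents, products = core.split(">")
    return products


rxn = "[Na+].[Cl-].CCO>>CCO |f:0.1|"
core, extension = split_cxsmiles(rxn)
print(core)                 # reaction SMILES without the extension
print(extension)            # the retained CXSMILES extension
print(product_smiles(rxn))  # product string that could be sanitized
```

The string returned by `product_smiles` could then be passed to a sanitization check (for example RDKit's `Chem.MolFromSmiles`), while the original reaction string with its extension stays in the dataset.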

SGenheden commented 9 months ago

Thanks for your feedback. This is on our to-do list. It would naturally be better to parse the CXSMILES correctly, which would entail placing level-1 parentheses around the SMILES fragments that should be considered as one molecule.
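A rough sketch of what such grouping could look like follows. It is not the reaction_utils implementation; `fragment_groups` and `group_fragments` are hypothetical names. It assumes that the `f:` field of the extension lists dot-joined fragment indices with groups separated by commas, that indices count the dot-separated components across the whole reaction string, that each group lists its lowest index first, and that all members of a group belong to the same side of the reaction.

```python
import re
from typing import List

# Matches the fragment-grouping field, e.g. "f:0.1,2.3.4"
_FRAGMENT_FIELD = re.compile(r"(?:^|[|,])f:(\d+(?:\.\d+)*(?:,\d+(?:\.\d+)*)*)")


def fragment_groups(extension: str) -> List[List[int]]:
    """Return the fragment groups from a CXSMILES extension, e.g. [[0, 1]]."""
    match = _FRAGMENT_FIELD.search(extension)
    if match is None:
        return []
    return [[int(i) for i in group.split(".")] for group in match.group(1).split(",")]


def group_fragments(reaction_smiles: str) -> str:
    """Wrap fragments that form one molecule in level-1 parentheses."""
    core, _, extension = reaction_smiles.strip().partition(" ")
    roles = [role.split(".") if role else [] for role in core.split(">")]
    flat = [fragment for role in roles for fragment in role]

    groups = fragment_groups(extension)
    members = {group[0]: group for group in groups}      # first index -> whole group
    skip = {i for group in groups for i in group[1:]}    # emitted with their group

    grouped_roles = []
    index = 0  # running fragment index across the whole reaction
    for role in roles:
        parts = []
        for fragment in role:
            if index in members:
                parts.append("(" + ".".join(flat[i] for i in members[index]) + ")")
            elif index not in skip:
                parts.append(fragment)
            index += 1
        grouped_roles.append(".".join(parts))
    return ">".join(grouped_roles)


print(group_fragments("[Na+].[Cl-].CCO>>CCO |f:0.1|"))
```

A reaction without an `f:` field is returned without its extension and otherwise unchanged. Grouped fragments are emitted at the position of the first member of the group.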