MolecularAI / REINVENT4

AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Apache License 2.0

Problem with ReactionFilter (ValueError: ChemicalReactionParserException: a reaction requires at least two > characters) #140

Closed dhristozov closed 1 month ago

dhristozov commented 1 month ago

I am getting a ValueError when using the ReactionFilter component with reinvent 4.4.22 under Linux.

Traceback (most recent call last):
  File "...bin/reinvent", line 8, in <module>
    sys.exit(main())
  File "reinvent/Reinvent.py", line 334, in main
    runner(
  File "reinvent/runmodes/RL/run_staged_learning.py", line 301, in run_staged_learning
    packages = create_packages(reward_strategy, stages, rdkit_smiles_flags2)
  File "reinvent/runmodes/RL/run_staged_learning.py", line 160, in create_packages
    scoring_function = Scorer(stage.scoring)
  File "reinvent/scoring/scorer.py", line 75, in __init__
    self.components = get_components(config.component)
  File "reinvent/scoring/config.py", line 94, in get_components
    component = Component(component_params)
  File "reinvent_plugins/components/comp_reaction_filter.py", line 75, in __init__
    self.reaction_filters.append(filter_class(rf_params))
  File "reinvent/chemistry/library_design/reaction_filters/selective_filter.py", line 20, in __init__
    self._reactions = self._configure_reactions(configuration.reactions)
  File "reinvent/chemistry/library_design/reaction_filters/selective_filter.py", line 28, in _configure_reactions
    converted = self._chemistry.create_reactions_from_smarts(smarts_list)
  File "reinvent/chemistry/library_design/fragment_reactions.py", line 20, in create_reactions_from_smarts
    reactions = [AllChem.ReactionFromSmarts(smirks) for smirks in smarts]
  File "reinvent/chemistry/library_design/fragment_reactions.py", line 20, in <listcomp>
    reactions = [AllChem.ReactionFromSmarts(smirks) for smirks in smarts]
ValueError: ChemicalReactionParserException: a reaction requires at least two > characters

Stepping through the code, I see that the ReactionFilter component passes a ReactionFilterParams object to the actual filter class (line 75 in comp_reaction_filter.py), whereas SelectiveFilter expects a ReactionFilterConfiguration object. The constructor then calls _configure_reactions, which expects a Dict[str, List[str]], but neither ReactionFilterParams nor ReactionFilterConfiguration has such a Dict. The issue seems to be that ReactionFilterParams provides a single list rather than a list of lists, so create_reactions_from_smarts ultimately receives a string and tries to parse each character as a reaction.
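The failure mode described above can be reproduced without RDKit. The following is a minimal sketch, assuming a stand-in parser: `parse_reaction` is hypothetical and only mimics the ">" count check behind the error message, while `create_reactions_from_smarts` copies the list comprehension shape shown in the traceback.

```python
def parse_reaction(smirks: str) -> str:
    """Stand-in for AllChem.ReactionFromSmarts (assumption: only the
    check for two '>' characters is mimicked, nothing is really parsed)."""
    if smirks.count(">") < 2:
        raise ValueError(
            "ChemicalReactionParserException: "
            "a reaction requires at least two > characters"
        )
    return smirks


def create_reactions_from_smarts(smarts):
    # Same comprehension shape as in the traceback: it iterates over
    # whatever it receives, whether that is a list or a plain string.
    return [parse_reaction(smirks) for smirks in smarts]


AMIDE = "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"

# A list of SMARTS strings: every element is a complete reaction.
print(create_reactions_from_smarts([AMIDE]))

# A bare string: iteration yields single characters, so "[" is
# handed to the parser first and the call fails.
try:
    create_reactions_from_smarts(AMIDE)
except ValueError as err:
    print(err)
```

With a flat list in the TOML, each inner element that reaches this function is a string instead of a list, which matches the second case.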

If I change the toml to

params.reaction_smarts = [[
    "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"
]]

runs complete as expected.

Is that the expected way to specify these? Thanks.

Minimal config file to reproduce the problem (please adjust the paths to the priors and make sure "some.scaffolds.smi" exists):

run_type = "staged_learning"
device = "cuda:0"  # set torch device e.g. "cpu"
tb_logdir = "tb_logs"  # name of the TensorBoard logging directory
json_out_config = "_staged_learning.json"  # write this TOML to JSON

[parameters]
summary_csv_prefix = "staged_learning"  # prefix for the CSV file
use_checkpoint = false  # if true read diversity filter from agent_file
purge_memories = false  # if true purge all diversity filter memories after each stage

## LibInvent
prior_file = "priors/libinvent.prior"
agent_file = "priors/libinvent.prior"
smiles_file = "some.scaffolds.smi"  # 1 scaffold per line with attachment points

batch_size = 128          # network
unique_sequences = true  # if true remove all duplicate raw sequences in each step
randomize_smiles = true  # if true shuffle atoms in SMILES randomly

[learning_strategy]
type = "dap"      # dap: only one supported
sigma = 128       # sigma of the RL reward function
rate = 0.0001     # for torch.optim

[[stage]]
chkpt_file = 'test1.chkpt'  # name of the checkpoint file, can be reused as agent

termination = "simple"  # termination criterion for this stage
max_score = 1  # terminate if this total score is exceeded
min_steps = 2  # run for at least this number of steps
max_steps = 3  # terminate entire run when exceeded

[stage.scoring]
type = "arithmetic_mean"  # aggregation function
parallel = false

[[stage.scoring.component]]  # RXN filter
[stage.scoring.component.ReactionFilter]

[[stage.scoring.component.ReactionFilter.endpoint]]
name = "ReactionFilter"
params.type = "selective"
params.reaction_smarts = [
    "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"
]

halx commented 1 month ago

Hi again,

the code is, for historical reasons, still a rather convoluted mess and it is a bit difficult to see what is going on. The error messages are not helping either. The relevant data structure is Parameters in said file, which asks for a list of lists. In practice you may have several attachment points in a scaffold and so have to provide a reaction SMARTS for each. You may also want to provide several desired reaction patterns per attachment point and so need another entry in the inner list.
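Because the parser error does not point at the real cause, a shape check before the filter is built would fail with a clearer message. This is a minimal sketch only: `check_reaction_smarts` is a hypothetical helper and not part of REINVENT4, and it assumes the value has already been read from the TOML into plain Python lists.

```python
from typing import List


def check_reaction_smarts(reaction_smarts) -> List[List[str]]:
    """Hypothetical helper: verify the list-of-lists shape, with one
    inner list of reaction SMARTS per attachment point."""
    if not isinstance(reaction_smarts, list):
        raise TypeError("reaction_smarts must be a list of lists of strings")

    for idx, inner in enumerate(reaction_smarts):
        if isinstance(inner, str):
            # This is the flat-list case from the original report.
            raise TypeError(
                f"entry {idx} is a string; wrap the reactions for each "
                "attachment point in their own list, e.g. [[...]]"
            )
        if not isinstance(inner, list) or not all(
            isinstance(smarts, str) for smarts in inner
        ):
            raise TypeError(
                f"entry {idx} must be a list of reaction SMARTS strings"
            )

    return reaction_smarts


AMIDE = "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"

check_reaction_smarts([[AMIDE]])      # nested form: accepted

try:
    check_reaction_smarts([AMIDE])    # flat form: rejected with a hint
except TypeError as err:
    print(err)
```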

Many thanks, Hannes.

dhristozov commented 1 month ago

Hi Hannes,

Thanks for the quick reply!

Just to clarify, is specifying the reaction patterns as a list of lists in toml (with one list per attachment point, though I am not sure how those will be mapped) the way to go?

Thanks, Dimitar

halx commented 1 month ago

Ah, yes. It is definitely a list of lists. Each inner list holds the RDKit reaction SMARTS for one dummy atom in the input scaffold, and the inner lists are assigned to the dummy atoms from left to right as they appear in the SMILES string. It is strictly in that order. We do not make use of the atom map numbers!
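The positional rule can be illustrated with a short sketch. It is not REINVENT4 code: the regular expression, the function name, the example scaffold and the length check are all assumptions made for illustration. It only shows that pairing follows the textual order of the dummy atoms and ignores their map numbers.

```python
import re

# Assumption: attachment points are written as [*] or [*:N] in the scaffold.
DUMMY_ATOM = re.compile(r"\[\*(?::\d+)?\]")


def pair_reactions_with_attachment_points(scaffold, reaction_smarts):
    """Pair each dummy atom, in left-to-right order of appearance in the
    SMILES string, with the inner list at the same position."""
    dummies = DUMMY_ATOM.findall(scaffold)
    if len(dummies) != len(reaction_smarts):
        # Design choice of this sketch, not documented REINVENT4 behaviour.
        raise ValueError(
            f"{len(dummies)} attachment points but "
            f"{len(reaction_smarts)} reaction lists"
        )
    return list(zip(dummies, reaction_smarts))


AMIDE = "[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]"

# Hypothetical scaffold: the map numbers are deliberately out of order.
scaffold = "[*:2]c1ccc([*:1])cc1"

for dummy, reactions in pair_reactions_with_attachment_points(
    scaffold, [[AMIDE], [AMIDE]]
):
    print(dummy, reactions)
```

In this example the first inner list goes with `[*:2]` because that dummy atom appears first in the string, even though its map number is the larger one.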