Hi,
the REINVENT priors have a fixed vocabulary and we do not have a way to extend it. Validity of the SMILES alone is not sufficient. The dot in SMILES is, by the way, effectively a no-op and is used as a fragment separator. That would only be useful for the Lib- and Linkinvent generators, although there, for historical reasons, the pipe "||" symbol is used instead. Stereochemistry is only supported in the Mol2Mol generator. You will have to filter those SMILES.
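For the filtering, something like the sketch below might work. The regex tokenizer and the example vocabulary are just my assumptions for illustration and not the exact ones REINVENT uses internally:

```python
import re

# Minimal SMILES tokenizer sketch: bracket atoms, two-letter halogens, ring
# closures with '%', then single characters. Not the exact REINVENT tokenizer.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def in_vocabulary(smiles, vocabulary):
    """True if every token of the SMILES is known to the prior."""
    return all(token in vocabulary for token in tokenize(smiles))

# Hypothetical vocabulary; in practice this would come from the prior file.
vocab = {"C", "c", "N", "n", "O", "o", "S", "s", "F", "Cl", "Br",
         "(", ")", "=", "#", "1", "2", "3", "4",
         "[N+]", "[N-]", "[O-]", "[S+]", "[n+]", "[nH]"}

external = ["CCO", "C[NH1]C", "CC(=O)[O-]", "C1CC1.[Na+]"]
usable = [s for s in external if in_vocabulary(s, vocab)]
print(usable)  # SMILES containing '[NH1]', '.' or '[Na+]' are filtered out
```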
I do not know what you are trying to do here, but if you are trying to inject external SMILES into RL you are essentially doing off-policy RL. You would have to think about how to incorporate this into the loss function.
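As a rough illustration only (a sketch of a possible direction, not anything implemented in REINVENT), mixing external SMILES into a DAP-style loss could look something like this; the plain weighting of the off-policy term is a placeholder and a proper off-policy treatment, e.g. importance weighting, is left open:

```python
import torch

def dap_loss(agent_nll, prior_nll, scores, sigma=128.0):
    # DAP-style loss: squared difference between the agent NLL and the
    # augmented NLL (prior NLL lowered by sigma * score).
    augmented_nll = prior_nll - sigma * scores
    return torch.pow(augmented_nll - agent_nll, 2).mean()

def combined_loss(on_policy, off_policy, sigma=128.0, off_weight=0.5):
    # on_policy / off_policy are (agent_nll, prior_nll, scores) tensor tuples;
    # the off-policy tuple would come from external SMILES and their scores.
    # The simple weighted sum is a placeholder -- corrections such as
    # importance sampling are deliberately not addressed here.
    return (dap_loss(*on_policy, sigma=sigma)
            + off_weight * dap_loss(*off_policy, sigma=sigma))
```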
Cheers, Hannes.
Thanks for your helpful reply. I am indeed trying to design some novel RL loss functions to incorporate external SMILES. Moreover, the external SMILES are provided by other generative models, and the issue mentioned above occurs frequently. If I further pretrain the prior model on other datasets containing those unseen characters, is it possible to extend the vocabulary?
Best, Data-Reindeer
The vocabulary is a fixed part of the network as it has been trained on a fixed compound dataset with a fixed set of tokens. Retraining seems the only way to go.
Your work sounds very interesting and I guess it is some form of multi-agent approach. In case this interests you, I have a rather hackish implementation incorporating GB_GA, a genetic approach, in my fork but without any thought as to how the new loss function would need to be designed.
That helps a lot. I will refer to the logic in the code you provided. I’m closing this issue now. Thanks again for your response!
Hi Hannes,
I still have a question. The vocabulary of reinvent.prior now seems quite limited, especially for bracket atoms: it only recognizes a handful of types, e.g. '[N+]', '[N-]', '[O-]', '[S+]', '[n+]', '[nH]', while some common ones such as '[NH4]' and '[NH1]' are not supported. Moreover, it does not support chiral information such as '@', which is also crucial functionality for drug design. I wonder if there is a more comprehensive pre-trained prior model beyond the existing one, or whether there are any plans to address this.
Best, Data-Reindeer
Ok, this is considerably more complex than you may appreciate.
What you show me suggests that you are trying to incorporate SMILES with explicit hydrogens. This is only necessary in cases where the chemistry would otherwise be ambiguous and, of course, in the case of protonation states, including protomeric tautomers. We have made the conscious decision not to go with explicit hydrogens because it is precisely those protonation states that would complicate things considerably. The same goes for stereochemistry, which, to me, only really makes sense if you actually use 3D scoring components. It is unclear whether REINVENT would be able to pick up a signal here, as it would need to learn, implicitly, that a molecule is three-dimensional. So we leave those aspects out to be handled by scoring components accordingly.
What you can do is remove explicit hydrogens and "flatten" the SMILES (remove stereochemistry). We recently published a paper on active learning in which we find quite good performance despite the fact that REINVENT works in an information-deficient environment (regarding the "real" chemistry).
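A minimal RDKit sketch of that preprocessing, assuming the external SMILES are parseable by RDKit, could be:

```python
from rdkit import Chem

def flatten_smiles(smiles):
    """Return a canonical SMILES without explicit hydrogens or stereochemistry,
    or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.RemoveHs(mol)          # drop explicit hydrogen atoms
    Chem.RemoveStereochemistry(mol)   # strip chiral tags and cis/trans flags
    return Chem.MolToSmiles(mol, isomericSmiles=False)

print(flatten_smiles("C[C@@H](N)C(=O)O"))  # -> CC(N)C(=O)O
```

Round-tripping through RDKit like this should also drop redundant bracket hydrogen counts such as [NH1] wherever they match the default valence.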
Another question is on what valence/aromaticity model your generator has been built. If it was not RDKit, you cannot assume that RDKit will interpret the SMILES consistently. That is a fundamental problem of chemoinformatics.
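A simple guard against that (my suggestion, not something built into REINVENT) is to round-trip every external SMILES through RDKit and keep only those it can sanitize:

```python
from rdkit import Chem

def rdkit_canonical(smiles):
    """Re-interpret the SMILES under RDKit's valence/aromaticity model.
    Returns the RDKit canonical SMILES, or None if sanitization fails."""
    mol = Chem.MolFromSmiles(smiles)  # sanitizes by default
    return Chem.MolToSmiles(mol) if mol is not None else None

external = ["c1ccccc1", "c1ccc1", "N(C)(C)(C)C"]  # the last two fail sanitization
kept = [s for s in external if rdkit_canonical(s) is not None]
print(kept)  # ['c1ccccc1']
```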
Thank you for your detailed response; it makes sense. I will try to remove the stereochemistry information from the external SMILES. Additionally, the article you provided is very helpful, and I will check it out to see if I can get some inspiration.
Hi,
Recently, I tried modifying the dap strategy of reinforcement learning provided in REINVENT. I want the model to undergo reinforcement learning training on some externally provided SMILES and their scores. However, when I ran it, I found that REINVENT could not handle some special external tokens (out of vocabulary), even though they occur in valid SMILES, e.g. '[NH1]', '.' and '[C@@H1]'. The example error information is as follows. Do you know how I can fix this problem? It seems that this is because the prior model is pretrained on regular molecular datasets without these special tokens. Can I just add these tokens to the vocabulary?
Thanks!