BenevolentAI / MolBERT

MIT License

symbol and char in the elements.txt #6

Open dengjianyuan opened 3 years ago

dengjianyuan commented 3 years ago

Hi,

Thank you for providing the source code for MolBERT, which is great work!

I have two questions on the elements.txt.

  1. Why is only 'se' denoted as AromaticSe? What about aromatic C/N/S, etc.?
  2. Why is only @@ recorded for chirality? Don't we also need to record @ for the counterclockwise case, which is a common symbol in SMILES strings?

Many thanks in advance!! =)

LivC193 commented 3 years ago

@dengjianyuan I was just about to ask the same question. The function that describes how they standardise SMILES is here (note the flag called canonicalise, which is set to True): https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1099

Also there is a special set of chars here which includes @: https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1028

@@ is already included in here https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/data/elements.txt#L120 and used here: https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1014

So, to answer 2): I think both @@ and @ are handled, but in different parts of the code. I would, however, like to know whether the input sequences (after the SMILES are standardised) contain any isomeric information, which is critical for some ligands such as carbohydrates.

JoshuaMeyers commented 3 years ago

Hey Guys, thanks for your interest in our work and apologies for the slow reply. I am also very glad we can provide source code.

I believe the same answer applies to many of the queries raised here. Since we tokenize SMILES character by character, we must handle multi-character elements differently, which is why they are separated in the code. For example, [Os] should be treated as osmium, not as aliphatic oxygen followed by aromatic sulphur.
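To illustrate the point about character-level tokenization, here is a minimal sketch of a greedy longest-match tokenizer. It is not MolBERT's actual implementation: the function name `tokenize` and the token list `MULTI_CHAR_TOKENS` are hypothetical and are not copied from elements.txt. The idea is only that multi-character symbols are matched first, and everything else falls back to single characters.

```python
# Hypothetical list of multi-character symbols, for illustration only.
# It is NOT the content of MolBERT's elements.txt.
MULTI_CHAR_TOKENS = ["Cl", "Br", "Os", "Se", "@@"]


def tokenize(smiles, multi_char_tokens=MULTI_CHAR_TOKENS):
    """Split a SMILES string into tokens, preferring multi-character symbols."""
    # Try longer symbols first so that a longer match always wins.
    candidates = sorted(multi_char_tokens, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(smiles):
        for tok in candidates:
            if smiles.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No multi-character symbol starts here: emit a single character.
            tokens.append(smiles[i])
            i += 1
    return tokens


if __name__ == "__main__":
    # Naive per-character splitting loses the element: ['[', 'O', 's', ']']
    print(list("[Os]"))
    # Longest-match keeps osmium as one token: ['[', 'Os', ']']
    print(tokenize("[Os]"))
    # '@@' is emitted as one token, while a lone '@' stays a single character.
    print(tokenize("C[C@@H](N)C(=O)O"))
    print(tokenize("C[C@H](N)C(=O)O"))
```

Note that this sketch assumes the input contains no lower-case aromatic atoms; otherwise a sequence such as an upper-case O followed by an aromatic s would be wrongly merged into "Os".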

For most cases, our solution is to Kekulize our SMILES. After kekulization, aromatic sulphur is now upper case, and can no longer be confused with [Os]. In the special case of aromatic [Se], we handle this differently since it is a two-character element that has both aliphatic and aromatic forms.

This is also the reason for separating @ and @@.
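The kekulization step mentioned in this comment can be sketched with RDKit, assuming RDKit is installed. This is an illustration of the general technique, not code taken from the MolBERT repository, and the helper name `kekulize_smiles` is hypothetical.

```python
def kekulize_smiles(smiles: str) -> str:
    """Return a Kekule-form SMILES in which aromatic atoms are written in upper case."""
    # Imported inside the function so this module can be loaded without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # clearAromaticFlags=True replaces aromatic bonds with explicit single/double bonds
    # and clears the aromatic flag on the atoms.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return Chem.MolToSmiles(mol, kekuleSmiles=True)


if __name__ == "__main__":
    # Thiophene: the aromatic "s" should come back as an upper-case "S".
    print(kekulize_smiles("c1ccsc1"))
```

Once every aromatic atom is upper case, a lower-case letter can only be the second character of an element symbol, which is what removes the ambiguity with symbols such as [Os].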

JoshuaMeyers commented 3 years ago

@LivC193 Regarding input sequences containing stereoisomeric information: MolBERT can handle @ and @@. These would be tokenized, encoded by the featurizer and potentially learned by our representation. However, we have not tested this, since our training dataset (taken from GuacaMol) does not contain stereochemistry.