Open dengjianyuan opened 3 years ago
@DENGJIANYUAN I was just about to ask the same question. The function that describes how they standardise SMILES is here: https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1099
There is a flag called canonicalise, which is set to True.
Also there is a special set of chars here which includes @: https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1028
@@ is already included in here https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/data/elements.txt#L120 and used here: https://github.com/BenevolentAI/MolBERT/blob/b410cb6c98133b68789f368cc90adff6cb04418a/molbert/utils/featurizer/molfeaturizer.py#L1014
So to answer 2), I think both @@ and @ are considered, but in different parts of the code. I would, however, like to know whether the input sequences (after they standardise the SMILES) contain any isomeric information, which is critical for some ligands such as carbohydrates.
Hey Guys, thanks for your interest in our work and apologies for the slow reply. I am also very glad we can provide source code.
I believe the same answer applies to many of the queries raised here. Since we tokenize SMILES character by character, we must handle multi-character elements differently. This is why they are separated in the code. For example, [Os] should be treated as osmium and not as an aliphatic oxygen followed by an aromatic sulphur.
For most cases, our solution is to kekulize our SMILES. After kekulization, aromatic sulphur is written in upper case and can no longer be confused with [Os]. The special case of aromatic [se] is handled differently, since selenium is a two-character element that has both aliphatic and aromatic forms.
This is also the reason for separating @ and @@.
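For anyone following along, here is a minimal sketch of the idea described above. It is not MolBERT's actual code: the token list MULTI_CHAR_TOKENS and the function tokenize are hypothetical names chosen for illustration, and the real vocabulary lives in elements.txt and molfeaturizer.py. The sketch assumes the input has already been kekulized, so a lower-case s after an upper-case O can only mean osmium. It tries multi-character tokens first (longest first, so @@ wins over @) and otherwise falls back to one character per token.

```python
# Hypothetical token list for illustration only; not MolBERT's real vocabulary.
MULTI_CHAR_TOKENS = ["@@", "Cl", "Br", "Os", "Se", "se"]


def tokenize(smiles, multi_char_tokens=MULTI_CHAR_TOKENS):
    """Greedy longest-match tokenizer.

    Multi-character tokens are tried first; anything else is emitted
    as a single-character token.
    """
    # Longest candidates first so that '@@' is matched before '@'.
    candidates = sorted(multi_char_tokens, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(smiles):
        for tok in candidates:
            if smiles.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No multi-character token starts here: emit one character.
            tokens.append(smiles[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("[Os]"))              # osmium stays one token
    print(tokenize("C[C@@H](N)C(=O)O"))  # '@@' stays one token
    print(tokenize("C[C@H](N)C(=O)O"))   # '@' is a separate token
```

With a plain character-by-character split, [Os] would come out as O followed by s, and @@ as two @ tokens, which is the ambiguity that keeping these tokens in separate lists avoids.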
@LivC182 Regarding input sequences containing stereoisomeric information: MolBERT can handle @ and @@. These would be tokenized, encoded by the featurizer and potentially learned by our representation. However, we have not tested this, since our training dataset (taken from GuacaMol) does not contain stereochemistry.
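If you want to check whether your own SMILES carry stereo information before feeding them in, a quick string-level scan is enough for a first look. The sketch below is an assumption-laden helper, not part of MolBERT: stereo_markers and has_stereo are hypothetical names, and the scan only covers the common SMILES markers (@ and @@ for tetrahedral centres, / and \ for double bonds), not extended chirality classes.

```python
def stereo_markers(smiles):
    """Count common SMILES stereo markers in a string.

    Counts '@@' and '@' (tetrahedral centres) and the two
    directional bond symbols, forward slash and backslash.
    """
    counts = {"@@": 0, "@": 0, "/": 0, "\\": 0}
    i = 0
    while i < len(smiles):
        if smiles.startswith("@@", i):
            # Treat '@@' as one marker rather than two '@'.
            counts["@@"] += 1
            i += 2
        elif smiles[i] in ("@", "/", "\\"):
            counts[smiles[i]] += 1
            i += 1
        else:
            i += 1
    return counts


def has_stereo(smiles):
    """Return True if any of the counted stereo markers is present."""
    return any(stereo_markers(smiles).values())


if __name__ == "__main__":
    for s in ["CC(N)C(=O)O", "C[C@@H](N)C(=O)O", "F/C=C/F"]:
        print(s, has_stereo(s), stereo_markers(s))
```

Running this over a dataset gives a rough idea of how many molecules would lose information if stereochemistry were stripped during standardisation.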
Hi,
Thank you for providing the source code for MolBERT, which is great work!
I have two questions about elements.txt.
Many thanks in advance!! =)