Some SMILES from the E. coli dataset cannot be processed correctly.

Phuangji commented 1 month ago

Hello! When I predict kcat for some E. coli reactions, prediction fails with an "index out of range" error for certain SMILES. For example, CC1(CC(=O)O)C2=Cc3[nH]c(c(CCC(=O)O)c3CC(=O)O)Cc3[nH]c(c(CC(=O)O)c3CCC(=O)O)C=C3N=C(C=C(N2)C1CCC(=O)O)C(C)(CC(=O)O)C3CCC(=O)O produces the following traceback:

Traceback (most recent call last):
  File "predict.py", line 82, in <module>
    pred = M( atoms_pad, atoms_mask, adjacencies_pad, amino_pad, amino_mask, batch_fps, inv_Temp, Temp )
  File "/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/DLTKcat/code/DLTKcat.py", line 161, in forward
    atoms_vector = self.comp_gat(atoms, adjacency)
  File "/DLTKcat/code/DLTKcat.py", line 102, in comp_gat
    atoms_vector = self.embedding_layer_atom(atoms)
  File "/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/torch/nn/modules/sparse.py", line 162, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I ran this on CPU; running on GPU produces a similar error. Could you please tell me what I am doing wrong? Looking forward to your reply. Thank you!

SizheQiu commented 1 month ago

I think the "[nH]" in the SMILES string is the issue. All SMILES strings in my training dataset were canonical SMILES, and those containing ions were filtered out. Please make sure the substrate's SMILES string is in canonical form.

Phuangji commented 1 month ago

Thank you for your reply. I still have some questions. First, I did convert my SMILES to canonical form, using the following code:

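from rdkit import Chem
# Assumes an older RDKit release that still ships the Python MolStandardize submodules.
from rdkit.Chem import MolStandardize


class MolClean(object):
    def __init__(self):
        self.normalizer = MolStandardize.normalize.Normalizer()
        self.lfc = MolStandardize.fragment.LargestFragmentChooser()
        self.uc = MolStandardize.charge.Uncharger()

    def clean(self, smi):
        # Return the cleaned canonical SMILES, or None if the input cannot be parsed.
        mol = Chem.MolFromSmiles(smi)
        if mol:
            mol = self.normalizer.normalize(mol)  # normalize functional groups
            mol = self.lfc.choose(mol)  # keep only the largest fragment
            mol = self.uc.uncharge(mol)  # neutralize charges where possible
            smi = Chem.MolToSmiles(mol, isomericSmiles=False, canonical=True)
            return smi
        else:
            return None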

Second, other SMILES that contain [nH] are processed without errors, for example: O=c1[nH]c(=O)c2[nH]cnc2[nH]1. Also, the failing SMILES contains no ions; it only contains a nitrogen heterocycle. So I do not think your reply resolves my problem. Thank you again!

SizheQiu commented 1 month ago

OK, in this case, the most likely reason is that the substrate contains some molecular fingerprints that are not in my training data, and thus the substrate cannot be encoded as features. That would explain why the atom embedding lookup raises the IndexError.
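
If that is the cause, a pre-check before calling the model could flag such substrates instead of letting the embedding layer crash. The following is only a minimal sketch in plain Python, not code from DLTKcat: the function names, the toy vocabulary, and the idea that fingerprints are mapped to integer IDs through a dictionary built from the training data are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def split_known_unknown(
    fingerprints: Iterable[Hashable], vocab: Dict[Hashable, int]
) -> Tuple[List[int], List[Hashable]]:
    """Map fingerprints to IDs using a training-time vocabulary (hypothetical).

    Returns the IDs of known fingerprints and a list of fingerprints
    that are missing from the vocabulary.
    """
    ids: List[int] = []
    unknown: List[Hashable] = []
    for fp in fingerprints:
        if fp in vocab:
            ids.append(vocab[fp])
        else:
            unknown.append(fp)
    return ids, unknown


def ids_out_of_range(ids: Iterable[int], num_embeddings: int) -> List[int]:
    """Return the IDs that an embedding table with num_embeddings rows cannot index."""
    return [i for i in ids if not 0 <= i < num_embeddings]


# Toy stand-in for a fingerprint dictionary built from training data (assumption).
toy_vocab = {"C": 0, "N": 1, "O": 2}

known_ids, unknown_fps = split_known_unknown(["C", "N", "X", "O"], toy_vocab)
bad_ids = ids_out_of_range([0, 2, 5], num_embeddings=len(toy_vocab))

if unknown_fps or bad_ids:
    print("Substrate cannot be encoded:", unknown_fps, bad_ids)
```

With a check like this, a substrate whose fingerprints fall outside the training vocabulary can be skipped or reported by name, which is easier to diagnose than an IndexError raised deep inside the embedding layer.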

SizheQiu / DLTKcat, issue #2