Great work! And some questions about tokenization

chenxran commented 2 years ago

Hi, this work is very interesting and I am trying to incorporate the sub-graph vocabulary construction into my work. Here I am wondering about the specific mechanism of the algorithm dealing with the shared-edge circles. In my test, some sub-graphs (e.g., cccc, ccccc) will cause the error especially when you turn the sub-mol back to the smiles and then turn to mol again (chem_utils.py line 35-36). Is there any method to avoid this error? Thanks!

chenxran commented 2 years ago

For re-implementation:

Example of error molecule:

CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc(N+[O-])cc3)c(CBr)c2c1=O

Vocab:

{"kekulize": false} CCNC(=O)c1ccccc1 11 1741 CCNC(=O)OC(C)(C)C 10 4925 FC(F)(F)c1ccccc1 10 4233 c1ccc2ncccc2c1 10 3689 CCNCc1ccccc1 10 1972 COC(=O)c1ccccc1 10 1631 O=COCc1ccccc1 10 1606 FC(F)c1ccccc1 9 8522 c1ccc2[nH]ccc2c1 9 2864 c1ccc2nccc2c1 9 2388 NC(=O)c1ccccc1 9 1957 O=SHc1ccccc1 9 1761 FCOc1ccccc1 9 1605 FCc1ccccc1 8 12824 O=Cc1ccccc1 8 10698 COc1ccccc1 8 9141 ccc1ccccc1 8 4480 OCc1ccccc1 8 3301 N#Cc1ccccc1 8 2366 NCc1ccccc1 8 2197 CCc1ccccc1 8 1671 Cc1ccccc1 7 54766 Fc1ccccc1 7 11954 Clc1ccccc1 7 9896 CC(C)(C)OC=O 7 9214 Oc1ccccc1 7 6332 Nc1ccccc1 7 4814 Cc1ccccn1 7 1799 c1ccccc1 6 110036 c1ccncc1 6 5950 c1cncnc1 6 9161 c1ccncc1 6 5950 C1CNCCN1 6 3215 CCNC(C)=O 6 2500 CCOC(C)=O 6 1731 CCNCC 5 9318 ccccn 5 8898 c1cnnc1 5 6157 CCNC=O 5 5299 c1cscn1 5 4087 CCOC=O 5 3642 c1ccsc1 5 3426 c1cncn1 5 3415 CCCC=O 5 2276 c1cnoc1 5 2246 CCCCC 5 2215 ccccc 5 1983 CCC(=O)O 5 1782 cccc 4 398713 CCCC 4 22127 CCNC 4 18245 CC(C)C 4 17269 cccn 4 16517 CCCO 4 8103 CC(N)=O 4 6875 CCC=O 4 5783 ccnc 4 5341 cncn 4 4532 CC(=O)O 4 4277 CC(C)O 4 4011 FC(F)F 4 3326 NCC=O 4 3117 COC=O 4 2988 CSH=O 4 2750 C[Si]C 4 2400 CCCN 4 2370 CCN 3 108106 CCC 3 83520 CCO 3 34808 ccn 3 27601 O=CO 3 23122 O=S=O 3 9809 NC=O 3 7443 FCF 3 6907 CC=O 3 5834 O=[N+][O-] 3 5649 C[Si] 3 3954 ncn 3 3772 ccc 3 2952 CNC 3 2085 cc 2 916776 CC 2 411601 cn 2 84537 CO 2 70939 CN 2 20533 O=S 2 20124 CF 2 10561 [N+][O-] 2 5661 C=O 2 5036 C#N 2 2145 B 1 1411 Br 1 10267 C 1 1721387 Cl 1 27754 F 1 55578 I 1 3891 N 1 232938 O 1 260224 P 1 486 S 1 27025

kxz18 commented 2 years ago

Hi, by tracking the thrown error, I got this from RDKit when creating molecule object from the given SMILES (mol_bpe.py, line 192):

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc(N+[O-])cc3)c(CBr)c2c1=O')
[16:39:54] SMILES Parse Error: syntax error while parsing: CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc(N+[O-])cc3)c(CBr)c2c1=O
[16:39:54] SMILES Parse Error: Failed parsing SMILES 'CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc(N+[O-])cc3)c(CBr)c2c1=O' for input: 'CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc(N+[O-])cc3)c(CBr)c2c1=O'
>>> mol is None
True

It seems that the input SMILES is illegal and RDKit fails to parse it.

chenxran commented 2 years ago

Sorry, I didn't notice that GitHub automatically transforms the (=O) to a link. Therefore both the example and vocab.txt above are wrong. You can run:

from mol_bpe import Tokenizer
from rdkit import Chem

Chem.MolFromSmiles('CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc([N+](=O)[O-])cc3)c(CBr)c2c1=O')
tokenizer = Tokenizer("vocab_100.txt")

tokenizer("CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc([N+](=O)[O-])cc3)c(CBr)c2c1=O")

And my error here is:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-4ebf426e8a59> in <module>
      5 tokenizer = Tokenizer("/data/chenxingran/retrosynthesis/src/megan/data/uspto_50k/vocab_100.txt")
      6 
----> 7 tokenizer("CC(C)S(=O)c1cn(Cc2c(F)cccc2F)c2sc(-c3ccc([N+](=O)[O-])cc3)c(CBr)c2c1=O")

/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/mol_bpe.py in __call__(self, mol)
    251 
    252     def __call__(self, mol):
--> 253         return self.tokenize(mol)
    254 
    255     def __len__(self):

/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/mol_bpe.py in tokenize(self, mol)
    224                     ad_mat[i][j] = ad_mat[j][i] = 1
    225         group_idxs = [x[1] for x in res]
--> 226         return Molecule(smiles, group_idxs, self.kekulize)
    227 
    228     def idx_to_subgraph(self, idx):

/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/molecule.py in __init__(self, smiles, groups, kekulize)
     79             subgraph_mol = get_submol(rdkit_mol, group, kekulize)
     80             subgraph_smi = mol2smi(subgraph_mol)
---> 81             atom_mapping = get_submol_atom_map(rdkit_mol, subgraph_mol, group, kekulize)
...
---> 38     for atom in submol.GetAtoms():
     39         if atom.GetSymbol() == 'N' and (atom.GetExplicitValence() == 3 and atom.GetFormalCharge() == 1) or atom.GetExplicitValence() < 3:
     40             atom.SetNumRadicalElectrons(0)

AttributeError: 'NoneType' object has no attribute 'GetAtoms'

It seems that the RDKit found some sub-graphs of the molecule to be invalid.

The vocab.txt:

{"kekulize": false}
CCNC(=O)c1ccccc1    11  1741
CCNC(=O)OC(C)(C)C   10  4925
FC(F)(F)c1ccccc1    10  4233
c1ccc2ncccc2c1  10  3689
CCNCc1ccccc1    10  1972
COC(=O)c1ccccc1 10  1631
O=COCc1ccccc1   10  1606
FC(F)c1ccccc1   9   8522
c1ccc2[nH]ccc2c1    9   2864
c1ccc2nccc2c1   9   2388
NC(=O)c1ccccc1  9   1957
O=[SH](=O)c1ccccc1  9   1761
FCOc1ccccc1 9   1605
FCc1ccccc1  8   12824
O=Cc1ccccc1 8   10698
COc1ccccc1  8   9141
ccc1ccccc1  8   4480
OCc1ccccc1  8   3301
N#Cc1ccccc1 8   2366
NCc1ccccc1  8   2197
CCc1ccccc1  8   1671
Cc1ccccc1   7   54766
Fc1ccccc1   7   11954
Clc1ccccc1  7   9896
CC(C)(C)OC=O    7   9214
Oc1ccccc1   7   6332
Nc1ccccc1   7   4814
Cc1ccccn1   7   1799
c1ccccc1    6   110036
c1ccncc1    6   5950
c1cncnc1    6   9161
c1ccncc1    6   5950
C1CNCCN1    6   3215
CCNC(C)=O   6   2500
CCOC(C)=O   6   1731
CCNCC   5   9318
ccccn   5   8898
c1cnnc1 5   6157
CCNC=O  5   5299
c1cscn1 5   4087
CCOC=O  5   3642
c1ccsc1 5   3426
c1cncn1 5   3415
CCCC=O  5   2276
c1cnoc1 5   2246
CCCCC   5   2215
ccccc   5   1983
CCC(=O)O    5   1782
cccc    4   398713
CCCC    4   22127
CCNC    4   18245
CC(C)C  4   17269
cccn    4   16517
CCCO    4   8103
CC(N)=O 4   6875
CCC=O   4   5783
ccnc    4   5341
cncn    4   4532
CC(=O)O 4   4277
CC(C)O  4   4011
FC(F)F  4   3326
NCC=O   4   3117
COC=O   4   2988
C[SH](=O)=O 4   2750
C[Si]C  4   2400
CCCN    4   2370
CCN 3   108106
CCC 3   83520
CCO 3   34808
ccn 3   27601
O=CO    3   23122
O=S=O   3   9809
NC=O    3   7443
FCF 3   6907
CC=O    3   5834
O=[N+][O-]  3   5649
C[Si]   3   3954
ncn 3   3772
ccc 3   2952
CNC 3   2085
cc  2   916776
CC  2   411601
cn  2   84537
CO  2   70939
CN  2   20533
O=S 2   20124
CF  2   10561
[N+][O-]    2   5661
C=O 2   5036
C#N 2   2145
B   1   1411
Br  1   10267
C   1   1721387
Cl  1   27754
F   1   55578
I   1   3891
N   1   232938
O   1   260224
P   1   486
S   1   27025

kxz18 commented 2 years ago

I see. This is because in unkekulized form each atom is assigned a label indicating whether it is in an aromatic system (lower case atom symbols in the SMILES mean they are in aromatic systems), and fragments containing only parts of the aromatic systems like cccc is actually illegal SMILES. If replacing aromatic bonds with alternating single/double bonds is acceptable in your situation, the easiest way to fix this error is to set kekulize=True when constructing the vocabulary. I am also working on other ways to fix this bug for better supporting of unkekulized molecule representation.

kxz18 commented 2 years ago

Hi, I have fixed this bug in the latest commit e2977a1e38ff6559a93de5c0409e3607e2f0d563. You can pull the updates and have a try~

chenxran commented 2 years ago

Perfect! It does work.

When I utilize the unkekulized version to test the tokenization yesterday, I found another issue. Can u give a try for the following example utilizing the same vocabulary? Thanks!

C[C@@H](c1cccc(C(C)(C)O[SiH2]C(C)(C)C)n1)N(CCc1ccccc1)C(=O)OC(C)(C)C

The error I got here is

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/mol_bpe.py", line 226, in tokenize
    return Molecule(smiles, group_idxs, self.kekulize)
  File "/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/molecule.py", line 80, in __init__
    subgraph_smi = mol2smi(subgraph_mol)
  File "/data/chenxingran/retrosynthesis/src/megan/PS-VAE/ps/utils/chem_utils.py", line 15, in mol2smi
    return Chem.MolToSmiles(mol)
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdmolfiles.MolToSmiles(NoneType)
did not match C++ signature:
    MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=True, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False, bool doRandom=False)

It seems that the tokenizer here treats Si itself as a subgraph, while rdkit seems not to be able to deal with such a case.

kxz18 commented 2 years ago

Hi~ I've added the support for Si atoms in the latest commit d5c09d695b044ecc8fda67b096164fb0663e940c. It seems that the SMILES of a single Si atom should be [Si] instead of Si, which is different from other atoms. Therefore I added a special branch for Si in the get_submol function.

chenxran commented 2 years ago

Thanks for the update!

THUNLP-MT / PS-VAE

Great work! And some questions about tokenization #1