The Sequence-based GPT2 Model is not Training

Hi @XuhanLiu,

I tried to pretrain the GPT2Model with your scripts, but even on a large data set of compounds (see below) the validity of the generated SMILES does not increase. The SMILES written to the log file are invalid and do not contain the desired fragments. It seems the model can learn the atom type distribution, but not the connectivity.

I also had to make some adjustments to get the scripts working for GPT2Model, hence this pull request (see the changes in the commit of this PR). Please check whether they make sense; in particular, I struggled to guess the correct vocabulary class to use. The contents of my data folder, with an example data set of 10k compounds, can be downloaded here. With that data in place, I just run the scripts as follows, so it should be easy for you to reproduce:

python dataset.py
python train_smiles.py

But even on 1 million compounds I still get the following after more than 600 epochs:

Epoch: 658 step: 0 loss: 1.217 valid: 0.145 desire: 0.000 time: 16
Epoch: 659 step: 0 loss: 1.249 valid: 0.117 desire: 0.000 time: 16
Epoch: 660 step: 0 loss: 1.244 valid: 0.125 desire: 0.000 time: 16
Epoch: 661 step: 0 loss: 1.250 valid: 0.117 desire: 0.000 time: 16
Epoch: 662 step: 0 loss: 1.228 valid: 0.121 desire: 0.000 time: 16
Epoch: 663 step: 0 loss: 1.188 valid: 0.141 desire: 0.000 time: 16
Epoch: 664 step: 0 loss: 1.255 valid: 0.129 desire: 0.000 time: 16
Epoch: 665 step: 0 loss: 1.223 valid: 0.102 desire: 0.000 time: 16
Epoch: 666 step: 0 loss: 1.212 valid: 0.113 desire: 0.000 time: 16
Epoch: 667 step: 0 loss: 1.206 valid: 0.133 desire: 0.000 time: 16
Epoch: 668 step: 0 loss: 1.248 valid: 0.129 desire: 0.000 time: 16
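
In case it helps to compare runs, here is a minimal sketch for summarizing such a log. It assumes every line follows exactly the format shown above; the helper names `parse_log` and `summarize` are hypothetical and are not part of DrugEx.

```python
import re
from statistics import mean

# Assumed line format, copied from the log excerpt above:
# "Epoch: 658 step: 0 loss: 1.217 valid: 0.145 desire: 0.000 time: 16"
LOG_LINE = re.compile(
    r"Epoch:\s*(?P<epoch>\d+)\s+step:\s*(?P<step>\d+)\s+"
    r"loss:\s*(?P<loss>[\d.]+)\s+valid:\s*(?P<valid>[\d.]+)\s+"
    r"desire:\s*(?P<desire>[\d.]+)\s+time:\s*(?P<time>\d+)"
)


def parse_log(lines):
    """Return one dict per line that matches the assumed format; skip the rest."""
    records = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            records.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "valid": float(match["valid"]),
                "desire": float(match["desire"]),
            })
    return records


def summarize(records):
    """Aggregate the parsed records to show whether training has plateaued."""
    return {
        "epochs": (records[0]["epoch"], records[-1]["epoch"]),
        "mean_loss": mean(r["loss"] for r in records),
        "mean_valid": mean(r["valid"] for r in records),
        "max_desire": max(r["desire"] for r in records),
    }


if __name__ == "__main__":
    sample = [
        "Epoch: 658 step: 0 loss: 1.217 valid: 0.145 desire: 0.000 time: 16",
        "Epoch: 659 step: 0 loss: 1.249 valid: 0.117 desire: 0.000 time: 16",
    ]
    print(summarize(parse_log(sample)))
```

A flat `mean_valid` together with a `max_desire` of zero over many epochs would be consistent with the plateau described here.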

Would you be so kind as to take a look at what I am doing wrong? Many thanks!

PS: The graph-based model works great so thanks for that!