Reproduce SMILES strings from the PyTorch_Geometric Dataset

TianBian95 commented 1 year ago

Dear authors,

I want to reproduce a VAE model based on your code, but I can't reproduce the SMILES strings in the training set based on the PyTorch Geometric Data object. My code is as follows:

import torch
import pytorch_lightning as pl

from src.analysis.rdkit_functions import build_molecule, mol2smiles
import src.utils as utils

class VAE(pl.LightningModule):

    ......

    def training_step(self, data, i):
        # Convert the sparse PyG batch into dense, padded tensors.
        dense_data, node_mask = utils.to_dense(data.x, data.edge_index, data.edge_attr, data.batch)
        for bi in range(dense_data.X.size(0)):
            atom_types, edge_types = dense_data.X[bi], dense_data.E[bi]
            # Convert the per-node and per-edge encodings to class indices.
            types_idx = torch.argmax(atom_types, dim=1)
            edge_idx = torch.argmax(edge_types, dim=2)
            SMILES_str = mol2smiles(build_molecule(types_idx, edge_idx, self.dataset_infos.atom_decoder))
            print(SMILES_str)

As the table below shows, most of the reconstructed SMILES strings contain '.', which in SMILES separates disconnected fragments. The effect is strongest on the Moses and Guacamol datasets. As a result, my VAE model trains poorly and gets low validity scores.

Dataset	Total	SMILES_str == None	'.' in SMILES_str
QM9	97,734	675	16,108
Moses	1,584,663	165,151	1,384,182
Guacamol	1,118,633	53,426	1,055,755

Although I can directly download training files containing SMILES strings from other resources, this raises my concern about whether the dataset is properly processed into graph data structures and whether the models trained on these graph structures are reasonable. Do you have any ideas on how this issue happened? Thank you!

TianBian95 commented 1 year ago

In addition, I noticed that you set the preprocess parameter in the Dataset class to handle the case where SMILES_str == None. This seems to indicate that converting between graph data and SMILES strings with RDKit is not completely reversible. However, what should we do about the SMILES strings that contain '.'? If we remove them, the training dataset will become very small. Is the only option to train models like a VAE on the original SMILES strings?

cvignac commented 1 year ago

Hello, my visualizations show that most generated molecules are connected, so this seems to be a problem with your sample code. Have you made sure that you correctly process the molecules in the batch? For example, you do not seem to mask the nodes and edges to account for the varying graph sizes.
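
To illustrate the masking point, here is a toy sketch in plain Python. It is not DiGress code: `count_fragments`, the adjacency encoding (0 meaning "no bond"), and the variable names are assumptions made for illustration. The idea is that padded rows of a dense batch carry no bonds, so if they are decoded as atoms, each one would likely end up as an isolated fragment, which appears as '.' in the SMILES string.

```python
from collections import deque


def count_fragments(adjacency, num_nodes):
    """Count connected components among the first `num_nodes` nodes."""
    seen = set()
    fragments = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        # Found a node not reached so far: it starts a new fragment.
        fragments += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(num_nodes):
                if adjacency[u][v] and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return fragments


# Toy dense graph padded to 5 nodes: a 3-atom chain (0-1-2) plus 2 padding rows.
# adjacency[i][j] is a bond type index, with 0 meaning "no bond" (hypothetical encoding).
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
node_mask = [True, True, True, False, False]  # True marks a real node

n_real = sum(node_mask)
unmasked = count_fragments(adjacency, len(adjacency))  # padding treated as atoms
masked = count_fragments(adjacency, n_real)            # padding dropped first
print(unmasked, masked)  # prints: 3 1
```

In the `training_step` from the question, the analogous change would presumably be to compute `n = int(node_mask[bi].sum())` and use `dense_data.X[bi, :n]` and `dense_data.E[bi, :n, :n]` before the `argmax` calls, assuming `node_mask` marks the real nodes of each graph and that those nodes come first in the padded tensors.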

cvignac / DiGress

Reproduce SMILES strings from the PyTorch_Geometric Dataset #60