cvignac / DiGress

code for the paper "DiGress: Discrete Denoising diffusion for graph generation"
MIT License

Reproduce SMILES strings from the PyTorch_Geometric Dataset #60

TianBian95 closed this issue 10 months ago

TianBian95 commented 1 year ago

Dear authors,

I want to reproduce a VAE model based on your code, but I can't reproduce the SMILES strings in the training set based on the PyTorch Geometric Data object. My code is as follows:

import pytorch_lightning as pl
import torch

from src.analysis.rdkit_functions import build_molecule, mol2smiles
import src.utils as utils

class VAE(pl.LightningModule):

    ......

    def training_step(self, data, i):
        dense_data, node_mask = utils.to_dense(data.x, data.edge_index, data.edge_attr, data.batch)
        for bi in range(dense_data.X.size(0)):
            atom_types, edge_types = dense_data.X[bi], dense_data.E[bi]
            types_idx = torch.argmax(atom_types, dim=1)
            edge_idx = torch.argmax(edge_types, dim=2)
            SMILES_str = mol2smiles(build_molecule(types_idx, edge_idx, self.dataset_infos.atom_decoder))
            print(SMILES_str)

According to my statistics in the table below, many of the reconstructed SMILES strings contain '.' (that is, the molecule is split into disconnected fragments), and on the Moses and Guacamol datasets most of them do. As a result, my VAE model trains poorly and gets low validity scores.

Dataset     Total       SMILES_str == None   '.' in SMILES_str
QM9         97,734      675                  16,108
Moses       1,584,663   165,151              1,384,182
Guacamol    1,118,633   53,426               1,055,755
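
For reference, a tally like the one above can be sketched as follows. This is a minimal sketch, not the code actually used for the table: the helper names `tally_smiles` and `fraction` are hypothetical, and it assumes each reconstructed molecule yields either a SMILES string or `None`. The second half only converts the reported counts into fractions.

```python
def tally_smiles(smiles_list):
    """Count total entries, failed conversions (None), and fragmented SMILES."""
    total = len(smiles_list)
    n_none = sum(1 for s in smiles_list if s is None)
    # In SMILES, '.' separates disconnected components of a molecule.
    n_dot = sum(1 for s in smiles_list if s is not None and "." in s)
    return total, n_none, n_dot


def fraction(part, total):
    """Share of `part` in `total`, guarding against an empty dataset."""
    return part / total if total else 0.0


# Toy input (assumed for illustration only).
toy_counts = tally_smiles(["CCO", None, "C.C", "c1ccccc1"])
print(toy_counts)

# Counts copied from the table: (total, None, contains '.').
reported = {
    "QM9": (97734, 675, 16108),
    "Moses": (1584663, 165151, 1384182),
    "Guacamol": (1118633, 53426, 1055755),
}
for name, (total, n_none, n_dot) in reported.items():
    print(f"{name}: {fraction(n_dot, total):.1%} contain '.', "
          f"{fraction(n_none, total):.1%} are None")
```

Expressed as fractions, the gap between QM9 and the two larger datasets is easier to see than in the raw counts.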

Although I can directly download training files containing SMILES strings from other resources, this raises my concern about whether the dataset is properly processed into graph data structures and whether the models trained on these graph structures are reasonable. Do you have any ideas on how this issue happened? Thank you!

TianBian95 commented 1 year ago

In addition, I noticed that you set the preprocess parameter in the Dataset class to handle the case where SMILES_str == None. This seems to indicate that converting between graph data and SMILES strings with rdkit is not completely reversible. However, what should we do about the SMILES strings that contain '.'? If we remove them, the training dataset will become very small. Can models like VAE only be trained on the original SMILES strings?

cvignac commented 1 year ago

Hello, my visualizations show that most generated molecules are connected, so this seems to be a problem with your sample code. Have you made sure that you correctly process the molecules in the batch? For example, you do not seem to mask the nodes and edges to account for the varying graph sizes.
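
To illustrate the masking point, here is a minimal, dependency-free sketch. It uses plain Python lists as a stand-in for the dense batch tensors, and it assumes that padded node rows are all zeros and that class 0 is a valid atom type. Under those assumptions, an argmax over a padded row returns 0, so every padding slot turns into an extra isolated atom, which would show up as '.' in the SMILES string. The helper names `argmax` and `collapse_graph` are hypothetical and not part of the repository.

```python
def argmax(values):
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(values)), key=lambda k: values[k])


def collapse_graph(x_onehot, e_onehot, node_mask):
    """Return (atom_types, bond_types) restricted to the real (unpadded) nodes."""
    keep = [i for i, is_real in enumerate(node_mask) if is_real]
    atom_types = [argmax(x_onehot[i]) for i in keep]
    bond_types = [[argmax(e_onehot[i][j]) for j in keep] for i in keep]
    return atom_types, bond_types


# One graph with 2 real atoms, padded to 3 nodes.
# Assumed atom classes: 0 = C, 1 = N. Assumed bond classes: 0 = no bond, 1 = single.
x = [[0, 1], [1, 0], [0, 0]]          # N, C, padding (all zeros)
no_bond, single, pad = [1, 0], [0, 1], [0, 0]
e = [
    [no_bond, single, pad],
    [single, no_bond, pad],
    [pad, pad, pad],
]
mask = [True, True, False]

# Without the mask, the padding row is read as a third atom of class 0.
unmasked_atoms = [argmax(row) for row in x]
print(unmasked_atoms)                 # [1, 0, 0]

# With the mask, only the two real atoms and their bond survive.
masked_atoms, masked_bonds = collapse_graph(x, e, mask)
print(masked_atoms, masked_bonds)     # [1, 0] [[0, 1], [1, 0]]
```

In the training loop from the question, the analogous step would be to use the returned `node_mask` to restrict each graph to its real nodes before calling `build_molecule`. Whether this fully explains the counts in the table is not confirmed in this thread.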