TianBian95 closed this issue 10 months ago
In addition, I noticed that you set the `preprocess` parameter in the Dataset class to handle the case where `SMILES_str == None`. This seems to indicate that converting between graph data and SMILES strings with rdkit is not completely reversible. However, what should we do about the SMILES strings that contain '.'? If we remove them, the training dataset will become very small. Can we only train models like VAE on the original SMILES strings?
Hello, my visualizations show that most generated molecules are connected, so this seems to be a problem with your sample code. Have you made sure that you correctly process the molecules in the batch? For example, you do not seem to mask the nodes and edges to account for the varying graph sizes.
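To illustrate the masking point, here is a minimal sketch in plain Python. It is not taken from the repository: `mask_dense_graph` and `count_components` are hypothetical helpers, and the padded layout (a node-type list, a dense adjacency matrix, and a boolean node mask per graph) is an assumption about how a batch might be stored. It shows how padding slots, if decoded as atoms, become isolated fragments, which would then appear as '.' in the SMILES string.

```python
def mask_dense_graph(node_types, adjacency, node_mask):
    """Drop padded nodes, and any edge touching them, from one padded graph.

    node_types: list of atom-type indices, padded to a fixed length.
    adjacency:  square list of lists; 0 means no bond, >0 is a bond type.
    node_mask:  list of bools; True marks a real (non-padding) node.
    """
    keep = [i for i, is_real in enumerate(node_mask) if is_real]
    new_index = {old: new for new, old in enumerate(keep)}
    nodes = [node_types[i] for i in keep]
    edges = []
    for pos, i in enumerate(keep):
        for j in keep[pos + 1:]:  # upper triangle only: undirected bonds
            bond = adjacency[i][j]
            if bond != 0:
                edges.append((new_index[i], new_index[j], bond))
    return nodes, edges


def count_components(num_nodes, edges):
    """Count connected components with a small union-find."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(num_nodes)})


# Hypothetical example: a 3-atom chain padded to 5 slots.
node_types = [0, 1, 0, 0, 0]  # assumption: padding reuses index 0
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
node_mask = [True, True, True, False, False]

# Ignoring the mask treats the two padding slots as real atoms.
raw_nodes, raw_edges = mask_dense_graph(node_types, adjacency, [True] * 5)
# Applying the mask keeps only the real molecule.
nodes, edges = mask_dense_graph(node_types, adjacency, node_mask)

print(count_components(len(raw_nodes), raw_edges))  # fragments without masking
print(count_components(len(nodes), edges))          # fragments with masking
```

If the unmasked count is higher than the masked one for most graphs in a batch, the extra fragments most likely come from padding and not from the dataset itself.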
Dear authors,
I want to reproduce a VAE model based on your code, but I can't reproduce the SMILES strings in the training set based on the PyTorch Geometric Data object. My code is as follows:
According to my statistics, shown in the Table below, most of the reconstructed SMILES strings contain '.', especially on the Moses and Guacamol datasets. As a result, my VAE model trains poorly and gets low validity scores.
Although I can directly download training files containing SMILES strings from other resources, this raises my concern about whether the dataset is properly processed into graph data structures and whether the models trained on these graph structures are reasonable. Do you have any ideas on how this issue happened? Thank you!
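For anyone who wants to check the same statistic on their own data, here is a rough sketch of how the share of SMILES strings containing '.' could be computed. `disconnected_fraction` is a hypothetical helper and the example strings are made up. `None` entries (failed conversions) are skipped. Note that '.' is counted purely as a character: it is the SMILES disconnection symbol, but this check does not parse the molecule.

```python
def disconnected_fraction(smiles_list):
    """Return the fraction of non-empty SMILES strings that contain '.'."""
    valid = [s for s in smiles_list if s]  # skip None and empty strings
    if not valid:
        return 0.0
    with_dot = sum(1 for s in valid if "." in s)
    return with_dot / len(valid)


# Made-up example strings, not taken from any dataset.
reconstructed = ["CCO", "CC.O", None, "c1ccccc1.Cl.O"]
fraction = disconnected_fraction(reconstructed)
print(f"{fraction:.2%} of valid SMILES contain '.'")
```

Running this on both the original SMILES file and the strings reconstructed from the graph objects would show whether the '.' characters are already in the source data or are introduced during conversion.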