cvignac / DiGress

code for the paper "DiGress: Discrete Denoising diffusion for graph generation"
MIT License

DiGress for other datasets #21

Closed anar-rzayev closed 1 year ago

anar-rzayev commented 1 year ago

Hi Clement,

Thank you for the regular updates on your paper. I wanted to ask about applying DiGress to datasets other than the ones used in the experiments (MOSES, QM9, and so forth). What I have is train/valid/test sets of SMILES strings, along with ten reaction types, from the US patent literature. I tried to use abstract_dataset.py to convert the strings to graph representations, but I ran into a lot of errors.

Would you recommend preparing a MOSES-style .csv file (with SMILES and Split columns) and using moses_dataset.py in a similar fashion for my dataset? FYI, it is USPTO-50k: https://github.com/vsomnath/graphretro/tree/main/datasets/uspto-50k

cvignac commented 1 year ago

Hello, I think that you can mostly follow the code of guacamol_dataset.py. The dataset is simply a list of SMILES, so you should be able to process it in a similar way.
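
As a rough illustration of the "list of SMILES" preparation step, here is a minimal sketch that cleans a plain-text file with one SMILES per line and cuts it into train/valid/test files. It is not taken from the DiGress code base: the function names, the split fractions, and the output file names (`train.smiles` and so on) are assumptions, so the names would need to match whatever the dataset class you adapt actually reads.

```python
import random
from pathlib import Path


def read_smiles(path):
    """Return the non-empty, de-duplicated lines of a text file, in file order."""
    seen = set()
    smiles = []
    for line in Path(path).read_text().splitlines():
        s = line.strip()
        if s and s not in seen:
            seen.add(s)
            smiles.append(s)
    return smiles


def split_smiles(smiles, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of the list and cut it into train/valid/test parts."""
    shuffled = list(smiles)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


def write_splits(splits, out_dir):
    """Write one SMILES per line to <out_dir>/<split>.smiles (hypothetical names)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, items in splits.items():
        (out_dir / f"{name}.smiles").write_text("\n".join(items) + "\n")
```

A call such as `write_splits(split_smiles(read_smiles("my_smiles.txt")), "data/my_dataset")` would then produce the three files. This sketch does not check that each string is a valid molecule; that check would still have to happen wherever the SMILES are converted to graphs.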

Clement

anar-rzayev commented 1 year ago

Thanks, Clement, for the pointer to guacamol_dataset.py and for updating the README with instructions on using new datasets with DiGress. I have a follow-up question about the generated samples folder.

Suppose I have SMILES strings from another dataset that are ready for graph construction, in a file with exactly the same format as digress_guacamol_smiles.txt or new_train/valid/test.smiles. Could I use guacamol_dataset.py directly on my own filtered SMILES strings, scraped from the web and saved in a .txt file, instead of writing another .py file to generate the molecular graphs?

Thanks

najwalb commented 1 year ago

@anar-rzayev FYI, USPTO-50k actually contains reaction SMARTS, not molecule SMILES. You would most likely need to write your own process function for it. I don't think it would work with the MOSES processing script either.
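
To make the difference concrete: a reaction string in the usual `reactants>agents>products` layout has to be split into its molecules before any per-molecule processing can run. The sketch below shows only that splitting step, using the standard library. The function name `split_reaction` is hypothetical, and the sketch assumes each record is a single string in that three-part layout; how the USPTO-50k files store it (column names, atom mapping) is not covered here.

```python
def split_reaction(rxn):
    """Split 'reactants>agents>products' into three lists of molecule strings.

    Molecules within each part are separated by '.'; an empty part
    (as in 'A.B>>C') yields an empty list.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(
            f"expected two '>' separators, got {len(parts) - 1}: {rxn!r}"
        )
    return tuple([m for m in part.split(".") if m] for part in parts)


# Esterification written as a reaction string, with no agents listed.
reactants, agents, products = split_reaction("CC(=O)O.OCC>>CC(=O)OCC.O")
```

Each resulting string could then be handed to a molecule-level routine, but whether reactants, products, or both should become graphs depends on the task, which is why a custom process function is likely needed.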