Closed anar-rzayev closed 1 year ago
Hello,
I think you can mostly follow the code in guacamol_dataset.py. The dataset is simply a list of SMILES strings, so you should be able to process it in a similar way.
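As a rough illustration of the "list of SMILES" input, here is a minimal sketch of a loader. It assumes one SMILES string per line, optionally followed by whitespace and a name column. The helper name `read_smiles_lines` is hypothetical and is not part of DiGress or guacamol_dataset.py.

```python
from typing import Iterable, List


def read_smiles_lines(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-separated token of every non-empty line.

    Assumption: one molecule per line, SMILES first. Hypothetical helper,
    not taken from the DiGress code base.
    """
    smiles = []
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            smiles.append(tokens[0])
    return smiles
```

With a real file this could be called as `read_smiles_lines(open(path))`, where `path` points to your own SMILES file.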
Clement
Thanks, Clement, for the feedback on guacamol_dataset.py and for the updated README on using new datasets with DiGress. I have a follow-up question regarding the generated samples folder.
If I have a set of SMILES strings from another dataset that is ready for graph construction (in exactly the same format as digress_guacamol_smiles.txt or new_train/valid/test.smiles), could I use guacamol_dataset.py directly instead of creating another
Thanks
@anar-rzayev FYI, USPTO-50k actually contains reaction SMARTS, not molecule SMILES. You would most likely need to write your own process function for it. I don't think it would work with the MOSES processing script either.
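To make the difference concrete, here is a minimal sketch of a first preprocessing step. It assumes each entry is a reaction string in the usual `reactants>agents>products` layout, with molecules on each side separated by `.`. The function name `split_reaction` is hypothetical and is not part of DiGress.

```python
from typing import List, Tuple


def split_reaction(rxn: str) -> Tuple[List[str], List[str]]:
    """Split a 'reactants>agents>products' string into molecule strings.

    Returns (reactants, products); the agents field is ignored.
    Hypothetical helper for illustration only.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"not a reaction string: {rxn!r}")
    reactants, _agents, products = parts
    # '.' separates the individual molecules on each side of the reaction
    return (
        [s for s in reactants.split(".") if s],
        [s for s in products.split(".") if s],
    )
```

The resulting strings may still carry atom-map numbers (for example `[CH3:1]`), so they would probably need to be cleaned with a cheminformatics toolkit before being treated as plain molecule SMILES. That step is not shown here.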
Hi Clement,
Thank you for the regular updates on your paper. I wanted to ask about applying DiGress to datasets other than MOSES, QM9, and the others mentioned in the experiments. What I have is train/valid/test sets of SMILES strings, along with ten reaction types, from the US patent literature. I tried to use abstract_dataset.py to convert the strings to graph representations, but I ran into a lot of errors.
Would you recommend preparing a MOSES-style .csv file (SMILES, Split) for my dataset and using moses_dataset.py in a similar fashion? FYI, it is USPTO-50k: https://github.com/vsomnath/graphretro/tree/main/datasets/uspto-50k
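For reference, a two-column file of the kind described above could be produced along these lines. This is only a sketch: the column names `SMILES` and `SPLIT` and the split labels are assumptions based on the question, and should be checked against what moses_dataset.py actually reads. The function name `write_split_csv` is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def write_split_csv(splits: Dict[str, List[str]], out: TextIO) -> int:
    """Write one 'SMILES,SPLIT' row per molecule; return the number of rows.

    Column names are an assumption, not confirmed against DiGress.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["SMILES", "SPLIT"])
    count = 0
    for split_name, smiles_list in splits.items():
        for smi in smiles_list:
            smi = smi.strip()
            if smi:  # skip empty entries
                writer.writerow([smi, split_name])
                count += 1
    return count


# Usage with an in-memory buffer; pass an open file handle to write to disk.
buf = io.StringIO()
n_rows = write_split_csv({"train": ["CCO", " c1ccccc1 "], "test": ["CC"]}, buf)
```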