OptiMaL-PSE-Lab / EvalRetro

A repository for evaluating single-step retrosynthesis algorithms
MIT License
13 stars 2 forks source link

missing par.json #8

Closed thegodone closed 1 week ago

thegodone commented 3 weeks ago

running the plotting.py script produce a error cause par.json is not provided.

Also jupyter installation is needed to run plotting.

Installation: on linux

(evalretro) ubuntu@l4-ubuntu2204:~/EvalRetro$ python plotting.py
Traceback (most recent call last):
  File "/home/ubuntu/EvalRetro/plotting.py", line 46, in <module>
    with open(os.path.join(config_path,"par.json"), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/EvalRetro/config/par.json'
fredhastedt commented 2 weeks ago

Thanks for raising this issue!

I corrected the reference to the config file - running the plotting.py file should now work to reproduce the plots for the USPTO 50k results. I added a line in the .yml file to directly install the necessary dependencies (IPython). Hopefully, this should resolve the missing dependency issue. These changes can be seen in commit 784f90b.

Please let me know in case this is not working and/or if you find any other issues.

thegodone commented 2 weeks ago

I have a question regarding the speed of the Round-trip, this is really the bottleneck of the workflow. I used both linux or mac arm env and basically it is 1-3hrs run. Is there a way to optimise this part ?

I think I may have an issue there now ?

2024-07-09 16:09:27,629 - __main__ - ERROR - MEGAN does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - GraphRetro does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - RetroXpert does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - G2Retro does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - Chemformer does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - Graph2Smiles does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - Retroformer does not have a Round-trip.csv file
2024-07-09 16:09:27,629 - __main__ - ERROR - GTA does not have a Round-trip.csv file
2024-07-09 16:09:27,630 - __main__ - ERROR - TiedTransformer does not have a Round-trip.csv file
2024-07-09 16:09:27,630 - __main__ - ERROR - GLN does not have a Round-trip.csv file
2024-07-09 16:09:27,630 - __main__ - ERROR - LocalRetro does not have a Round-trip.csv file
2024-07-09 16:09:27,630 - __main__ - ERROR - MHNReact does not have a Round-trip.csv file
fredhastedt commented 2 weeks ago
  1. The bottleneck for the round-trip metric is caused by the rexgen (GCN) forward model. Unfortunately, inference for this model is done for each reactant set at a time, rather than using a batch of reactant sets. This additional for loop will lead to a larger computational overhead (probably around 3x). To fix this, one could consider using a different forward prediction model. I was considering adding another model (such as LocalTransform) to the src code.

One would however have to redo the experiments on the USPTO-50k and Pararoutes datasets for the different retrosynthesis algorithms. Nonetheless, it would be interesting to see if the results agree for different forward prediction models.

  1. It seems to me that you tried to run plotting.py without recreating the results for the USPTO-50k datasets. The files on the figshare only consist of the predictions. To plot the figures and re-create the tables (e.g. for the round-trip accuracy), one has to generate the .csv files (results) by running python main.py --k_back 10 --k_forward 2 --invsmiles 20 --fwd_model 'gcn' --config_name 'raw_data.json' --quick_eval False
fredhastedt commented 2 weeks ago

Adding to the previous comment: Looking at LocalTransform's GitHub, it seems that inference is also done on a single reactant basis. To fix the bottleneck for computational speed, one would either have to rewrite the source code to allow for batch inference or use an existing model that can take a batch of different reactant sets during inference, e.g., 10 reactant sets per product molecule. Are you aware of any such forward prediction model? If so, I could try to integrate it within the workflow.

thegodone commented 2 weeks ago

You could use our direct model (Table 4 Comparison of recently published methods for direct synthesis prediction on the USPTO-MIT set) from AT-Transformer paper https://www.nature.com/articles/s41467-020-19266-y. it seems not to be available in https://github.com/bigchem/synthesis I can ask igor to add it. What do you think ? I may also have a keras version of it indeed (compatible to my arm64 mps cards).

fredhastedt commented 2 weeks ago

That would be great, thanks!

I may have to re-train the AT-model on the USPTO-MIT without reagent information, as retrosynthesis predictions usually do not include reagents. It is a possibility that the direct forward model's accuracy decreases without the reagent information.

fredhastedt commented 1 week ago

@thegodone

Hi - I have just made some changes to the dev_v3.10 branch where I pushed the LocalTransform (localt_forward) model to the repo. This is yet not fully tested (and ideally I would redo the rt-accuracy metric analysis for this forward model). I will attempt to fully integrate this model within the next couple of days.

In terms of speed-up: The model runs roughly 2x faster than the gcn_forward model as it uses batch inference. Hope this helps!

thegodone commented 1 week ago

Hi Friedrich,

That would be so great if I can have it before the end of the week.

Thanks a lot for your help.

fredhastedt commented 1 week ago

Hi Guillaume,

I have now merged the pull request from the development branch. In particular, there are two major changes:

  1. The _gcnforward algorithm for the USPTO-50k database now takes around 50 minutes for inference time.
  2. The LocalTransform forward model was added as _lctforward. It takes around 1 hr for inference time, but is more accurate compared to gcn_forward.

I will still have to test the rt-accuracy results for the_lctforward model on the USPTO-50k. This will be done within the coming days and results will be added to the README.md file.

In the meantime, if you find any bugs, please do let me know!

fredhastedt commented 1 week ago

@thegodone

I will close this issue, as the initial problem was fixed. In case you find another unrelated bug, please feel free to open another one. Thanks a lot!