Hey He,

Thanks for your interest!

1. We used `oa_reactdiff/trainer/train_ts1x.py` for training; `single_frag_only` was False during the training of all reactions.
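For intuition, the flag amounts to a filter along the lines of the sketch below (hypothetical field names; not the actual code in the repo):

```python
from typing import Dict, List, Set, Tuple

def n_fragments(n_atoms: int, bonds: List[Tuple[int, int]]) -> int:
    """Count connected components (molecular fragments) in a bond graph."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen: Set[int] = set()
    fragments = 0
    for start in range(n_atoms):
        if start in seen:
            continue
        fragments += 1
        stack = [start]
        while stack:  # depth-first walk over one fragment
            atom = stack.pop()
            if atom in seen:
                continue
            seen.add(atom)
            stack.extend(adj[atom] - seen)
    return fragments

def keep_reaction(rxn: dict, single_frag_only: bool) -> bool:
    """With single_frag_only=True, keep a reaction only if its reactant and
    product are each a single connected fragment."""
    if not single_frag_only:
        return True
    return (n_fragments(rxn["n_atoms"], rxn["reactant_bonds"]) == 1
            and n_fragments(rxn["n_atoms"], rxn["product_bonds"]) == 1)
```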
2. The scripts under `oa_reactdiff/evaluate` are piecemeal. As we did the structure generation and confidence ranking step by step, we do not have a pipeline at hand that does everything in one shot, which would require more engineering effort.
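If you want to wire the pieces together yourself, the control flow is roughly the following (placeholder function names standing in for the individual scripts, not an API we ship):

```python
# One-shot glue: sample several TS candidates per reaction with the diffusion
# model, score each with the confidence model, keep the top-ranked candidate.
def generate_and_rank(reactions: dict, sample_ts, confidence_score,
                      n_samples: int = 40) -> dict:
    best = {}
    for rxn_id, rxn in reactions.items():
        candidates = [sample_ts(rxn) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda ts: confidence_score(rxn, ts),
                        reverse=True)  # highest confidence first
        best[rxn_id] = ranked[0]
    return best
```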
Let me know if you have further questions!

Chenru
Hey Chenru,

Thank you for your prompt reply; it’s been very helpful! I have two follow-up questions about the evaluation process:

1. Checkpoints: `oa_reactdiff/evaluate/run_eva_ts_e_rp.sh` uses the `leftnet_2304` model, while `oa_reactdiff/evaluate/evaluate_rmsd_vs_ediff.py` uses `leftnet_2074`, and there is also a checkpoint file named `pretrained-ts1x-diff.ckpt` in the codebase. Could you clarify which checkpoint was used to generate the paper’s results?
2. Noise schedules: the schedule setting is 2 in `oa_reactdiff/evaluate/run_eva_ts_e_rp.sh` and 2.5 in `oa_reactdiff/evaluate/run_confidence_sample.sh`. Do you have any guidance on how the noise schedules were tuned for optimal performance with the OA-ReactDiff model?
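For reference, my reading is that these values are the exponent of a polynomial noise schedule, roughly as sketched below; this is my paraphrase of the common form, not necessarily the exact code:

```python
import numpy as np

def polynomial_alpha_bar(timesteps: int, power: float) -> np.ndarray:
    """Cumulative signal level alpha_bar(t) = (1 - (t/T)**power)**2.

    A larger power keeps alpha_bar close to 1 for longer, i.e. the forward
    process injects noise more gently early on.
    """
    t = np.linspace(0.0, 1.0, timesteps + 1)
    return (1.0 - t**power) ** 2

# Comparing the two settings, assuming 2 and 2.5 are this exponent:
for p in (2.0, 2.5):
    alpha_bar = polynomial_alpha_bar(1000, p)
    print(f"power={p}: alpha_bar at t=T/2 is {alpha_bar[500]:.3f}")
```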
Thanks again for your assistance!

He
Hi He,

`leftnet_2074` should be the `pretrained-ts1x-diff.ckpt`, which is what we finally used.
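If you want to double-check which weights you are holding, plain PyTorch is enough to peek inside (assuming a standard PyTorch Lightning checkpoint):

```python
import torch

# Inspect the checkpoint on CPU without instantiating any model class.
ckpt = torch.load("pretrained-ts1x-diff.ckpt", map_location="cpu")

print(sorted(ckpt.keys()))  # Lightning checkpoints also carry metadata
print("epoch:", ckpt.get("epoch"), "global_step:", ckpt.get("global_step"))
for name, tensor in list(ckpt["state_dict"].items())[:5]:
    print(name, tuple(tensor.shape))  # first few parameter names and shapes
```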
Chenru
Hi Chenru,
Thanks for your response.
He
Hello,

Thank you for sharing this incredible codebase! It’s exciting and well constructed, and I’m eager to dive into your technique. I have a few questions about the training and evaluation stages and would appreciate any insights you can offer at your convenience:

1. `single_frag_only` parameter: In the training code in `oa_reactdiff/trainer/train_ts1x.py`, I noticed that the parameter `single_frag_only` is set to True (line 89), which seems to filter out reaction cases with multiple-fragment reactants/products. Is this setting intended for the standard training setup? If so, how do you ensure the trained model can handle the multiple-fragment reactions that may appear in the test set, as discussed in the paper?
2. Data split in `oa_reactdiff/data/transition1x/`: After a quick review of the code, I suspect the validation dataset may be used in both the validation and test stages, which seems to contradict the description in the paper and might carry a slight risk of data leakage. Given that the model shows no signs of overfitting (as mentioned in the paper), this may have minimal impact on the reported metrics; still, I would like to hear your opinion on the data split and evaluation (see the sketch after this list).
3. Evaluation pipeline: the evaluation script (`oa_reactdiff/evaluate/evaluate_ts_w_rp.py`) doesn’t integrate both diffusion sampling and the confidence model’s sampling recommendation, and some hyperparameters differ slightly from the paper. Additionally, I would appreciate any guidance on reproducing Figure 3, especially on reaction selection and local TS geometry optimization.
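To make question 2 concrete, the disjoint split I would have expected looks like the sketch below (fractions and the helper name are hypothetical):

```python
import numpy as np

def three_way_split(n_reactions: int, frac_val: float = 0.1,
                    frac_test: float = 0.1, seed: int = 42) -> dict:
    """Disjoint train/val/test indices: validation is used for model
    selection only, and the test set stays untouched until the end."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_reactions)
    n_val = int(frac_val * n_reactions)
    n_test = int(frac_test * n_reactions)
    return {
        "val": idx[:n_val],
        "test": idx[n_val:n_val + n_test],
        "train": idx[n_val + n_test:],
    }
```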
Thank you very much for your help!

Best regards,
He Zhang