Nourollah opened this issue 2 weeks ago
Hi Masoud,
I would recommend using the CLI to reproduce the benchmarks.
To train a model on QM9, you can simply use the command `spktrain experiment=qm9_atomwise`.
Similarly, you can train models on MD17. It is possible that some default hyperparameters have changed by now, so please make sure that the hyperparameters in the config.yaml file are identical to those used in the paper.
Best regards, Jonas
Hello Jonas,
I spotted this thread and am aware that I can use the CLI to replicate the benchmarks. However, I am encountering difficulties in reproducing the benchmarks directly through the code. I have verified the hyperparameters, and they align with those presented in the paper. If there is a discrepancy between the code and the CLI, I would appreciate insights into its potential causes. Additionally, I am curious about the possibility of incorporating another model into the project for comparative analysis with PaiNN or SchNet.
In my opinion, there should be a mechanism to replicate the CLI-generated results through the code or Jupyter Notebooks.
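For context, this is roughly how I am setting up the QM9 training in code, following the tutorial notebooks (a simplified sketch; the path, split sizes, learning rate, and number of epochs below are placeholders rather than my exact values):

```python
import torch
import torchmetrics
import pytorch_lightning as pl

import schnetpack as spk
import schnetpack.transform as trn
from schnetpack.datasets import QM9

cutoff = 5.0
n_atom_basis = 128

# Data module (path and split sizes are placeholders)
qm9data = QM9(
    './qm9.db',
    batch_size=100,
    num_train=110000,
    num_val=10000,
    transforms=[
        trn.ASENeighborList(cutoff=cutoff),
        trn.RemoveOffsets(QM9.U0, remove_mean=True, remove_atomrefs=True),
        trn.CastTo32(),
    ],
    load_properties=[QM9.U0],
)

# PaiNN representation + atomwise readout for U0
painn = spk.representation.PaiNN(
    n_atom_basis=n_atom_basis,
    n_interactions=3,
    radial_basis=spk.nn.GaussianRBF(n_rbf=20, cutoff=cutoff),
    cutoff_fn=spk.nn.CosineCutoff(cutoff),
)
pred_u0 = spk.atomistic.Atomwise(n_in=n_atom_basis, output_key=QM9.U0)

model = spk.model.NeuralNetworkPotential(
    representation=painn,
    input_modules=[spk.atomistic.PairwiseDistances()],
    output_modules=[pred_u0],
    postprocessors=[
        trn.CastTo64(),
        trn.AddOffsets(QM9.U0, add_mean=True, add_atomrefs=True),
    ],
)

task = spk.task.AtomisticTask(
    model=model,
    outputs=[
        spk.task.ModelOutput(
            name=QM9.U0,
            loss_fn=torch.nn.MSELoss(),
            loss_weight=1.0,
            metrics={"MAE": torchmetrics.MeanAbsoluteError()},
        )
    ],
    optimizer_cls=torch.optim.AdamW,
    optimizer_args={"lr": 5e-4},  # placeholder learning rate
)

trainer = pl.Trainer(max_epochs=1000)  # placeholder trainer settings
trainer.fit(task, datamodule=qm9data)
```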
Warm regards, Masoud
Hello,
I am updating this thread. I have double-checked everything, and it appears that I am reproducing the MD17 results correctly; in fact, I obtained better results than the ones reported in the paper (which used three different random seeds). This suggests that I may be making a mistake somewhere, or that the project has been updated and the hyperparameters or models now work more effectively due to a change in the PyTorch version or another factor. For QM9, the data appears to be slightly different from the data I can use with PyTorch Geometric or the SchNet repository whose results are reported in the paper.
Hi @Nourollah,
I am happy to hear that you could reproduce the results of the paper. Of course you can do this without the CLI, but one needs to be careful to set all hyperparameters correctly. Differences in performance (usually providing better results) can still occur, because some default parameters have changed in SchNetPack. This mainly affects the activation functions, number of interaction blocks, feature dimensions and also other training parameters.
What exactly do you mean by differences in the QM9 dataset? The dataset has an argument `remove_uncharacterized` which lets you filter out invalid structures. This might cause the differences.
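For example, something along these lines (a minimal sketch; the path and split sizes are placeholders):

```python
from schnetpack.datasets import QM9
import schnetpack.transform as trn

# QM9 with the uncharacterized molecules filtered out
# (path and split sizes are placeholders; the filter is applied when the
# database is built, so an existing qm9.db may need to be recreated)
qm9data = QM9(
    './qm9.db',
    batch_size=100,
    num_train=110000,
    num_val=10000,
    remove_uncharacterized=True,
    transforms=[trn.ASENeighborList(cutoff=5.0), trn.CastTo32()],
    load_properties=[QM9.U0],
)
```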
Best, Stefaan
Hello @stefaanhessmann,
Thank you for your detailed explanation. I agree that both methods use the same base code, so similar results are to be expected. However, the hyperparameters you mentioned are remarkably similar to those presented in the paper, and I am using the same code for QM9 and MD17. Consequently, I would expect results close to the paper for QM9, just as I obtained for MD17 (which are marginally better than the paper; since the hyperparameters are identical, I do not believe the discrepancy can be attributed to them).
Following your suggestion about the `remove_uncharacterized` subset of the dataset, I ran experiments training PaiNN with and without this subset. Based on the results obtained, I see no significant difference between the two approaches. Notably, the results are substantially worse than those presented in the paper (approximately 10 times worse with seed 2000 and 12 times worse with seed 42). I have ensured that the number of interaction blocks, activation functions, and feature dimensions align precisely with the repository referenced in the PaiNN paper (Schütt et al., 2021).
When I plotted the mean and standard deviation of U0 for QM9 as provided by the SchNetPack dataset and by the PyTorch Geometric dataset, I observed distinct differences, and I suspect this discrepancy may be the root cause of the issue. Additionally, I attempted to modify the units, but this made no noticeable difference in reproducing the paper's results.
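For reference, I compared the statistics roughly like this (a simplified sketch; the paths are placeholders, and I am assuming U0 is target column 7 of PyTorch Geometric's QM9, as in its documentation):

```python
import torch
import schnetpack.transform as trn
from schnetpack.datasets import QM9
from torch_geometric.datasets import QM9 as PyGQM9

# SchNetPack QM9 (path and split sizes are placeholders)
qm9_spk = QM9(
    './qm9.db',
    batch_size=100,
    num_train=110000,
    num_val=10000,
    transforms=[trn.ASENeighborList(cutoff=5.0), trn.CastTo32()],
    load_properties=[QM9.U0],
)
qm9_spk.prepare_data()
qm9_spk.setup()
# iterate over the whole dataset (slow, but fine for a one-off check)
u0_spk = torch.tensor(
    [qm9_spk.dataset[i][QM9.U0].item() for i in range(len(qm9_spk.dataset))]
)

# PyTorch Geometric QM9 (assuming U0 is column 7 of the 19 regression targets)
qm9_pyg = PyGQM9(root='./pyg_qm9')
u0_pyg = torch.cat([data.y for data in qm9_pyg], dim=0)[:, 7]

# note: the two pipelines may store U0 in different units (e.g. Hartree vs. eV),
# so the raw statistics should be compared with that in mind
for name, u0 in [("schnetpack", u0_spk), ("pyg", u0_pyg)]:
    print(f"{name}: mean={u0.mean().item():.4f}, std={u0.std().item():.4f}")
```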
OK, we will look further into this. If the model is worse by a factor of 10 or 12, it might be related to some issue with unit conversions. How do you evaluate the test data? Did you check whether a unit conversion from kcal/mol to eV (or vice versa) might be involved? Could you also add the exact test results that you observed? And what version of SchNetPack did you use?
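As a quick sanity check of the conversion factors, you can use ASE's unit constants (ASE is a dependency of SchNetPack); a minimal sketch:

```python
# Common energy unit conversions via ASE (eV is ASE's base energy unit).
from ase.units import Hartree, eV, kcal, mol

print("1 Hartree in eV:  ", Hartree / eV)        # ~27.211
print("1 kcal/mol in eV: ", (kcal / mol) / eV)   # ~0.0434
print("1 eV in kcal/mol: ", eV / (kcal / mol))   # ~23.06
```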
I will link @ken-sc01 here
Hi,
I am having trouble reproducing the QM9 and MD17 results in the PaiNN paper (Schütt et al., 2021). Is there a standard way I can do this? I have managed to reproduce the QM9 results with the original repository linked with the paper on the ICML website.
The code in the tutorials (Jupyter notebooks) does not seem to even approximately reproduce the values for either dataset.
I would really appreciate some help.
Best wishes, Masoud