Large performance gap in MD17/22 dataset

atomicarchitects / equiformer_v2

[ICLR 2024] EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

https://arxiv.org/abs/2306.12059

MIT License


Large performance gap in MD17/22 dataset #12

Closed TommyDzh closed 1 week ago

TommyDzh commented 7 months ago

Thank you for the great work on EquiformerV2. When I test its performance on the MD17/22 datasets, I find it lags far behind SOTA models like VisNet. For example, on MD22_AT_AT, VisNet's val loss converges to 0.14 for E and 0.17 for F, while EqV2's val loss is 4.7 for E and 5.1 for F. I follow the settings in oc20/configs/s2ef/all_md/equiformer_v2/equiformer_v2_N@8_L@4_M@2_31M.yml. Is there anything I need to modify to adopt EqV2 for MD datasets? Thanks.

yilunliao commented 7 months ago

Hi @TommyDzh

1. Can you check whether the training loss/MAE of EquiformerV2 matches that of VisNet? In the config you mentioned, we used regularization such as dropout (alpha_drop) and stochastic depth (drop_path). These regularizations help on OC20 but can prevent training from converging on other datasets. You can check the Equiformer paper to see how I set some of the hyper-parameters.
2. Moreover, for a fair comparison, it would be simpler to use the same radial basis functions and cutoff radius.
3. I think you are using gradient methods to predict forces. If so, I think you need to remove .detach() as here, here and here. These detach() calls zero out the gradients with respect to the relative positions, so the network uses only the relative distances (the magnitudes of the relative positions) to predict forces (we still have gradients through the radial basis functions).
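The effect of .detach() on gradient-based forces can be illustrated with a toy two-atom energy. This is only a sketch under stated assumptions, not the EquiformerV2 code: `toy_energy`, the exponential "radial basis" and the single "angular feature" are hypothetical stand-ins. It suggests that once the direction of the relative position is detached, the force obtained from autograd depends only on the distance and therefore points along the bond.

```python
import torch


def toy_energy(pos, detach_direction):
    """Hypothetical two-atom energy with a radial and an angular factor."""
    rel = pos[1] - pos[0]          # relative position vector
    dist = rel.norm()              # relative distance (magnitude)
    direction = rel / dist         # unit vector feeding the angular feature
    if detach_direction:
        # Mimics a .detach() on the geometric input: no gradient flows
        # back to the positions through the direction.
        direction = direction.detach()
    radial = torch.exp(-dist)      # stand-in for a radial basis function
    angular = direction[0]         # stand-in for an angular feature
    return radial * (1.0 + angular)


def forces(detach_direction):
    """Gradient-based forces: F = -dE/dpos via autograd."""
    pos = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]], requires_grad=True)
    energy = toy_energy(pos, detach_direction)
    (grad,) = torch.autograd.grad(energy, pos)
    return -grad


f_detached = forces(detach_direction=True)
f_full = forces(detach_direction=False)
print("detached:", f_detached[1])  # x and y components are equal (along the bond)
print("full:    ", f_full[1])      # x and y components differ (angular term contributes)
```

In the detached case the force on the second atom is parallel to the bond vector (1, 1, 0), because only the radial factor carries a gradient. In the full case the angular factor adds a component that is not parallel to the bond.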

Feel free to ask if you have other specific questions.

TommyDzh commented 7 months ago

Thank you for your reply!

1. For VisNet I use both MSE for E and F. For EqV2, I have tried both settings for VisNet and the one in oc20/configs/s2ef/all_md/equiformer_v2/equiformer_v2_N@8_L@4_M@2_31M.yml. But the trends are similar. I will check further and follow the hyper-parameters in Equiformer.
2. I will check it.
3. I have tried both regress_force in EqV2 and the gradient-based method to predict forces and see similar gaps. I will remove .detach() and try the gradient-based method again. But I wonder, in your experience, how much does regress_force lag behind the gradient-based method? Would different force prediction methods cause such a large gap?

Anyway, your prompt response is greatly appreciated! I will give you further feedback once I have corrected everything above!

TommyDzh commented 7 months ago

For your reference, here is the val loss curve. The blue line is EqV2 using regress_force; the green line is VisNet using gradient-based force prediction.

yilunliao commented 7 months ago

For VisNet I use both MSE for E and F. For EqV2, I have tried both settings for VisNet and the one in oc20/configs/s2ef/all_md/equiformer_v2/equiformer_v2_N@8_L@4_M@2_31M.yml.

I don't understand this. Also, the link is broken. What I said is that strong regularization can prevent fitting the training set, so you need to check the results on the training set, not the validation set.
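One way to run this check is to switch the regularization off and compare training and validation errors. The following is a minimal sketch with assumptions: the key names `alpha_drop` and `drop_path` follow the wording used in this thread and may be spelled differently in the actual YAML config, and the flat dictionary layout and placeholder values are hypothetical.

```python
import copy

# Hyper-parameter names as written in this thread; the exact keys in the
# real config file are an assumption and should be checked there.
REGULARIZATION_KEYS = ("alpha_drop", "drop_path")


def disable_regularization(model_config):
    """Return a copy of the config with the listed regularizers set to 0."""
    cfg = copy.deepcopy(model_config)
    for key in REGULARIZATION_KEYS:
        if key in cfg:
            cfg[key] = 0.0
    return cfg


def is_underfitting(train_mae, reference_train_mae, tolerance=0.1):
    """True if the training MAE is clearly worse than a reference model's."""
    return train_mae > reference_train_mae * (1.0 + tolerance)


# Placeholder values for illustration only, not taken from the real config.
example_config = {"alpha_drop": 0.1, "drop_path": 0.05, "num_layers": 8}
print(disable_regularization(example_config))
```

If the training MAE remains far above the reference even with regularization disabled, the gap is more likely caused by something else, such as a bug or a mismatch in the setup.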

Using direct methods is better than using gradient methods, as mentioned by some work on OC20. I don't think there should be such a gap if there is no bug.