divelab / GOOD

GOOD: A Graph Out-of-Distribution Benchmark [NeurIPS 2022 Datasets and Benchmarks]
https://good.readthedocs.io/
GNU General Public License v3.0

Question about applying split method on new datasets #24

Closed ZhaoningYu1996 closed 6 months ago

ZhaoningYu1996 commented 6 months ago

Hi,

I am trying to apply the split method used on the HIV dataset to other molecule datasets such as BACE. However, the split results I get are not promising: performance on the size covariate OOD split is higher than on the Open Graph Benchmark data split. I want to know if there is anything I am missing.

Thank you!

CM-BF commented 6 months ago

Hi Zhaoning,

Thank you for asking! The performance will definitely differ from dataset to dataset. For example, if the size of the smallest molecule in your training set is similar to the size of the smallest molecule in your test set, then the split does not introduce much of an OOD problem. Many factors are involved, including the original data distribution and the model specification. There can also be cases where your test distribution is relatively concentrated and well covered by the training distribution. Here, "covered" means the training data similar to your test distribution is diverse enough, or does not contain strong spurious correlations, so your model can easily transfer the non-spurious correlations to the test distribution. This is one possible situation. For any further analysis, more detailed observations about your splits are necessary. :)
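To make the size-covariate idea concrete, here is a minimal sketch (not GOOD's actual implementation; the function name and ratios are illustrative) of a size-based covariate split: graphs are sorted by node count, the smallest go to training and the largest to test. Checking the gap between the largest training graph and the smallest test graph is one quick way to see how strong the induced shift is.

```python
def size_covariate_split(graph_sizes, train_ratio=0.6, val_ratio=0.2):
    """Illustrative size-covariate split: sort graph indices by size,
    assign the smallest graphs to train and the largest to test,
    creating a distribution shift along the size covariate."""
    order = sorted(range(len(graph_sizes)), key=lambda i: graph_sizes[i])
    n_train = int(len(order) * train_ratio)
    n_val = int(len(order) * val_ratio)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test

# Toy example with node counts of ten hypothetical molecules.
sizes = [12, 30, 8, 45, 22, 15, 60, 9, 27, 33]
train, val, test = size_covariate_split(sizes)
# If max(train sizes) is close to min(test sizes), the shift is mild,
# which can explain surprisingly high "OOD" performance.
size_gap = min(sizes[i] for i in test) - max(sizes[i] for i in train)
```

A small `size_gap` (or a large overlap in other covariates) would suggest the split is not very out-of-distribution for that dataset, matching the discussion above.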

Best regards, Shurui Gui

ZhaoningYu1996 commented 6 months ago

Thank you for your quick response! It helps a lot!