Closed thegodone closed 2 years ago
I added a max/min section to the score output. But since we do cross-validation, the final error is already an average over folds. However, it is not really fair to report the best validation error in cross-validation, since you should not know it: it is effectively the test data, which should stay fully blind. We could split off a small separate validation set from each fold for early stopping, but there is no guarantee that the epoch that is best there will also be good on the test set.
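The idea of carving an early-stopping validation subset out of each fold's training data could be sketched like this (the helper name and `val_frac` parameter are mine, not from the repo):

```python
import numpy as np

def split_train_val(train_idx, val_frac=0.1, seed=0):
    """Carve a small validation subset out of a fold's training indices
    for early stopping, so the fold's held-out part stays fully blind.
    Illustrative helper, not the repo's actual code."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(np.asarray(train_idx))
    n_val = max(1, int(round(val_frac * len(idx))))
    # first n_val shuffled indices become the early-stopping set
    return idx[n_val:], idx[:n_val]
```

The test-set caveat above still applies: the epoch that minimises error on this small set is only a proxy for the blind fold.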
For now, I only see a major drop in test performance on PROTEINSDataset, which starts at 0.8 accuracy and falls to around 0.7 during fitting...
Indeed, generally I merge the 5CV individual validation outputs into the full dataset and compute the RMSE only once (there is no std in this case). In most cases the 5CV mean RMSE value is very close to this combined RMSE value, except for FreeSolv, where the split has a dramatic influence on performance. But if you consider early stopping a good technique, which I do, you don't want to overfit too much on the training dataset: the model starts to memorise and becomes less generalisable, so long trainings are not "efficient" in my view. That is why the best performance on a validation set should be used to store the best model, which is then applied to a separate test set.
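For reference, the two aggregation schemes discussed here (mean of per-fold RMSEs vs. a single RMSE over the merged out-of-fold predictions) can be sketched like this; the function names are mine, not from the repo:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error of one set of predictions."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def cv_rmse_summary(fold_targets, fold_preds):
    """Compare the mean/std of per-fold RMSEs with the RMSE computed
    once over all out-of-fold predictions merged together."""
    per_fold = [rmse(t, p) for t, p in zip(fold_targets, fold_preds)]
    merged_true = np.concatenate([np.asarray(t, dtype=float) for t in fold_targets])
    merged_pred = np.concatenate([np.asarray(p, dtype=float) for p in fold_preds])
    return {
        "mean_fold_rmse": float(np.mean(per_fold)),
        "std_fold_rmse": float(np.std(per_fold)),
        "combined_rmse": rmse(merged_true, merged_pred),
    }
```

When fold errors are similar the two numbers nearly coincide; they diverge when one fold is much harder than the others, as described for FreeSolv.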
Okay, sure. I just roughly set the hyperparameters so that it never really overfits too much. Yes, the splits matter. We always use random splits, and I also note this in the text for each dataset; I think there is one split for e.g. ESOL that gets about 25% better performance. I would have to run multiple random 5-fold CVs and then average them all, but that is a bit more time-consuming.
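Running multiple random 5-fold CVs and averaging over all folds of all repeats could be sketched like this with plain NumPy (the function name and defaults are mine; scikit-learn's `RepeatedKFold` does the same job):

```python
import numpy as np

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, val_idx) pairs for several independent random
    K-fold partitions; scores would then be averaged over all
    n_splits * n_repeats folds. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        for fold in np.array_split(perm, n_splits):
            # everything not in this fold is training data
            train_idx = np.setdiff1d(perm, fold)
            yield train_idx, fold
```

Averaging over repeats smooths out lucky or unlucky splits like the ESOL one mentioned above, at the cost of n_repeats times the training time.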
I will add a Best RMSE column for the small datasets, which are split-dependent.
Okay, I added the column with a description, as you proposed. From now on I will also fix the random state for the K-fold split, so that all models ideally train on the same random splits, which should improve reproducibility and comparability.
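With scikit-learn this just means passing a fixed `random_state` to `KFold`; a minimal sketch (not the repo's actual code) showing that two runs then produce identical folds:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)

# Fixing random_state means every model is evaluated on identical folds,
# so scores stay comparable across runs and across models.
folds_a = [tuple(val) for _, val in
           KFold(n_splits=5, shuffle=True, random_state=42).split(X)]
folds_b = [tuple(val) for _, val in
           KFold(n_splits=5, shuffle=True, random_state=42).split(X)]
assert folds_a == folds_b  # identical splits on every run
```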
That makes a lot of sense now; I will close this issue.
Can you do the same for Tox21, i.e. report the max performance over all epochs too?
Done
Okay, I will add them for all stats. To show them, you simply have to copy the results folder and run `python3 summary.py --min_max True`.
When you run the tests, I saw that you use the last-epoch result, for example in FreeSolv DMPNN. I wonder if you could provide a way to also show and save the best-epoch score, which is a natural thing to do with early stopping, even without defining a true ES callback, given that you generally use lr-cycle schedules.
What do you think? I would not remove the epoch-300 (last) results, but add a new column called "best results", especially because the FreeSolv split performance is so affected by the seed.
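Tracking the best-epoch score alongside the last-epoch one needs no real ES callback; a minimal sketch (the class name and checkpoint hook are hypothetical, not from the repo):

```python
class BestScoreTracker:
    """Record the best (lowest) validation score and its epoch across
    training, so a 'best results' column can be reported next to the
    last-epoch results. Illustrative helper only."""

    def __init__(self):
        self.best_score = float("inf")
        self.best_epoch = -1

    def update(self, epoch, score):
        """Return True when this epoch improves on the best so far;
        the caller could checkpoint the model at that point."""
        if score < self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            return True
        return False
```

Training continues for the full lr-cycle schedule; only the reporting changes, which matches the proposal of keeping the last-epoch column and adding a "best results" one.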