Closed SsGood closed 2 years ago
Thank you!
This is pretty easy with scipy.stats. You train your model multiple times, with a different initialization and train/val split each time. For each run you calculate the accuracy on the same test set. You should fix the set of seeds you use for the train/val splits, since otherwise your variance will be too high. You can then use scipy.stats.ttest_rel between the two sets of results (one per model). The results are paired because both models use the same set of seeds for the train/val splits.
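The procedure above can be sketched roughly as follows. The accuracy values and the number of seeds here are made-up placeholders; in practice each entry would be the test-set accuracy of one training run with the corresponding seed.

```python
import numpy as np
from scipy import stats

# Hypothetical test accuracies for two models, one entry per
# train/val-split seed. The seeds (and therefore the splits) are
# identical for both models, which is what makes the test "paired".
acc_model_a = np.array([0.81, 0.83, 0.80, 0.82, 0.84])
acc_model_b = np.array([0.79, 0.80, 0.78, 0.81, 0.80])

# Paired t-test on the per-seed accuracy differences.
t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A small p-value (e.g. below 0.05) would indicate that the difference between the two models on this dataset is unlikely to be explained by the split/initialization noise alone.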
The final statement is then: "This model is significantly different from that model on this dataset, across data splits and model initializations." You cannot make such claims for other datasets or for a class of datasets.
Note that you should only look at your test set once, and only evaluate one final model per method in this test. Otherwise you are doing multiple testing and the resulting p-value is invalid.
For model development you should only look at your validation set and not do t-tests like this.
Thanks for your reply, it helps me a lot!
Hi, this is wonderful work, and the open-source code is clear! However, I am a bit confused about some of the experimental procedures due to my lack of knowledge of significance testing. Specifically, in Sect. 5, how do you calculate the p-values of the paired t-test for your main claims? Would you be able to give a detailed explanation or share the code for this calculation? I would appreciate it!