SCLBD / DeepfakeBench

A comprehensive benchmark of deepfake detection

code seems to use the test dataset to select best checkpoint #91

Closed YinqiCai closed 2 months ago

YinqiCai commented 2 months ago

Hi, I read trainer.py and found that the code seems to use the test dataset for validation, while the validation dataset is not used at all.

YZY-stack commented 2 months ago

Hi, the question you raise about "the selection of validation sets" is an unresolved issue in this field, and there is no fixed answer. In the original paper, we take the top 3 best results on each of the different test sets and then report their average (a sketch of this protocol follows the list below). The reasons are as follows:

  1. We found that the point of best generalization (training on FF++ and testing on other sets) often comes very early, e.g., in the first epoch, before the model has fully fit FF++. So if we choose the checkpoint that is best on FF++, it is almost certain that this checkpoint is not the best on the other test sets. In other words, by the time performance on the FF++ validation set improves, the model has likely overfitted to some easy patterns and its performance on test sets such as CDF is very poor. A few models are exceptions, e.g., SBI, whose best checkpoint only appears near the end of training (after the 30th epoch), but such cases are relatively rare;

  2. Why the average of the top 3? If we considered only the single best result, random factors could interfere with the evaluation and the results would not be stable enough. Taking the top 3 yields more stable results.

  3. How do other works handle this? (1) Early work trained on FF++, took the checkpoint that was best on FF++, and then tested on the other sets; (2) OST and SLADD directly took the checkpoint that was best on the test set, arguing that results on the FF++ validation set are not informative because the distribution gap between training and testing is too large; (3) SBI kept the top 5 checkpoints on the FF++ validation set and used them for inference. So, at the time, we felt our protocol was a reasonable compromise.
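
For concreteness, here is a minimal, self-contained sketch of that top-3 averaging, assuming per-epoch AUC logs for each test set. The dataset names, numbers, and the `top3_avg` helper are illustrative only and are not part of the DeepfakeBench code.

```python
from statistics import mean

# auc_log[test_set] = per-epoch AUCs of the checkpoints saved during training
auc_log = {
    "CDF":  [0.71, 0.74, 0.73, 0.70, 0.69],
    "DFDC": [0.66, 0.68, 0.67, 0.65, 0.64],
    "DFo":  [0.80, 0.82, 0.81, 0.79, 0.78],
}

def top3_avg(per_epoch_aucs):
    """Average the three best AUCs reached on one test set across training."""
    return mean(sorted(per_epoch_aucs, reverse=True)[:3])

# One stabilized number per test set, as in the reporting protocol above
reported = {name: round(top3_avg(aucs), 4) for name, aucs in auc_log.items()}
print(reported)
```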

For now, we have rethought this problem and believe a better approach may be to treat test sets such as CDF as validation sets and select the checkpoint with the highest average performance across all of them (our assumption is that the checkpoint that does best on these validation sets is also likely to do best on other unseen test sets). This evaluation protocol and checkpoint-selection method have been updated in the latest DeepfakeBench code. We plan to expand the results under this protocol later and will use this metric to provide an evaluation and further clarify this point.
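
A minimal sketch of that selection rule, assuming we already have per-checkpoint AUCs on each of the test sets used as validation; the data structure and values below are placeholders, not DeepfakeBench output.

```python
from statistics import mean

# auc_per_ckpt[epoch] = AUC of that checkpoint on each validation/test set
auc_per_ckpt = {
    1: {"CDF": 0.74, "DFDC": 0.67, "DFo": 0.81},
    2: {"CDF": 0.73, "DFDC": 0.68, "DFo": 0.82},
    3: {"CDF": 0.70, "DFDC": 0.65, "DFo": 0.79},
}

# Select the checkpoint whose average AUC across all sets is highest
best_epoch = max(auc_per_ckpt, key=lambda e: mean(auc_per_ckpt[e].values()))
print(best_epoch, round(mean(auc_per_ckpt[best_epoch].values()), 4))
```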