YZY-stack / DF40

Official repository for the next-generation deepfake detection dataset (DF40), comprising 40 distinct deepfake techniques, including the most recently released SoTA methods. Our work has been accepted by NeurIPS 2024.

Incorrect split division #14

Open yermandy opened 2 weeks ago

yermandy commented 2 weeks ago

I noticed your test and val splits intersect in the JSON files you provided. This is a very unfortunate situation!
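For concreteness, here is a minimal sketch of how such an overlap can be checked, assuming a hypothetical JSON layout in which each split name maps to a list of sample paths (the filename and key names are assumptions, not the actual DF40 schema):

```python
import json

# Load one of the provided split files (filename is hypothetical).
with open("frame_splits.json") as f:
    splits = json.load(f)

# Assumed layout: {"train": [...], "val": [...], "test": [...]}.
val_items = set(splits["val"])
test_items = set(splits["test"])

overlap = val_items & test_items
print(f"val: {len(val_items)}  test: {len(test_items)}  overlap: {len(overlap)}")
```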

YZY-stack commented 3 days ago

Hi. Thanks for your great question. We would like to clarify the following:

In our dataset, the validation and test sets are identical. This decision is based on the following rationale:

The most significant challenge in deepfake detection is the generalization issue, which occurs when a model is trained on one fake domain and tested on another. In contrast, evaluating within the same domain is relatively straightforward, as most previous studies have achieved around 99% AUC when the training and testing fake domains are identical.

To address the generalization issue, our benchmark incorporates four distinct scenarios/protocols, described in more detail below.

In the generalization scenarios, it makes sense to use the within-domain testing data as the validation set, since the actual test data lies outside the training domains (as in standard domain generalization setups). We haven't specifically considered the scenario where both the data domains and the forgery methods are the same. However, we plan to address this in the near future by creating a separate validation set to evaluate that case.

This approach will provide a more comprehensive evaluation of the models' performance across various conditions.

yermandy commented 2 days ago

@YZY-stack The main point of a validation set is to tune hyperparameters and to use it for early stopping and model selection. When you say that

In our dataset, the validation and test sets are identical

it makes no sense to add both of them to the JSON files. You should call both either the validation set or the test set; otherwise you confuse the community, which will eventually lead to unfair and biased results where people tune their models on test data.
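To illustrate the split discipline being described, here is a toy scikit-learn sketch (unrelated to the DF40 code or any actual deepfake detector): hyperparameters are chosen on the validation split, and the test split is touched only once for the final report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for real/fake samples.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model selection uses only the validation split.
best_auc, best_model = 0.0, None
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_model = auc, model

# The test split is used exactly once, for the final number that gets reported.
print("test AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
```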

YZY-stack commented 1 day ago

@yermandy Sorry for the confusion. Our main focus is on addressing the generalization issue in deepfake detection, which has been identified as one of the primary challenges in this field. Many existing works on deepfake detection aim to tackle this issue.

To better convey our setting, let’s use a simple example.

Suppose we have five different forgery methods: A, B, C, D, and E. Protocol-1 in our work involves training the model using forgery methods A, B, and C, and testing it on methods D and E. This is referred to as cross-forgery evaluation.

In this scenario, the validation set consists of samples from the same methods used for training (i.e., methods A, B, and C), while the test set is composed of samples from different forgery methods (i.e., methods D and E). Although the validation set is drawn from the within-domain test split of A, B, and C, the test set used for final evaluation contains entirely different forgery methods (D and E), ensuring that the model is evaluated on previously unseen methods.

Therefore, this setup can ensure that there is no data leakage, as the training (ABC) and testing (DE) are conducted on distinct forgery methods.
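A minimal sketch of this split logic, using the hypothetical method names above and an assumed `records` dict mapping each forgery method to its per-split samples (this is not the actual DF40 data structure):

```python
TRAIN_METHODS = {"A", "B", "C"}   # forgery methods seen during training
TEST_METHODS = {"D", "E"}         # unseen forgery methods for cross-forgery evaluation

def build_protocol1_splits(records):
    """records: {method: {"train": [...], "test": [...]}} (assumed layout)."""
    train = [s for m in TRAIN_METHODS for s in records[m]["train"]]
    # Within-domain test data of the training methods serves as the validation set.
    val = [s for m in TRAIN_METHODS for s in records[m]["test"]]
    # Final evaluation uses only methods never seen during training.
    test = [s for m in TEST_METHODS for s in records[m]["test"]]
    return train, val, test
```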

For a within-domain evaluation, where both training and testing are conducted on methods A, B, and C, we agree that using the test set for validation would not make sense. In such cases, we plan to further split the training set into separate training and validation subsets, and we will update our results for those cases. However, most of our evaluations focus on cross-domain scenarios for generalization (see protocols 1-4).
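A short sketch of that within-domain alternative: carve a validation subset out of the training samples of A, B, and C instead of reusing their test data (the helper name and the 10% fraction are assumptions for illustration):

```python
import random

def split_train_val(train_samples, val_fraction=0.1, seed=0):
    """Hold out a fraction of the training samples as a validation subset."""
    samples = list(train_samples)
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_fraction)
    return samples[n_val:], samples[:n_val]   # (reduced train set, validation set)
```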

We will update the README file to clarify this protocol and avoid potential confusion.

Thank you for your observation and suggestion.

yermandy commented 1 day ago

Thank you. I was confused exactly by the same-manipulation, same-domain case, e.g., train on FS (FF) and test on FS (FF), which is part of protocol-1, the first protocol I started with.

YZY-stack commented 52 minutes ago

Yes, I understand. Thanks for your comment.

Actually, our work primarily focuses on cross-domain generalization scenarios. The four evaluation protocols proposed in our work are: (1) cross-manipulation (protocol-1); (2) cross-data (protocol-2); (3) cross both manipulation and data (protocol-3); and (4) one-v-all, where the model is trained on one manipulation and tested on the other methods (protocol-4).

Note that all four of these protocols are designed to perform evaluation solely in cross-domain settings. Many previous works have pointed out that within-domain tasks can be relatively easy for detection methods, while cross-domain tasks can be extremely challenging (ref[1], ref[2]). That is why we designed the four protocols mentioned above.

Regarding the within-domain results in protocol-1, we understand that including these results could cause confusion. We therefore plan to remove the within-domain results from protocol-1 and clarify this point, as discussed earlier. Additionally, we have removed all validation records from all JSON files, which we hope makes this clearer.
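For anyone updating local copies, a hedged sketch of that cleanup, assuming the split files live in a `splits/` directory and carry a top-level "val" key (both are assumptions; adjust to the real layout):

```python
import json
import pathlib

for path in pathlib.Path("splits").glob("*.json"):
    data = json.loads(path.read_text())
    if "val" in data:                     # drop the duplicated validation entry
        del data["val"]
        path.write_text(json.dumps(data, indent=2))
```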

Thank you for pointing this out.