ControlNet / LAV-DF

[CVIU] Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
https://www.sciencedirect.com/science/article/pii/S1077314223001984

How to train the MLP classifier on DFDC? #3

Closed liushenme closed 3 months ago

liushenme commented 1 year ago

Hi,

In your paper, you said that you trained an MLP classifier on the confidences of the predicted segments for deepfake classification on DFDC. Is the MLP classifier trained separately on those confidences, or is it trained as a backend together with the whole audio-visual model?

Liu

ControlNet commented 1 year ago

For the classification task, we trained BA-TFD as a temporal localization task, which means we treat each fake video in DFDC as a single fake segment labeled with the timestamps [0, video_length]. Then, from the boundary map (size 512, 40) generated by BA-TFD, we used 2 ways to get the binary label.

  1. Thresholding the maximum value.
  2. Training another 3-layer MLP to map the boundary map to the binary label.

So the MLP is trained separately to map between the label spaces.
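As a rough illustration of the label conversion described above, here is a minimal sketch. The `VideoAnnotation` record and the `to_localization_label` helper are hypothetical names made up for this example and are not the repository's metadata format. It also assumes that a real video simply gets an empty segment list, which the comment above does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class VideoAnnotation:
    """Hypothetical annotation record, not the repository's actual format."""
    video_length: float                                 # total duration of the video
    fake_segments: list = field(default_factory=list)   # list of [start, end] pairs


def to_localization_label(is_fake: bool, video_length: float) -> VideoAnnotation:
    """Turn a video-level DFDC label into a temporal localization label.

    A fake video becomes one fake segment covering [0, video_length].
    Assumption: a real video has no fake segments at all.
    """
    segments = [[0.0, video_length]] if is_fake else []
    return VideoAnnotation(video_length=video_length, fake_segments=segments)


fake = to_localization_label(True, 10.0)
real = to_localization_label(False, 10.0)
print(fake.fake_segments, real.fake_segments)
```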

liushenme commented 1 year ago

I see. Thanks for your reply.

mowen9 commented 8 months ago

Sorry, I do not understand the two operations described above. The boundary map generated by BA-TFD has size (batch_size, 40, 512). As I understand it, 40 is the number of frames and 512 is the temporal dimension.

I understand that I can obtain the binary label (batch_size, 1) via thresholding by the maximum value when the size of the boundary map is (batch_size, 40). However, how can I get the binary label when the size is (batch_size, 40, 512)?

Could you help me solve this question? Thanks!

mowen9 commented 8 months ago

I see. Thanks for your reply.

How did you solve this question?

ControlNet commented 8 months ago

We do this in 2 ways. The first is simply to take the maximum over the whole (40, 512) boundary map, which gives the video-level prediction.
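A minimal NumPy sketch of this first option, assuming the boundary maps are stacked into an array of shape (batch_size, 40, 512). The function names and the 0.5 threshold are assumptions for illustration only. In practice the threshold would be tuned on a validation set.

```python
import numpy as np


def video_level_scores(boundary_maps: np.ndarray) -> np.ndarray:
    """Reduce (batch_size, 40, 512) boundary maps to one score per video.

    The score is the maximum value over both map dimensions.
    """
    flat = boundary_maps.reshape(boundary_maps.shape[0], -1)
    return flat.max(axis=1)


def video_level_labels(boundary_maps: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Threshold the per-video maximum to get binary labels (1 = fake).

    The default threshold is an assumed placeholder, not a value from the paper.
    """
    return (video_level_scores(boundary_maps) >= threshold).astype(np.int64)


# Toy input: the second video has one high-confidence entry in its map.
maps = np.zeros((2, 40, 512), dtype=np.float32)
maps[1, 3, 100] = 0.9
print(video_level_labels(maps))
```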

The second way is to use the 3-layer MLP. First, run inference on the train and validation sets to get their (40, 512) boundary maps. Then train an MLP with these boundary maps as the input and the video labels as the targets. Finally, use the MLP to infer the video-level predictions on the test set. In effect, you can treat the boundary maps as "features" and train a classifier head on top of them.
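The steps above could be sketched in PyTorch roughly as follows. This is only an illustration under assumptions: the class and function names, the hidden size of 256, the Adam optimizer, the binary cross-entropy loss, and the 0.5 decision threshold are all placeholders and not the authors' actual settings. It expects the boundary maps to have been computed beforehand and stacked into a float tensor of shape (num_videos, 40, 512).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class BoundaryMapClassifier(nn.Module):
    """3-layer MLP that maps a flattened boundary map to a single fake/real logit."""

    def __init__(self, map_shape=(40, 512), hidden_dim=256):
        super().__init__()
        in_dim = map_shape[0] * map_shape[1]
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, boundary_maps):
        # (batch, 40, 512) -> (batch,) logits
        return self.net(boundary_maps).squeeze(-1)


def train_classifier(maps, labels, epochs=5, lr=1e-3, batch_size=32):
    """Fit the MLP on precomputed boundary maps and video-level labels (1 = fake)."""
    model = BoundaryMapClassifier(map_shape=tuple(maps.shape[1:]))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    loader = DataLoader(TensorDataset(maps, labels.float()),
                        batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model


@torch.no_grad()
def predict(model, maps, threshold=0.5):
    """Return binary video-level predictions for precomputed test-set boundary maps."""
    model.eval()
    return (torch.sigmoid(model(maps)) >= threshold).long()
```

A call would look like `model = train_classifier(train_maps, train_labels)` followed by `predict(model, test_maps)`, where `train_maps`, `train_labels`, and `test_maps` are hypothetical tensors produced by running BA-TFD inference on each split.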