erprogs / GenConViT

Deepfake Video Detection Using Generative Convolutional Vision Transformer
GNU General Public License v3.0

Questions about Labels #5

Closed nguyenvulong closed 7 months ago

nguyenvulong commented 7 months ago

Hello, thank you for the great end-to-end approach.

  1. May I ask about the labels in this code? (I am aware of the output, but I'd like to clarify.)

The first value of torch.sigmoid(model(df).squeeze()) is the fake probability, right? For example, if after the sigmoid function we have tensor([[0.0468, 0.9539]], device='cuda:0'), then 0.0468 is the chance of the sample being fake, correct?

def pred_vid(df, model):
    with torch.no_grad():
        y_pred = model(df)
        return max_prediction_value(torch.sigmoid(model(df).squeeze()))
  2. Also, the code is a bit counterintuitive... why do we have to XOR? If the prediction is 0, the label then becomes 1, which is FAKE. Could you please explain?

    def real_or_fake(prediction):
        return {0: "REAL", 1: "FAKE"}[prediction ^ 1]
erprogs commented 7 months ago

Hello @nguyenvulong,

Thank you for your interest in our work!

When the data loader loads the images, the class mapping is {FAKE: 0, REAL: 1}, because the fake folder is the one it loads first. However, for our prediction we output FAKE: 1 and REAL: 0.

Data loading and training: {FAKE: 0, REAL: 1}. Prediction: {FAKE: 1, REAL: 0}. In many deepfake challenges, the fake class is 1.

In our current case (tensor([[0.0468, 0.9539]], device='cuda:0')), max_prediction_value returns 1, and abs(1 - 0.9539) = 0.0461, so the class is real.

Now, to be consistent with how the data loader classifies the images, I used {0: "REAL", 1: "FAKE"}, which I can then easily flip using XOR.

If the prediction is 0, it becomes 1; if the prediction is 1, it becomes 0.

We can write an if/else statement, but I found XOR a nicer way to do it.
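
For illustration, here is a small sketch of how the flip behaves on the example above (it reuses the real_or_fake function quoted earlier; only the surrounding snippet and the comments are mine):

import torch

def real_or_fake(prediction):
    # XOR with 1 flips between the data-loading convention {FAKE: 0, REAL: 1}
    # and the prediction convention {FAKE: 1, REAL: 0}.
    return {0: "REAL", 1: "FAKE"}[prediction ^ 1]

y_pred = torch.tensor([[0.0468, 0.9539]])         # example output after sigmoid
pred_idx = torch.argmax(y_pred.squeeze()).item()  # -> 1
print(real_or_fake(pred_idx))                     # REAL
print(real_or_fake(0))                            # FAKE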

I hope this answers your question.

nguyenvulong commented 7 months ago

Thank you, it's clear now. However, I'd like to discuss max_prediction_value in the case of custom videos (not from the datasets, but collected in the wild).

python prediction.py \
    --p ./test_videos \
    --f 1 \
    --d yours \
    --n ed
def max_prediction_value(y_pred):
    # Finds the index and value of the maximum prediction value.
    mean_val = torch.mean(y_pred, dim=0)
    print(f"mean_val: {mean_val}")
    print(f"y_pred: {y_pred}")
    print(f"y_pred dim: {y_pred.dim()}")
    return (
        torch.argmax(mean_val).item(),
        mean_val[0].item()
        if mean_val[0] > mean_val[1]
        else abs(1 - mean_val[1]).item(),
    )

In this case, the input to max_prediction_value, which is y_pred (i.e. torch.sigmoid(model(df).squeeze()) from pred_vid), has only one dimension (please refer to the values printed below). Therefore, indexing mean_val[0] and mean_val[1] causes the following error.

mean_val: 0.49663394689559937
y_pred: tensor([0.3401, 0.6532], device='cuda:0')
y_pred dim: 1
An error occurred: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

My guess is that in your experiments, y_pred somehow comes out as a batch with two dimensions after torch.sigmoid(model(df).squeeze()) in pred_vid. Maybe that's the case with the datasets [dfdc, faceforensics, timit, celeb], but not with a custom dataset?
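
For reference, here is a minimal reproduction of the shape issue with random tensors (the shapes are my assumption, not the repo's code):

import torch

# 1-D y_pred, as in the printout above: torch.mean over dim=0 gives a 0-dim scalar.
y_pred_1d = torch.sigmoid(torch.randn(2))
mean_val = torch.mean(y_pred_1d, dim=0)
print(mean_val.dim())  # 0 -> indexing mean_val[0] raises the "invalid index of a 0-dim tensor" error

# 2-D y_pred (a batch): the mean keeps one value per class and indexing works.
y_pred_2d = torch.sigmoid(torch.randn(15, 2))
mean_val = torch.mean(y_pred_2d, dim=0)
print(mean_val.dim())  # 1 -> mean_val[0] and mean_val[1] work as intended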

nguyenvulong commented 7 months ago

Update: I figured out why it happened: the number of frames is 1 (with a single frame, the model output has shape [1, 2], so .squeeze() collapses it to a 1-D tensor of two elements). Here is updated code that handles both cases, number of frames = 1 and > 1. I will create a PR if you don't mind.

def max_prediction_value(y_pred):
    if y_pred.dim() == 1 and y_pred.size(0) == 2:
        # When y_pred is a 1-D tensor with two elements, no need to take the mean
        pred_label = torch.argmax(y_pred).item()
        pred_val = y_pred[pred_label].item()
        return (pred_label, pred_val)
    else:
        # Compute the mean value across the batch dimension (dim=0)
        mean_val = torch.mean(y_pred, dim=0)
        # Still, check that mean_val is not a 0-dimensional tensor, just to be safe
        if mean_val.dim() == 0:
            mean_val_val = mean_val.item()
            return (0, mean_val_val) if mean_val_val > 0.5 else (1, 1 - mean_val_val)
        # Here mean_val is a 1-D tensor with one value per class
        pred_label = torch.argmax(mean_val).item()
        pred_val = mean_val[pred_label].item()
        return (pred_label, pred_val)
erprogs commented 7 months ago

Hello @nguyenvulong,

Thank you for finding the issue with the number of frames being 1, and for your PR. I've checked the code, and it indeed fails when the number of frames is 1.

I've seen your code, and thank you for your effort. However, I have a simple workaround: check whether mean_val.numel() is 1. If mean_val.numel() is 1, use y_pred directly, since it is a single prediction rather than a batch. If mean_val.numel() is greater than 1, continue as before.

Then the rest of the code logic remains the same.

def max_prediction_value(y_pred):
    # Finds the index and value of the maximum prediction value.
    mean_val = torch.mean(y_pred, dim=0)

    # If mean_val has a single element, y_pred was a single prediction rather than a batch
    if mean_val.numel() == 1:
        mean_val = y_pred

    return (
        torch.argmax(mean_val).item(),
        mean_val[0].item()
        if mean_val[0] > mean_val[1]
        else abs(1 - mean_val[1]).item(),
    )
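
As a quick sanity check (an illustrative snippet with random values, not the repo's test code), the updated function handles both shapes:

import torch

one_frame = torch.sigmoid(torch.randn(2))        # 1-D y_pred, as produced when frames = 1
many_frames = torch.sigmoid(torch.randn(15, 2))  # 2-D y_pred, as produced when frames > 1

print(max_prediction_value(one_frame))    # runs without the 0-dim indexing error
print(max_prediction_value(many_frames))  # batch behaviour unchanged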

I've tested the updated code with 1 frame using the sample videos. The screenshot below shows the prediction using 1 frame.

(screenshot: prediction output using 1 frame)

Thank you again!

nguyenvulong commented 7 months ago

Thanks so much. I will close the PR and this question for now!