ControlNet / AV-Deepfake1M

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
https://arxiv.org/abs/2311.15308

Metadata audio frame count does not match the actual number of audio frames #7

Closed: MKlmt closed this issue 2 months ago

MKlmt commented 2 months ago

First of all, thanks for the great work and the exciting competition. When loading the data, I noticed a slight mismatch between the number of audio frames listed in the metadata and the number of audio frames returned by torchvision.io.read_video(). This only happens when the audio is fake; for real audio, the two counts match. For me, the minimal sample below prints:

pytorch: torch.Size([1, 112640])
metadata: 111680

Code:

import json

from torchvision.io import read_video

video_path = "<path to dataset>/DeepFake_1M/train/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.mp4"
video_metadata_path = "<path to dataset>/DeepFake_1M/train_metadata/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.json"

# read_video returns (video frames, audio frames, info dict); audio has shape (channels, samples)
frames, audio, info = read_video(video_path, pts_unit="sec", output_format="TCHW")
print("pytorch:", audio.size())

# Compare against the audio frame count stored in the metadata file
with open(video_metadata_path, "r") as f:
    metadata = json.load(f)
print("metadata:", metadata["audio_frames"])

Versions: torch 2.3.0, torchvision 0.18.0

I hope you can help me understand this mismatch. Thank you.

ControlNet commented 2 months ago

The metadata is generated as a quick reference, so that you can look up basic information without loading the video file. For development, please rely on the frame count of the audio as actually loaded from the file.
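One way to follow this advice is sketched below: treat the decoded length as authoritative and use the metadata value only as a sanity check. The helper names `frame_mismatch` and `resolve_audio_frames`, and the `tolerance` default of 1024 samples, are hypothetical choices for illustration and are not part of the dataset tooling.

```python
def frame_mismatch(decoded_frames: int, metadata_frames: int) -> int:
    """Return how many more audio frames were decoded than the metadata lists."""
    return decoded_frames - metadata_frames


def resolve_audio_frames(decoded_frames: int, metadata_frames: int,
                         tolerance: int = 1024) -> int:
    """Pick the audio frame count to use downstream.

    The decoded count always wins. The metadata count is only used to flag
    files whose mismatch exceeds `tolerance` (an assumed threshold).
    """
    diff = frame_mismatch(decoded_frames, metadata_frames)
    if abs(diff) > tolerance:
        raise ValueError(
            f"decoded={decoded_frames} and metadata={metadata_frames} "
            f"differ by {diff} frames, more than tolerance={tolerance}"
        )
    return decoded_frames


# Values reported in this issue: audio.size(-1) from read_video vs. metadata["audio_frames"]
print(frame_mismatch(112640, 111680))        # 960
print(resolve_audio_frames(112640, 111680))  # 112640
```

In a data loader, `decoded_frames` would be `audio.size(-1)` from the snippet in the original report, and any label or timestamp arrays would be sized from the returned value rather than from `metadata["audio_frames"]`.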

MKlmt commented 2 months ago

Alright, thank you.