First of all, thanks for the great work and exciting competition.
When loading the data, I noticed a slight mismatch between the number of audio frames given in the metadata and the number of audio frames returned by torchvision.io.read_video(). This happens only when the audio is fake; for real audio, the two counts match.
The minimal example below prints the following for me:
pytorch: torch.Size([1, 112640]) metadata: 111680
Code:
import json

from torchvision.io import read_video

video_path = "<path to dataset>/DeepFake_1M/train/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.mp4"
video_metadata_path = "<path to dataset>/DeepFake_1M/train_metadata/id06744/_c3CCbnZEbU/00011/real_video_fake_audio.json"

# read_video returns (video frames, audio frames, info dict)
frames, audio, info = read_video(video_path, pts_unit="sec", output_format="TCHW")
print("pytorch:", audio.size())

# Compare against the audio frame count stored in the metadata file
with open(video_metadata_path, "r") as f:
    metadata = json.load(f)
print("metadata:", metadata["audio_frames"])
Versions:
torch 2.3.0
torchvision 0.18.0
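One observation that may help narrow this down: the decoded length is an exact multiple of 1024, while the metadata value is not. The sketch below only does arithmetic on the two numbers reported above. The 1024-sample block size is my assumption (it is the usual AAC frame size; I have not verified which codec the fake-audio files use), and `pad_to_block` is a hypothetical helper written for this check, not part of torchvision.

```python
import math

# Assumption: the audio is stored in fixed blocks of 1024 samples
# (typical for AAC). This is not confirmed for the dataset.
BLOCK_SIZE = 1024

decoded_samples = 112640   # from audio.size() in the example above
metadata_samples = 111680  # from metadata["audio_frames"] in the example above


def pad_to_block(num_samples: int, block_size: int) -> int:
    """Round a sample count up to the next multiple of block_size (hypothetical helper)."""
    return math.ceil(num_samples / block_size) * block_size


difference = decoded_samples - metadata_samples
padded = pad_to_block(metadata_samples, BLOCK_SIZE)

print("difference:", difference)
print("decoded is a multiple of the block size:", decoded_samples % BLOCK_SIZE == 0)
print("metadata rounded up to a full block:", padded)
```

If the rounded-up metadata value equals the decoded length, the extra samples could simply be padding added when the fake audio was re-encoded, but that is a guess on my side.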
I hope you can help me understand this mismatch. Thank you!