Closed muhammadarslanshahzad closed 1 year ago
Please crop/align the facial video following the VoxCeleb2 preprocessing method. For the metadata, please use the following field descriptions to build the JSON file yourself.
- `file`: str. The file path.
- `n_fakes`: int. The number of fake segments. It should be 1 for classification.
- `fake_periods`: list[list[float]]. A list of fake segments' start and end timestamp pairs. For example, [[2.5, 2.78], [3.1, 3.23]] means there are 2 fake segments: one from 2.5 seconds to 2.78 seconds, the other from 3.1 seconds to 3.23 seconds. The length of this list should match `n_fakes`. It should be [[0, duration]] for classification, as we consider the whole video to be a fake segment from the beginning to the end.
- `duration`: float. The length of the video in seconds.
- `original`: str. The original real video of a fake video. Set to null for a real video.
- `modify_video`: bool. Whether the visual modality is modified.
- `modify_audio`: bool. Whether the audio modality is modified.
- `split`: The train/dev/test split.
- `video_frames`: The number of frames for this sample.
- `audio_channels`: The number of audio channels: 1 for mono and 2 for stereo.
- `audio_frames`: The length of the audio waveform array.
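Putting the fields together, here is a minimal sketch of building one metadata entry for single-video classification. The helper name, file paths, and the choice to mark both modalities as modified for a fake video are my own assumptions for illustration, not part of the repo:

```python
import json

def build_classification_entry(file_path, duration, video_frames,
                               audio_frames, fake, original=None,
                               split="test", audio_channels=1):
    """Hypothetical helper: build one metadata entry following the
    field descriptions above. For classification, a fake video is
    treated as a single fake segment spanning the whole duration."""
    return {
        "file": file_path,
        "n_fakes": 1 if fake else 0,
        "fake_periods": [[0, duration]] if fake else [],
        "duration": duration,
        "original": original,       # null (None) for a real video
        "modify_video": fake,       # assumption: both modalities modified
        "modify_audio": fake,
        "split": split,
        "video_frames": video_frames,
        "audio_channels": audio_channels,
        "audio_frames": audio_frames,
    }

# Example values for a hypothetical 4.2 s clip at 25 fps / 16 kHz audio.
entry = build_classification_entry(
    "test/fake_000.mp4", duration=4.2, video_frames=105,
    audio_frames=67200, fake=True, original="test/real_000.mp4")

with open("metadata.json", "w") as f:
    json.dump([entry], f, indent=2)
```

For a real video, pass `fake=False` so that `n_fakes` is 0, `fake_periods` is empty, and `original` stays null.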
I need help with how to preprocess the video and generate the meta file, so I can use the batfd model for single-video inference and perform the classification.
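For the audio-side fields (`audio_channels`, `audio_frames`, `duration`), the standard-library `wave` module is enough once the audio track has been extracted from the video. A self-contained sketch, with a synthetic 1-second 16 kHz mono file standing in for the real extracted track:

```python
import wave

SR = 16000  # assumption: 16 kHz mono audio

# Create a tiny synthetic wav (1 second of silence) as a stand-in
# for the audio track you would extract from the video.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(SR)
    w.writeframes(b"\x00\x00" * SR)

# Read back the properties the metadata JSON needs.
with wave.open("sample.wav", "rb") as w:
    audio_channels = w.getnchannels()            # 1 = mono, 2 = stereo
    audio_frames = w.getnframes()                # length of the waveform array
    duration = audio_frames / w.getframerate()   # seconds
```

For `video_frames`, the analogous properties can be read from the video file itself (for instance via OpenCV's `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS` on a `cv2.VideoCapture`), but whether those counts match what the model expects after VoxCeleb2-style cropping is something to verify against the repo's own preprocessing.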