ControlNet / LAV-DF

[CVIU] Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
https://www.sciencedirect.com/science/article/pii/S1077314223001984

Frame-level processing #12

Closed javadmozaffari closed 3 months ago

javadmozaffari commented 9 months ago

Hello,

In this new Temporal Forgery Localization model, the entire video is used as input. The existing model proposed by the authors has shown promise in achieving accurate results. However, taking the whole video as input may pose a challenge in terms of memory consumption, especially for large datasets or videos with high-resolution frames. Would it be possible to modify the Temporal Forgery Localization model to accept individual frames instead of the entire video? This would reduce the amount of RAM required.

ControlNet commented 8 months ago

Hi, sorry for the late reply.

I think it might be hard, because the boundary matching mechanism requires all frames as input. But to save memory, I think you can try 2 ways to reduce the temporal size.

  1. Sample frames with a stride for each video. For example, only use the 1st, 3rd, 5th, 7th, ... frames, so you will have fewer frames.
  2. Interpolate the temporal axis to a fixed length for each video. For example, no matter how long the video is, resize it to 100 frames.

But the pretrained model was not trained with this preprocessing, so it might not perform well if you use either approach for evaluation.
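The two preprocessing options above can be sketched in PyTorch as follows. This is a minimal illustration, not code from the repository: it assumes the video is already loaded as a tensor of shape `(T, C, H, W)`, and the function names `strided_sample` and `resize_temporal` are hypothetical.

```python
import torch
import torch.nn.functional as F

def strided_sample(frames: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Option 1: keep every `stride`-th frame along the temporal axis (dim 0)."""
    return frames[::stride]

def resize_temporal(frames: torch.Tensor, target_len: int = 100) -> torch.Tensor:
    """Option 2: interpolate the temporal axis to a fixed length.

    Expects frames of shape (T, C, H, W); returns (target_len, C, H, W).
    """
    t, c, h, w = frames.shape
    # F.interpolate resizes the last dim of a (N, C, L) tensor,
    # so move the temporal axis to the end before interpolating.
    x = frames.reshape(t, -1).T.unsqueeze(0)  # (1, C*H*W, T)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.squeeze(0).T.reshape(target_len, c, h, w)

video = torch.randn(250, 3, 96, 96)        # a dummy 250-frame clip
short = strided_sample(video, stride=2)    # -> (125, 3, 96, 96)
fixed = resize_temporal(video, 100)        # -> (100, 3, 96, 96)
```

Either output can then be fed to the model in place of the full-length video; as noted above, the released checkpoint was not trained with this preprocessing, so scores may degrade.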