SJTUwxz / LoCoNet_ASD

code repo for LoCoNet: Long-Short Context Network for Active Speaker Detection

Applying the model to new video data #2

Open plnguyen2908 opened 2 months ago

plnguyen2908 commented 2 months ago

Hi,

I am really interested in your model. However, I do not know how to apply it to new video data. Could you help me with a few questions?

  1. I currently have an mp4 file, which I have extracted into frames and mel-spectrogram data. However, the timestamps of the video and audio do not match; do I need to align them?
  2. For the face crops, which model can I use, and do they need to have the same dimensions after cropping? From your paper, they need to be H x W.
  3. After obtaining the face crops and audio data, do I need to do any further preprocessing? If so, are there any notes on how to preprocess the two inputs?
  4. For the inference step, if I am not wrong, I need to load the loconet model from loconet.py with the weights and run it similarly to the evaluate_network function, right?

I hope you can answer my questions, or even better, create a new .ipynb file with an example of how to run the model on a new video.

SJTUwxz commented 1 month ago

Hi! Thank you for your interest in our work!

  1. Yes, the audio and the video need to be aligned temporally to be used as input to the model (a minimal alignment sketch is included after this list).
  2. I am using the face crop annotations released with the datasets (e.g. AVA-ActiveSpeaker). For an arbitrary video, you can use any face detection method, such as RetinaFace. The face crops need to be resized to the same dimensions during preprocessing so that they can be concatenated into face tracks.
  3. I think you need to generate face tracks from the face crops. The audio only needs to be converted into mel-spectrogram features, which are used directly as input. To generate the face track of a single speaker, you can refer to TalkNet's pipeline for inference on arbitrary videos (see the face-track sketch below).
  4. During inference, you can load our released checkpoint trained on the AVA-ActiveSpeaker dataset. After preprocessing your video to obtain the face tracks and the audio mel-spectrogram, feed them to the model to get the prediction scores for speaking activity (a generic loading and scoring sketch is shown at the end).
  5. We don't have plans to release inference code for arbitrary videos right now, but I will keep you posted when we release a demo version later.
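
For the alignment and mel-spectrogram points above, here is a minimal preprocessing sketch. It assumes a TalkNet-style setup: frames decoded at 25 fps, audio resampled to 16 kHz mono, and log-mel filterbank features with a 25 ms window and 10 ms hop, so 4 audio feature frames line up with each video frame. The frame rate, sample rate, file layout, and feature parameters are assumptions, not values confirmed by this repo; match them to the dataloader.

```python
# Sketch: align audio and video by decoding both from the same mp4 at fixed
# rates (assumed: 25 fps video, 16 kHz mono audio), then compute mel features.
# All parameter choices here are assumptions; match them to the repo's dataloader.
import os
import subprocess
from scipy.io import wavfile
import python_speech_features

video_path = "input.mp4"  # hypothetical input file
os.makedirs("frames", exist_ok=True)

# Decode frames at a fixed 25 fps so frame i corresponds to time i / 25 s.
subprocess.run(["ffmpeg", "-y", "-i", video_path, "-r", "25", "-qscale:v", "2",
                "frames/%06d.jpg"], check=True)

# Extract 16 kHz mono audio from the same file so both streams share one clock.
subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
                "-ar", "16000", "audio.wav"], check=True)

sample_rate, audio = wavfile.read("audio.wav")

# 25 ms window / 10 ms hop -> 100 feature frames per second,
# i.e. exactly 4 audio feature frames per video frame at 25 fps.
mel_feat = python_speech_features.logfbank(audio, samplerate=sample_rate,
                                           winlen=0.025, winstep=0.010, nfilt=40)
print(mel_feat.shape)  # (num_audio_frames, 40)
```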
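
For the face crops and tracks, the sketch below crops detected faces frame by frame, resizes them to a common resolution, and stacks them into a single track. The 112 x 112 grayscale crop size is an assumption, and the Haar-cascade detector is used only to keep the example dependency-light; RetinaFace, as mentioned above, is a better choice in practice.

```python
# Sketch: turn per-frame face detections into one face-track array.
# The 112x112 grayscale target size is an assumption; check the dataloader.
import glob
import cv2
import numpy as np

CROP_SIZE = 112  # assumed H = W target resolution

# Haar cascade keeps the sketch dependency-light; a stronger detector such as
# RetinaFace is preferable for real use.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

track = []
for frame_path in sorted(glob.glob("frames/*.jpg")):
    frame = cv2.imread(frame_path)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        continue  # a real pipeline would interpolate or smooth missing detections
    x, y, w, h = faces[0]          # keep one speaker per track for simplicity
    crop = gray[y:y + h, x:x + w]  # grayscale crop (assumption)
    track.append(cv2.resize(crop, (CROP_SIZE, CROP_SIZE)))

face_track = np.stack(track)  # (num_frames, CROP_SIZE, CROP_SIZE)
print(face_track.shape)
```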
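
Finally, for loading the checkpoint, the skeleton below shows only the generic PyTorch pattern (build model, load state dict, eval mode, no-grad forward). `ASDModel`, the checkpoint filename, the forward-argument order, and the input shapes are all hypothetical stand-ins; the authoritative code paths are loconet.py and the evaluate_network routine mentioned in the question.

```python
# Sketch: generic load-and-score pattern. ASDModel is a hypothetical stand-in
# for the real model in loconet.py; mirror evaluate_network for the real code.
import torch
import torch.nn as nn

class ASDModel(nn.Module):
    """Stand-in for the real LoCoNet model (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(40, 64)          # assumed mel feature dim
        self.video_proj = nn.Linear(112 * 112, 64)   # assumed crop size
        self.head = nn.Linear(128, 1)

    def forward(self, audio_feat, face_track):
        # audio_feat: (B, 4T, 40), face_track: (B, T, 112, 112); crudely
        # subsample audio to the video rate (the real model handles this itself).
        a = self.audio_proj(audio_feat[:, ::4, :])
        v = self.video_proj(face_track.flatten(2))
        return self.head(torch.cat([a, v], dim=-1)).squeeze(-1)  # per-frame score

model = ASDModel()
# state_dict = torch.load("released_checkpoint.model", map_location="cpu")  # hypothetical filename
# model.load_state_dict(state_dict, strict=False)  # strict=False tolerates key-prefix mismatches
model.eval()

audio_feat = torch.randn(1, 100, 40)        # 1 s of audio features (dummy)
face_track = torch.randn(1, 25, 112, 112)   # 1 s of face crops at 25 fps (dummy)
with torch.no_grad():
    scores = model(audio_feat, face_track)  # (1, 25) per-frame speaking scores
print(scores.shape)
```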

Thank you and please let me know if you have further questions!