choijeongsoo / av2av

[CVPR 2024] AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
MIT License

Will this work for multiple people in the same frame? #6

Open ChandanVerma opened 1 month ago

ChandanVerma commented 1 month ago

I wanted to enquire whether this will work for multiple people talking in the same frame. If yes, how do I produce the .lip video and the .bbox.pkl for the original video?

choijeongsoo commented 1 month ago

Hello, thank you for your interest in our work.

We followed the preprocessing steps in Auto-AVSR.

  1. Use a face detection model (RetinaFace) to predict the bounding box of each face.

    • The .bbox.pkl file stores the bounding boxes.
    • This file is used by both the landmark detection model and the face renderer.
    • Here, we can detect multiple faces in the video.
  2. Use a landmark detection model to predict facial landmarks, transform the input video, and extract the mouth ROI.

    • The .lip.mp4 file is the mouth ROI video.
    • Here, we can run the landmark detection model once for each bounding box.

Then we can get a face and lip video for each person.
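
As a rough sketch of how the multi-person case could feed this pipeline: given the per-frame detections from step 1, the snippet below splits them into one bbox sequence per person and pickles each one separately. The left-to-right ordering heuristic and the exact .bbox.pkl layout (one box or None per frame) are assumptions on my side, not the format guaranteed by the Auto-AVSR scripts.

```python
import pickle

def split_per_person(per_frame_boxes, num_people):
    """Split per-frame face detections into one bbox sequence per person.

    per_frame_boxes: one entry per video frame, each entry a list of
    [x1, y1, x2, y2] boxes from the face detector (e.g. RetinaFace).
    Assumes speakers do not swap horizontal positions, so sorting the
    boxes left to right gives a consistent person ordering.
    """
    tracks = [[] for _ in range(num_people)]
    for boxes in per_frame_boxes:
        boxes = sorted(boxes, key=lambda b: b[0])  # left-to-right order
        for i in range(num_people):
            # None marks a frame where this person's face was not detected,
            # keeping every track the same length as the video.
            tracks[i].append(boxes[i] if i < len(boxes) else None)
    return tracks

def save_tracks(per_frame_boxes, num_people, prefix="person"):
    # Write one .bbox.pkl per person so each can go through the landmark
    # detection / mouth-ROI steps independently.
    for i, track in enumerate(split_per_person(per_frame_boxes, num_people)):
        with open(f"{prefix}{i}.bbox.pkl", "wb") as f:
            pickle.dump(track, f)
```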

Additionally, since the av2unit model is trained on noisy speech paired with the corresponding lip video, I think it can handle mixed speech from multiple speakers as long as it is given the lip video of each person.

ChandanVerma commented 1 month ago

Thanks a lot for the response, and I really appreciate you open-sourcing the project. I did try retinanet to detect multiple faces and prepared a .bbox file and a lip video with just the lip landmarks of multiple people. Unfortunately, it seems to expect the .bbox file and the lip video to have the same number of frames. Does this mean that:

  1. Each element in the bbox file should be a list of lists indicating the various mouth landmarks in that particular frame?
  2. The lip video should have exactly the same dimensions as the original video, with just the talking mouths on a black background?
choijeongsoo commented 3 weeks ago
  1. When multiple faces are present in a frame, there will be several detected bboxes in that frame. I think you need to rearrange the sequence of bboxes so that each face has a single corresponding bbox in each frame. Also, if the face detector fails to detect any face in a frame, it will not return a bbox for it. You can interpolate the missing bbox or leave it as 'None' to keep the sequence aligned with the number of frames (see the sketch after this list).

  2. After preprocessing, the lip video will have a spatial size of 96 x 96 pixels and won't include a black background. You can find an example of this in the samples/en directory.
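
For point 1, here is a minimal sketch of one way to rearrange and fill the bbox sequence: each frame's detections are assigned to a person by proximity to that person's last known box, and frames where the detector returned nothing are filled by linear interpolation (or left as None). The per-frame list-of-boxes layout is again an assumption, not necessarily the exact structure the preprocessing code expects.

```python
import numpy as np

def center(box):
    """Center point of an [x1, y1, x2, y2] box."""
    return np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])

def assign_to_person(per_frame_boxes, init_box):
    """For each frame, pick the detection closest to this person's last
    known box; frames with no usable detection get None."""
    track, last = [], init_box
    for boxes in per_frame_boxes:
        if boxes:
            best = min(boxes, key=lambda b: np.linalg.norm(center(b) - center(last)))
            track.append(best)
            last = best
        else:
            track.append(None)  # detector failed on this frame
    return track

def interpolate_missing(track):
    """Linearly interpolate None entries between the nearest detected boxes
    so the bbox sequence keeps exactly one entry per frame."""
    known = [i for i, b in enumerate(track) if b is not None]
    if not known:
        return track  # nothing detected at all; nothing to interpolate
    for i, b in enumerate(track):
        if b is None:
            prev = max((k for k in known if k < i), default=None)
            nxt = min((k for k in known if k > i), default=None)
            if prev is None or nxt is None:
                # At the start or end of the clip, just copy the nearest box.
                track[i] = track[prev if prev is not None else nxt]
            else:
                w = (i - prev) / (nxt - prev)
                track[i] = [(1 - w) * p + w * n for p, n in zip(track[prev], track[nxt])]
    return track
```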