[Open] ChandanVerma opened this issue 1 month ago
Hello, thank you for your interest in our work.
We followed the preprocessing steps in Auto-AVSR:

1. A face detection model (RetinaFace) predicts the bounding box of the face; the `.bbox.pkl` file stores these bounding boxes.
2. A landmark detection model predicts facial landmarks, the input video is transformed, and the mouth ROI is extracted; the `.lip.mp4` file is the mouth ROI video.

This gives us the face and lip video of each person.
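For illustration, a minimal sketch of the alignment invariant described above (the exact on-disk pickle layout is my assumption, and `make_bbox_record` is a hypothetical helper, not the repo's actual code): the `.bbox.pkl` holds one entry per video frame, with `None` where detection failed.

```python
import pickle

import numpy as np


def make_bbox_record(per_frame_boxes):
    """Serialize one bbox per frame.

    per_frame_boxes: list with one [x1, y1, x2, y2] bbox per frame,
    or None for frames where the detector found nothing, so that
    len(per_frame_boxes) == number of video frames.
    """
    return pickle.dumps(per_frame_boxes)


# Hypothetical 3-frame clip where detection failed on frame 1.
bboxes = [np.array([80, 60, 240, 220]), None, np.array([82, 61, 242, 221])]
blob = make_bbox_record(bboxes)
loaded = pickle.loads(blob)
print(len(loaded))  # 3, matching the number of frames
```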
Additionally, since the av2unit model is trained on noisy speech and the corresponding lip video, I think it can handle mixed speech from multiple speakers when given the lip video of each person.
Thanks a lot for the response, and I really appreciate you open-sourcing the project. I did try to use RetinaNet to detect multiple faces and prepared a `.bbox` file and a lip video for the lip landmarks of multiple people. But unfortunately, it seems the pipeline expects the `.bbox` file and the lip video to have the same number of frames. Does that mean each element in the bbox file should be a list of lists, indicating the various mouth landmarks in that particular frame?
When multiple faces are present in a frame, there will be several bboxes detected in that frame. You need to rearrange the sequence of bboxes so that each face has a single corresponding bbox in each frame. Also, if the face detector fails to detect any face in a frame, it will not return a bbox; you can interpolate the missing bbox or leave it as `None` to stay aligned with the number of frames.
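A rough sketch of that rearranging/interpolation step, using greedy IoU matching against each face's last known bbox (the function names `assign_tracks` and `interpolate_track` and the matching strategy are my assumptions, not the repo's actual code):

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def assign_tracks(detections_per_frame, num_faces, iou_thresh=0.3):
    """Rearrange per-frame detection lists into num_faces tracks,
    one bbox (or None) per frame per face, by greedily matching each
    detection to the face whose last bbox it overlaps most."""
    n = len(detections_per_frame)
    tracks = [[None] * n for _ in range(num_faces)]
    last = [None] * num_faces
    for t, dets in enumerate(detections_per_frame):
        dets = list(dets)
        for k in range(num_faces):
            if not dets:
                break
            if last[k] is None:
                best = 0  # first sighting: take any remaining detection
            else:
                ious = [iou(last[k], d) for d in dets]
                best = int(np.argmax(ious))
                if ious[best] < iou_thresh:
                    continue  # no plausible match: leave None this frame
            tracks[k][t] = dets.pop(best)
            last[k] = tracks[k][t]
    return tracks


def interpolate_track(track):
    """Linearly fill None gaps between detected bboxes; edges are
    filled by repeating the nearest detection."""
    track = list(track)
    idx = [i for i, b in enumerate(track) if b is not None]
    for i, b in enumerate(track):
        if b is not None or not idx:
            continue
        prev = max((j for j in idx if j < i), default=None)
        nxt = min((j for j in idx if j > i), default=None)
        if prev is not None and nxt is not None:
            w = (i - prev) / (nxt - prev)
            track[i] = (1 - w) * np.asarray(track[prev]) + w * np.asarray(track[nxt])
        elif prev is not None:
            track[i] = np.asarray(track[prev])
        else:
            track[i] = np.asarray(track[nxt])
    return track


# Hypothetical 3-frame clip with two faces; face 0 is missed in frame 1.
detections = [
    [[0, 0, 10, 10], [100, 0, 110, 10]],   # frame 0: both detected
    [[101, 1, 111, 11]],                   # frame 1: face 0 missed
    [[1, 1, 11, 11], [102, 2, 112, 12]],   # frame 2: both detected
]
tracks = assign_tracks(detections, num_faces=2)
filled = interpolate_track(tracks[0])
print(tracks[0][1], filled[1])  # None, then the interpolated bbox
```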
After preprocessing, the lip video will have a spatial size of 96 x 96 pixels and won't include a black background. You can find an example of this in the `samples/en` directory.
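To illustrate one way the "no black background" property can hold: the 96 x 96 crop window is clamped to the frame borders rather than padded, so every output pixel comes from the source frame (the function name `crop_mouth_roi` and the clamping behavior are my assumptions, not necessarily what the repo does):

```python
import numpy as np


def crop_mouth_roi(frame, center, size=96):
    """Crop a size x size patch centered on the mouth, clamping the
    window to the frame borders so no black padding is needed."""
    h, w = frame.shape[:2]
    cx, cy = int(round(center[0])), int(round(center[1]))
    x1 = min(max(cx - size // 2, 0), w - size)
    y1 = min(max(cy - size // 2, 0), h - size)
    return frame[y1:y1 + size, x1:x1 + size]


frame = np.zeros((480, 640, 3), dtype=np.uint8)
roi_center = crop_mouth_roi(frame, center=(320, 400))
roi_corner = crop_mouth_roi(frame, center=(5, 5))  # near the border
print(roi_center.shape, roi_corner.shape)  # both (96, 96, 3)
```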
Wanted to enquire if this will work for multiple people talking in the same frame. If yes, how do I produce the `.lips` video and the `.pkl` for the original video?