Closed — hsato1 closed this issue 1 year ago
Hi,
Yes, it is possible, but you would need to make the graph construction online. Specifically, you can create graphs on the fly by integrating l.186-223 of data_loader.py into the inference loop. In this case, a larger number of nodes (numv) results in higher latency, so there will be a trade-off.
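To make the idea concrete, here is a minimal sketch of what an online loop could look like. Note this is not the actual logic from l.186-223 of data_loader.py: the window size, feature dimension, and temporal edge rule below are all hypothetical placeholders, and the stand-in loop just illustrates where graph construction would sit relative to the stream.

```python
from collections import deque

# Hypothetical sketch of on-the-fly graph construction for streaming
# inference. WINDOW plays the role of numv: more nodes kept in the
# graph means more edges to build and a slower GNN forward pass.
WINDOW = 30   # max nodes kept in the rolling graph (illustrative)
TAU = 3       # connect nodes at most TAU frames apart (illustrative)

def build_edges(timestamps, tau=TAU):
    """Undirected edges between nodes whose frames are within tau."""
    edges = []
    for i, ti in enumerate(timestamps):
        for j, tj in enumerate(timestamps):
            if i != j and abs(ti - tj) <= tau:
                edges.append((i, j))
    return edges

nodes = deque(maxlen=WINDOW)   # rolling buffer of (frame_idx, feature)
for frame_idx in range(100):   # stand-in for the live video stream
    feature = [0.0] * 128      # stand-in for a per-face audio-visual feature
    nodes.append((frame_idx, feature))
    edge_index = build_edges([t for t, _ in nodes])
    # ... run the graph model on (features, edge_index) here ...
```

The quadratic pair scan in `build_edges` is what makes latency grow with the node count; a real implementation would likely exploit the temporal ordering to build edges incrementally instead.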
Thank you, Kyle
That makes sense!
Thank you so much for your response!
Hello again,
I wanted to clarify real-time inference a bit further. Consistent real-time inference is possible under the assumption that we can detect faces and crop the facial regions properly for each incoming frame in a video stream (or, per the paper, for 11 consecutive frames of cropped faces)? Then we need to encode the cropped images and the corresponding audio using the 2D ResNet with TSM, correct? And that encoding step requires additional computational power and time?
Thank you so much, Hiro
Hi Hiro,
Yes, all of your assumptions are correct. Our code assumes that the face bounding boxes and their initial audio-visual features are computed by other models. I hope this clarifies your questions!
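For readers following along, the per-frame pipeline described above can be sketched roughly as below. Everything here is illustrative: `detect_and_crop` and `encode_clip` are hypothetical stand-ins for the external face detector and the 2D ResNet-with-TSM encoder, and the placeholder frame/audio data is not real.

```python
from collections import deque

CLIP_LEN = 11  # consecutive face crops per encoding step, as discussed above

def detect_and_crop(frame):
    """Stand-in for an external face detector + cropper."""
    return frame  # hypothetical: would return the cropped face region

def encode_clip(face_crops, audio_chunk):
    """Stand-in for the 2D ResNet-with-TSM audio-visual encoder."""
    return [len(face_crops), len(audio_chunk)]  # dummy feature vector

face_buffer = deque(maxlen=CLIP_LEN)  # rolling window of face crops
features = []
for frame_idx in range(25):           # stand-in for the video stream
    frame = f"frame-{frame_idx}"      # placeholder frame data
    face_buffer.append(detect_and_crop(frame))
    if len(face_buffer) == CLIP_LEN:  # wait until 11 crops are buffered
        audio_chunk = [0.0] * 4       # placeholder time-aligned audio
        features.append(encode_clip(list(face_buffer), audio_chunk))
# 'features' would then become the node features for the graph model
```

The point of the rolling buffer is that after an initial 11-frame warm-up, an encoding can be produced for every new frame, which is where the extra per-frame compute cost mentioned above comes from.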
Best regards, Kyle
Hello @hsato1, did you manage to make it work in real time? Thank you so much
Thank you so much for such a wonderful paper!
I am working on exploring active speaker detection in real time, came across this paper and repo, and wanted to ask a question: is it possible to do online inference of active speakers with this approach on a live video stream?
Thank you so much!