SRA2 / SPELL

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection (ECCV 2022)
MIT License

Question: Online inference #4

Closed hsato1 closed 1 year ago

hsato1 commented 1 year ago

Thank you so much for such a wonderful paper!

I am exploring active speaker detection for real-time use and came across this paper and repo, and I wanted to ask a question. Is it possible to do online inference of active speakers with this approach on a live video stream?

Thank you so much!

kylemin commented 1 year ago

Hi,

Yes, it is possible, but you would need to make the graph construction online. Specifically, you can create graphs on the fly by integrating lines 186-223 of data_loader.py into the inference loop. In this case, a larger number of nodes (numv) results in higher latency, so there will be a trade-off.
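For illustration, here is a minimal sketch of what such on-the-fly graph construction could look like. It is a hypothetical simplification, not the actual code from data_loader.py: it keeps a rolling buffer of at most `numv` node features and connects nodes whose timestamps are within a temporal window, mirroring the latency trade-off Kyle mentions.

```python
# Hypothetical sketch of online graph construction for SPELL-style inference.
# The names numv and time_span echo the repo, but the logic here is a
# simplified illustration, not the actual data_loader.py implementation.
from collections import deque
import numpy as np

class OnlineGraphBuilder:
    def __init__(self, numv=2000, time_span=0.9):
        # numv: max nodes kept in the rolling buffer (more nodes -> higher latency)
        self.buffer = deque(maxlen=numv)  # each item: (timestamp, feature vector)
        self.time_span = time_span        # max temporal gap (seconds) for an edge

    def add_node(self, timestamp, feature):
        # Called once per incoming face feature; old nodes fall off automatically
        self.buffer.append((timestamp, feature))

    def build(self):
        # Stack node features and connect temporally close node pairs
        times = np.array([t for t, _ in self.buffer])
        feats = np.stack([f for _, f in self.buffer])
        src, dst = [], []
        for i in range(len(times)):
            for j in range(len(times)):
                if i != j and abs(times[i] - times[j]) <= self.time_span:
                    src.append(i)
                    dst.append(j)
        return feats, np.array([src, dst])
```

In a live loop you would call `add_node` for each new face feature, then `build` and run the GNN on the resulting graph; lowering `numv` shrinks the graph and the per-step latency at the cost of temporal context.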

Thank you, Kyle

hsato1 commented 1 year ago

That makes sense!

Thank you so much for your response!

hsato1 commented 1 year ago

Hello again,

I wanted to clarify real-time inference further. Consistent real-time inference is possible under the assumption that we can detect a face and crop its facial features properly for each incoming frame in the video stream, or, per the paper, for 11 consecutive frames of cropped faces? Then we need to encode the cropped images and the corresponding audio using the 2D ResNet with TSM, correct? And that encoding step has its own computational cost and latency?

Thank you so much, Hiro

kylemin commented 1 year ago

Hi Hiro,

Yes, all of your assumptions are correct. Our code assumes that the face bounding boxes and their initial audio-visual features are computed by other models. I hope this clarifies your questions!
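As a rough illustration of the pipeline Hiro describes, the following skeleton buffers 11 consecutive face crops per the paper, then produces one fused audio-visual feature per frame. The face detector and the encoders (standing in for the 2D ResNet with TSM) are placeholder callables, not the actual models:

```python
# Hypothetical real-time pipeline skeleton. detect_faces, encode_visual, and
# encode_audio are stand-ins for external models (a face detector and the
# 2D ResNet with TSM feature extractors); they are assumptions, not repo code.
from collections import deque
import numpy as np

def run_stream(frames, audio_chunks, detect_faces, encode_visual, encode_audio,
               clip_len=11):
    """Yield one fused audio-visual feature per frame once clip_len crops are buffered."""
    face_buf = deque(maxlen=clip_len)
    for frame, audio in zip(frames, audio_chunks):
        crops = detect_faces(frame)            # face detection + cropping
        if not crops:
            continue                           # no face detected: nothing to classify
        face_buf.append(crops[0])              # track a single face for simplicity
        if len(face_buf) < clip_len:
            continue                           # need 11 consecutive crops (per the paper)
        v = encode_visual(np.stack(face_buf))  # (clip_len, H, W, C) -> visual feature
        a = encode_audio(audio)                # audio feature for the same window
        yield np.concatenate([v, a])           # node feature to feed the graph
```

Each yielded feature would become one node in the online graph, so the end-to-end latency is the detector plus encoder time per frame on top of the graph inference itself.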

Best regards, Kyle

hugoobui commented 1 year ago

Hello @hsato1, did you manage to make it work in real time? Thank you so much