facebookresearch / MMCSG

This repository contains the baseline system for the CHiME-8 MMCSG challenge, which focuses on transcribing both sides of a conversation where one participant is wearing smart glasses equipped with a microphone array and camera.

Doubts about the rules of the competition #4

Closed rookie0607 closed 5 months ago

rookie0607 commented 5 months ago

The rules of the competition include the clause "The submitted system must be streaming, i.e. process the input in chronological order and specify a delay time for each word sent, as detailed in the subsection "Evaluation". The system must not use any global information from the recording until it has been processed chronologically. Such global information may include global normalization, non-streaming speaker identification, or diarization. This requirement regarding streaming processing applies to all modalities (audio, visual, accelerometer, gyroscope, etc.)." Do front-end processing modules such as continuous speech separation (CSS) also have to be streaming? Or is it allowed to first separate the whole recording with an offline CSS model and then feed the result into a streaming ASR model for recognition?

zmolikova commented 5 months ago

Latency is evaluated for the entire system, taking all components into account. That is, if the front-end separation uses any future information, this needs to be reflected in the word timestamps and in the evaluated latency. For evaluation, however, we will divide the systems into four categories based on their average latency, with thresholds of 1000 ms, 350 ms, and 150 ms. A non-streaming system can still be submitted and will fall into the category of latency > 1000 ms. This is currently not completely clear from the wording of the rules on the website, and we will update them to emphasise it better.
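To make the four categories concrete, here is a minimal sketch of how an average latency could be mapped to one of them. The function name, the labels, and the handling of values that fall exactly on a threshold (counted in the lower category here) are assumptions for illustration only and are not taken from the official evaluation code.

```python
from bisect import bisect_left

# Thresholds quoted in this thread, in milliseconds.
THRESHOLDS_MS = (150, 350, 1000)

# Hypothetical labels; the official category names may differ.
LABELS = ("<= 150 ms", "150-350 ms", "350-1000 ms", "> 1000 ms")


def latency_category(mean_latency_ms: float) -> str:
    """Return the latency category for a system's average latency.

    Assumption: a value exactly equal to a threshold belongs to the
    lower category. The thread does not specify boundary handling.
    """
    # bisect_left counts how many thresholds are strictly below the value.
    return LABELS[bisect_left(THRESHOLDS_MS, mean_latency_ms)]


if __name__ == "__main__":
    for latency in (120.0, 300.0, 800.0, 45000.0):
        print(latency, "->", latency_category(latency))
```

Under this sketch, a system with an offline front-end would typically end up in the last category, since its average latency is on the order of the recording length.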

If the CSS module of your system uses the entire recording to generate its outputs, please set the word timestamps, which are used to compute the latency, to the length of the entire recording. Furthermore, if the ASR module of your system is streaming, you may also compute the latency of this module separately and report it in the system description. We are definitely interested in such submissions too. But to keep the rules fair, we will rank systems based on their overall latency.
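As a rough illustration of the timestamp rule above, the sketch below overwrites every word timestamp with the recording length when the front-end is offline. The `Word` class, the function names, and the latency definition used (emission timestamp minus a reference word end time) are all hypothetical and only serve the example; the actual hypothesis format and latency computation are defined by the challenge's evaluation description.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Word:
    """Hypothetical container for one recognized word."""
    text: str
    emit_time_s: float  # time at which the full system could have emitted the word


def stamp_offline(words: Sequence[Word], recording_length_s: float) -> List[Word]:
    """Set every word timestamp to the recording length.

    This reflects that an offline CSS front-end needs the whole recording
    before any word can be produced.
    """
    return [replace(w, emit_time_s=recording_length_s) for w in words]


def mean_latency_ms(words: Sequence[Word], ref_end_times_s: Sequence[float]) -> float:
    """Average of (emission time - reference word end time), in milliseconds.

    Assumption: this simple definition stands in for the official one.
    """
    if len(words) != len(ref_end_times_s):
        raise ValueError("need one reference end time per word")
    delays = [w.emit_time_s - ref for w, ref in zip(words, ref_end_times_s)]
    return 1000.0 * sum(delays) / len(delays)


if __name__ == "__main__":
    # Timestamps from a streaming ASR module run on offline-separated audio.
    asr_words = [Word("hello", 1.2), Word("there", 2.0)]
    ref_ends = [1.0, 1.8]  # made-up reference word end times

    # Latency of the ASR module alone, reportable in the system description.
    print("ASR-only latency (ms):", mean_latency_ms(asr_words, ref_ends))

    # Overall latency used for ranking: words stamped with the recording length.
    overall = stamp_offline(asr_words, recording_length_s=60.0)
    print("Overall latency (ms):", mean_latency_ms(overall, ref_ends))
```

Keeping both numbers makes it possible to report the streaming ASR latency separately while still submitting the overall latency that the ranking is based on.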

zmolikova commented 5 months ago

Closing the issue. Feel free to re-open or make a new issue if you have any further questions.