yusuke-intern closed this 1 month ago
We acknowledge that our candidate options include actions that have already occurred in the video. However, the action narrations provided in task_progress_metadata are intended as a reference only. During model inference, using information from the ground-truth action narrations is not allowed; the model must rely solely on visual observations to infer task progress. Therefore, using the ground-truth task_progress_metadata to eliminate options is not an appropriate approach.
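To make the restriction concrete, here is a minimal sketch of how an evaluation script might keep reference-only fields away from the model. It is not the benchmark's actual code: only the name task_progress_metadata comes from this thread, while the helper `build_inference_input`, the `REFERENCE_ONLY_KEYS` set, and the sample fields `video_path` and `options` are hypothetical names chosen for illustration.

```python
from typing import Any, Dict

# Assumption: fields listed here are reference-only and must not reach the model.
# Only "task_progress_metadata" is named in the thread.
REFERENCE_ONLY_KEYS = {"task_progress_metadata"}


def build_inference_input(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the sample with reference-only fields removed."""
    return {k: v for k, v in sample.items() if k not in REFERENCE_ONLY_KEYS}


# Hypothetical sample; every field name except task_progress_metadata is illustrative.
sample = {
    "video_path": "example.mp4",
    "options": ["pick up cup", "pour water", "open door"],
    "task_progress_metadata": [{"narration": "pick up cup"}],
}

# The model would see the video and the candidate options, but no narrations.
model_input = build_inference_input(sample)
```

With a filter like this, the candidate options can still contain actions that have already occurred, but the model has to rule them out from the video itself rather than from the narrations.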
I ran a meta-analysis of each dataset and found some interesting results.
(This assumes that the narration text correctly describes the action in the video.) Please note that I may have made mistakes.