> For copyright reasons, we only publicly share the annotated labels and the corresponding video timestamps.
I have a question about this statement in the paper: 'To enhance labor efficiency and reduce costs, we employ a large model like GPT-3.5 for coarse-grained annotation, followed by manual fine-grained calibration. The overall annotation accuracy of GPT-3.5 is about 25%, which is low.' However, GPT-3.5 cannot process multimodal data such as video frames. What kind of data was actually fed to the model in this coarse annotation step? Did I misunderstand something, or am I missing a detail?
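To clarify what I'm asking: the only way I can imagine GPT-3.5 fitting into this pipeline is if the coarse annotation was run over text associated with the videos, such as subtitles, ASR transcripts, or metadata. A minimal sketch of what I have in mind is below; the function name, prompt, and label set are my own hypothetical examples, not taken from the paper.

```python
# Hypothetical sketch of text-only coarse annotation with GPT-3.5.
# `label_transcript`, the prompt, and the label set are my guesses,
# not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def label_transcript(transcript: str, label_set: list[str]) -> str:
    """Ask GPT-3.5 to pick one coarse label for a video segment's transcript."""
    prompt = (
        "You are labeling video segments from their transcripts.\n"
        f"Allowed labels: {', '.join(label_set)}\n"
        f"Transcript: {transcript}\n"
        "Answer with exactly one label."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for annotation
    )
    return resp.choices[0].message.content.strip()


# Coarse labels like this would then be manually calibrated, per the paper.
print(label_transcript(
    "...and then he dribbles past two defenders and shoots...",
    ["sports", "cooking", "news"],
))
```

Is something along these lines what was done, or was a different text signal used?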
Also, would it be possible for you to consider open-sourcing the raw dataset? I believe this would greatly benefit the research community and further enhance the impact of your work.