Davidyao99 opened this issue 1 year ago
Great project! I was looking through the annotations for the NLQ task and noticed that there can be multiple instances in a video that answer a given query. In the paper, it seems that queries are chosen so that the answers are unambiguous.

An example of such ambiguity is in video id 3534864b-2289-4aaf-b3ed-10eeeee7acd2 with the query "Where did I put the scooper?". The ground truth is given as around 1675s.

These seem to be appropriate responses to the query that differ from the ground truth, and they also fall within the time interval of the clip.

Hello @Davidyao99,

Thanks for your question. Here are the verbatim guidelines used in the annotation process (intermediate part skipped for brevity):

"For the specific object queries ('when did I last see X; where did I put X?'), be sure to annotate only the last occurrence of that object. ... We only want to ensure that the marked object has not moved between the time it was marked and the end of the video."

Natural Language Queries (NLQ) are assumed to be asked at the end of the video, so the right window is the last occurrence of the object. Some noise due to annotator errors is likely. Do you know how often such instances occur?
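For anyone auditing their local copy of the annotations, the "last occurrence" rule above can be checked mechanically. Below is a minimal, hypothetical sketch: the function name, the dictionary layout, and all timestamps are assumptions for illustration (they are not the actual Ego4D JSON schema or real values from this video), but the logic mirrors the guideline: a query is flagged when its ground-truth window is not the latest plausible window.

```python
# Hypothetical sketch (not the official Ego4D format): flag NLQ queries
# whose ground-truth window is not the last candidate occurrence.

def last_occurrence_mismatches(annotations, candidates):
    """annotations: {query: (start_sec, end_sec)} ground-truth window.
    candidates: {query: [(start_sec, end_sec), ...]} every plausible
    window for that query (e.g. each appearance of the object).
    Returns the queries whose ground truth is not the latest window,
    i.e. potential annotator errors under the 'last occurrence' rule."""
    flagged = []
    for query, gt in annotations.items():
        # Sort candidate windows chronologically by start time.
        windows = sorted(candidates.get(query, []))
        if windows and windows[-1] != gt:
            flagged.append(query)
    return flagged

# Toy data loosely echoing the scooper example (numbers are made up):
# ground truth ~1675s, but a later plausible window also exists.
gt = {"Where did I put the scooper?": (1675.0, 1680.0)}
cand = {
    "Where did I put the scooper?": [
        (300.0, 305.0),
        (1675.0, 1680.0),
        (1820.0, 1825.0),
    ]
}
print(last_occurrence_mismatches(gt, cand))
# → ['Where did I put the scooper?']
```

Running this over a full annotation file (after adapting the loader to the real JSON layout) would give a rough count of how often the ground truth is not the last occurrence, which is the error rate asked about above.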