Yuhan-Shen / VisualNarrationProceL-CVPR21


How are you generating ground truth for narrations? #2

Closed thechargedneutron closed 3 years ago

thechargedneutron commented 3 years ago

Thanks for the good work. I have a simple question regarding data processing. For the video clips, the CrossTask dataset provides annotations. For example, the annotations in 113766_JFnZHAOUClw.csv are as follows:

1,40.51,44.21
2,46.43,48.93
2,51.44,52.84
3,65.4,68.4
4,76.12,77.92
6,78.25,82.65
3,89.24,91.14
4,98.59,100.29
8,118.06,121.06
10,121.92,126.22
8,127.71,130.61
10,133.72,137.72

which means the step "season steak" happens (visually) from 40.51 s to 44.21 s, and so on.
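For reference, here is a minimal sketch (my own illustration, not code from this repo) of how these annotation rows could be parsed; each row is `step_index,start_time,end_time` in seconds, and the file name is just the example above:

```python
import csv

def load_annotations(path):
    """Return a list of (step_index, start_sec, end_sec) tuples
    from a CrossTask-style annotation CSV."""
    segments = []
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            step, start, end = int(row[0]), float(row[1]), float(row[2])
            segments.append((step, start, end))
    return segments

# e.g. load_annotations("113766_JFnZHAOUClw.csv")
# -> [(1, 40.51, 44.21), (2, 46.43, 48.93), ...]
```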

But for the narrations, how are you mapping key-steps to narrations? In Table 1 of the paper you show the mapping in bold, but I could not find how that is done in the code. Can you please point me to it? I need ground-truth narrations mapped to the key-steps for my research.

Yuhan-Shen commented 3 years ago

For Table 1 in the paper, we manually marked the alignment between the extracted verb phrases in narrations and the ground-truth key-steps. We don't have code to do that; we only manually marked the key-steps in narrations for a few videos.

We tried comparing the similarity between semantic embeddings of the narrations and the key-steps to localize the key-steps. This can produce some meaningful alignments, but the alignment is not perfect and its quality is difficult to quantify.
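For anyone who wants a rough automatic alignment along those lines, a minimal sketch of such similarity-based matching is below. It assumes the narration phrases and key-step names have already been embedded with some sentence encoder (the choice of encoder and the `threshold` value are my assumptions, not the authors' setup):

```python
import numpy as np

def align_narrations_to_steps(narration_emb, step_emb, threshold=0.5):
    """narration_emb: (N, d) array of narration-phrase embeddings.
    step_emb: (K, d) array of key-step embeddings.
    Returns, for each narration phrase, the index of the most similar
    key-step, or -1 if the best cosine similarity is below `threshold`."""
    # L2-normalize so dot products are cosine similarities
    n = narration_emb / np.linalg.norm(narration_emb, axis=1, keepdims=True)
    s = step_emb / np.linalg.norm(step_emb, axis=1, keepdims=True)
    sim = n @ s.T                      # (N, K) cosine-similarity matrix
    best = sim.argmax(axis=1)          # most similar key-step per narration
    best_sim = sim.max(axis=1)
    return np.where(best_sim >= threshold, best, -1)
```

As noted above, such an automatic alignment is noisy, which is why the figures in the paper rely on manual marking instead.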

thechargedneutron commented 3 years ago

Makes sense. Just to confirm my understanding: are the key-steps in narrations shown in Figure 1 also marked manually?

Yuhan-Shen commented 3 years ago

Yes, you are correct.

thechargedneutron commented 3 years ago

Thanks!