I was searching for a new research direction and found that your work might be a good starting point, since there is still a large gap in this new subtask. I'm trying to reproduce essentially all of the results in the paper, but I don't know the literature in this specific field well, so could you please answer a few questions about the paper?
What is the source of Figure 1 in the paper? Is there any way to download the original videos for the case study and visualization? As far as I know, the official release of the TVR dataset only contains pre-extracted features without the exact moments (start/end timestamps), so did you find the original videos and check the corresponding attention weights manually?
Is it possible for a text query to correspond to multiple moments/clips in an untrimmed video? If so, I think there might be room for improvement, since the MS-SL work only uses the single key clip with the highest similarity (see the sketch below for what I mean).
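Just to make that last question concrete, here is a minimal sketch of what I have in mind. This is not your code or MS-SL's implementation; the function names, the top-k averaging, and the feature dimensions are all my own assumptions for illustration.

```python
# Sketch: single key-clip scoring (top-1 similarity, as in MS-SL) vs. a
# hypothetical top-k variant that lets a query matching several disjoint
# moments draw on more than one clip. All names/dims are placeholders.
import torch
import torch.nn.functional as F

def key_clip_score(clip_embs: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    """Score a video by the cosine similarity of its single best-matching clip."""
    sims = F.cosine_similarity(clip_embs, query_emb.unsqueeze(0), dim=-1)  # (num_clips,)
    return sims.max()

def top_k_clip_score(clip_embs: torch.Tensor, query_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Hypothetical variant: average the k most similar clips, so multiple
    relevant moments in one untrimmed video all contribute to the score."""
    sims = F.cosine_similarity(clip_embs, query_emb.unsqueeze(0), dim=-1)
    k = min(k, sims.numel())
    return sims.topk(k).values.mean()

# Toy usage with random features
clips = torch.randn(32, 512)   # 32 clip embeddings from one untrimmed video
query = torch.randn(512)       # one text query embedding
print(key_clip_score(clips, query).item(), top_k_clip_score(clips, query).item())
```

I'm curious whether something along these lines was considered, or whether the single-key-clip assumption is actually a good fit for how queries are annotated in TVR.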