I found that seq and pos (the GT caption and its corresponding GT pos) within each batch do not match up one-to-one. In the dataloader, the number of captions and the number of pos entries for the same video are even different. Shouldn't they be equal in number and aligned one-to-one?
Each video has multiple GT captions, and any of them can be used for training, so I don't think a video should be limited to exactly one caption. But the alignment problem you point out does exist.
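For illustration, here is a minimal sketch of one way to keep a sampled caption and its pos entry aligned. It is not the repository's actual dataloader code: the function name `sample_aligned_pair` and the assumption that the captions and pos entries of a video are stored as two parallel lists in the same order are hypothetical.

```python
import random


def sample_aligned_pair(captions, pos_entries, rng=random):
    """Pick one GT caption and the pos entry stored at the same index.

    Hypothetical helper: assumes `captions` and `pos_entries` are parallel
    lists for a single video, so index i in one matches index i in the other.
    """
    if len(captions) != len(pos_entries):
        # Different counts mean the two lists cannot be paired by index.
        raise ValueError(
            f"{len(captions)} captions but {len(pos_entries)} pos entries"
        )
    idx = rng.randrange(len(captions))  # one shared index for both lists
    return captions[idx], pos_entries[idx]
```

Drawing a single index and using it for both lists still lets any of the video's captions be chosen for training, while the length check surfaces the count mismatch described above instead of silently pairing a caption with the wrong pos entry.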
Hi,
Thank you!