hulianyuyy closed 2 years ago
@hulianyuyy There are two widely used ways to align the labels with features: the greedy alignment (selecting the most likely gloss at each step) and the dominant alignment (DNF, TMM'19). We adopt the greedy alignment for simplicity and show the network predictions. The alignment between features and frames is simply based on the temporal receptive field. For example, the temporal receptive field of the Subgloss-wise conv1d (C5-P2) is 6, so the logit z_t corresponds to frames f[t*2, t*2+6] (no-padding setting).
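For readers who want to try this, here is a minimal sketch of the greedy alignment and the receptive-field mapping described above. It is not the authors' code. The function name `greedy_align`, the blank index 0, and the stride of 2 (inferred from the P2 pooling) are assumptions for illustration; the receptive field of 6 comes from the reply above.

```python
import numpy as np

def greedy_align(logits, stride=2, receptive_field=6, blank=0):
    """Map each non-blank greedy prediction to the frame span it covers.

    logits: array of shape (T, V), one row of class scores per time step.
    stride, blank: assumed values (stride 2 from P2 pooling, blank index 0).
    Returns a list of (start_frame, end_frame_exclusive, label) tuples.
    """
    logits = np.asarray(logits)
    # Greedy alignment: pick the most likely gloss at each time step.
    labels = logits.argmax(axis=1)
    spans = []
    for t, label in enumerate(labels):
        if label == blank:
            continue  # skip blank predictions
        # With no padding, step t covers receptive_field frames from t * stride.
        start = t * stride
        spans.append((start, start + receptive_field, int(label)))
    return spans

# Toy example: 4 time steps, 3 classes (class 0 treated as blank).
logits = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
])
spans = greedy_align(logits)
print(spans)
```

Neighbouring spans overlap because the stride is smaller than the receptive field, so a plot like Fig.5 would still need a rule for frames covered by more than one step (for instance, taking the step with the highest score).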
One question that confuses me is how to find the ground-truth labels of frames (as marked in Fig.5 in your paper). As far as we know, the Phoenix dataset only provides labels at the video level; they are not precisely assigned to frames.
@hulianyuyy We manually labelled several sequences for visualization. (This website may be helpful for Phoenix).
Many thanks for your kind reply.
Thanks for your great work. I'm wondering how to draw a figure like Fig.5 in your paper. The key point is how to align labels with frames. Could you provide some advice? Thanks in advance!