VIPL-SLP / VAC_CSLR

Visual Alignment Constraint for Continuous Sign Language Recognition. (ICCV 2021)
https://openaccess.thecvf.com/content/ICCV2021/html/Min_Visual_Alignment_Constraint_for_Continuous_Sign_Language_Recognition_ICCV_2021_paper.html
Apache License 2.0

Issue about alignment between label and frames. #4

Closed. hulianyuyy closed this issue 2 years ago

hulianyuyy commented 2 years ago

Thanks for your great work. I'm wondering how to draw a figure like Fig. 5 in your paper. The key point is how to align labels with frames. Could you provide some advice? Thanks in advance!

ycmin95 commented 2 years ago

@hulianyuyy There are two widely used ways to align labels with features: greedy alignment (selecting the most likely gloss at each time step) and dominant alignment (DNF, TMM'19). We adopt greedy alignment for simplicity and show the network predictions. The alignment between features and frames is simply based on the temporal receptive field. For example, the temporal receptive field of the subgloss-wise conv1d (C5-P2) is 6, so the logit z_t corresponds to frames f[t*2, t*2+6] (no-padding setting).
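
For readers who want to try this, here is a minimal sketch of the two steps described above: greedy alignment (argmax per time step) and mapping each logit back to a frame range through the receptive field. It is not code from the VAC_CSLR repository. The function names, the blank index of 0, and the reading of the frame range as half-open and starting at frame 2t (stride 2 from the P2 pooling) are all assumptions made for illustration.

```python
from itertools import groupby


def greedy_alignment(logits):
    """Return the index of the most likely class at each time step."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def frame_span(t, stride=2, receptive_field=6):
    """Half-open frame range [start, end) covered by logit t (no padding).

    The defaults follow the C5-P2 example: receptive field 6, and an
    assumed temporal stride of 2 coming from the pooling layer.
    """
    start = t * stride
    return start, start + receptive_field


def labelled_segments(logits, blank=0, stride=2, receptive_field=6):
    """Merge runs of identical predictions into (label, start, end) segments.

    Blank predictions are skipped; `blank=0` is an assumption, adjust it
    to match the vocabulary of the model being visualized.
    """
    path = greedy_alignment(logits)
    segments = []
    t = 0
    for label, run in groupby(path):
        length = len(list(run))
        if label != blank:
            start, _ = frame_span(t, stride, receptive_field)
            _, end = frame_span(t + length - 1, stride, receptive_field)
            segments.append((label, start, end))
        t += length
    return segments


# Toy scores for 4 time steps over 3 classes (class 0 is the assumed blank).
toy_logits = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
]
segments = labelled_segments(toy_logits)
```

Neighbouring segments can overlap in frame indices because the receptive fields of adjacent logits overlap. For a deeper temporal stack, pass the accumulated `stride` and `receptive_field` of the whole stack instead of the defaults.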

hulianyuyy commented 2 years ago

What confuses me is how to find the ground-truth labels of frames (as marked in Fig. 5 in your paper). As far as we know, the Phoenix dataset only provides labels at the video level, not precisely assigned to frames.

ycmin95 commented 2 years ago

@hulianyuyy We manually labelled several sequences for visualization. (This website may be helpful for Phoenix.)

hulianyuyy commented 2 years ago

Many thanks for your kind reply.