Closed cleong110 closed 1 month ago
Official Citation:
@article{Zhao_Hu_Zhou_Shi_Li_2023, title={BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization}, volume={37}, url={https://ojs.aaai.org/index.php/AAAI/article/view/25470}, DOI={10.1609/aaai.v37i3.25470}, abstractNote={In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.}, number={3}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, author={Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang}, year={2023}, month={Jun.}, pages={3597-3605} }
Reformatted:
@article{Zhao2023BESTPretrainingSignLanguageRecognition,
title = {BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization},
volume = {37},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/25470},
doi = {10.1609/aaai.v37i3.25470},
number = {3},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
author = {Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang},
year = {2023},
month = {Jun.},
pages = {3597--3605}
}
Abstract:
In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.
c.f. #14
SignBERT+ figure
SignBERT+ uses ONLY hand poses. They say: "We organize the pre-extracted 2D poses of both hands as the visual token sequence."
Also different: SignBERT+ talks about joint, frame, and clip masking, and it sounds like BEST doesn't do different levels of masking, just the frame-level?
Back to BEST:
Our contributions are summarized as follows:

- We propose a self-supervised pre-trainable framework. It leverages the BERT success, jointly with the specific design for the sign language domain.
- We organize the main hand and body movement as the pose triplet unit and propose the masked unit modeling (MUM) pretext task. To utilize the BERT objective, we generate the pseudo label for this task via coupling tokenization on the pose triplet unit.
- Extensive experiments on downstream SLR validate the effectiveness of our proposed method, achieving new state-of-the-art performance on four benchmarks with a notable gain.
Datasets:
We conduct experiments on four public sign language datasets, i.e., NMFs-CSL (Hu et al. 2021b), SLR500 (Huang et al. 2018), WLASL (Li et al. 2020a) and MSASL (Joze and Koller 2018).
They compare with SignBERT (but not SignBERT+?) Yeah, they cite the 2021 one.
Global-local enhancement network for NMF-aware sign language recognition is "HMA".
Honestly, I think this one should not be merged until #14 is merged
Interesting:
Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit.
OK, what's that mean?
However, the main obstacle to leverage its success in video SLR is the different characteristics of the input signal. In NLP, the input word token is discrete and pre-defined with high semantics. In contrast, the video signal of sign language is continuous with the spatial and temporal dimensions. This signal is quite low-level, making the original BERT objective not applicable. Besides, since the sign language video is mainly characterized by hand and body movements, the direct adoption of the BERT framework may not be optimal.
Basically, our framework contains two stages, i.e., self-supervised pre-training and downstream fine-tuning. During pre-training, we propose the masked unit modeling (MUM) pretext task to capture the context cues. The input hand or body unit embedding is randomly masked, and then the framework reconstructs the masked unit from this corrupted input sequence. Similar to BERT, self-reconstruction is optimized via the cross-entropy objective. To this end, we jointly tokenize the pose triplet unit as the pseudo label, which represents the gesture/body state. After pre-training, the pre-trained Transformer encoder is fine-tuned with the newly added prediction head to perform the SLR task.
"Pose Triplet Unit"?
Oh this seems important, they're using a d-VAE. c.f. https://github.com/sign-language-processing/sign-language-processing.github.io/pull/37
The tokenization provides pseudo labels for our designed pretext task during pre-training. We utilize a discrete variational autoencoder (d-VAE) to jointly convert the pose triplet unit into the triplet tokens (body, left and right hand), motivated by VQ-VAE (Van Den Oord, Vinyals et al. 2017).
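For intuition, the quantization at the heart of a VQ-VAE-style tokenizer is a nearest-neighbor lookup into a learned codebook. Here's a minimal numpy sketch; the feature dimension, codebook size, and the random "codebook" are all made up for illustration (the real d-VAE learns the codebook jointly with an encoder/decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64-dim pose-part features, 512 codebook entries.
D, K = 64, 512
codebook = rng.normal(size=(K, D))  # in the real model, learned during d-VAE training

def tokenize(part_features):
    """Map continuous pose-part features (T, D) to discrete token ids (T,)
    via nearest-neighbor lookup in the codebook (VQ-VAE style)."""
    # squared L2 distance from every frame's feature to every codebook entry
    dists = ((part_features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # one token id per frame

# e.g. tokenize each part of the triplet separately (toy left-hand features here)
hand_feats = rng.normal(size=(16, D))  # 16 frames
tokens = tokenize(hand_feats)          # shape (16,), integer ids in [0, K)
```

These discrete ids are what stand in for BERT's word tokens: the pretext task's cross-entropy target for a masked unit is its token id.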
Our designed pretext task is MUM, which aims to exploit the hierarchical correlation context among internal and external triplet pose units. Given a pose sequence with a triplet pose unit of length $T$, we first randomly choose the $\alpha \cdot T$ frames to process the mask operation. For clarification, we define three parts of the pose triplet unit as $f^{l}_{sign,t}$, $f^{r}_{sign,t}$ and $f^{b}_{sign,t}$, respectively. If a unit is masked, a learnable masked token $e_{mask} \in \mathbb{R}^{D_{part}}$ is utilized to replace each part of the triplet unit with 50% probability. Therefore, the masked triplet unit includes three masking cases: only hand masked, only body masked and hand-body masked.
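That masking recipe can be sketched roughly like this. Note the re-sampling to guarantee at least one part gets masked per chosen frame is my assumption, inferred from the three masking cases they list; α = 0.5 is just a placeholder:

```python
import random

def choose_masking(T, alpha=0.5, seed=0):
    """For a length-T pose sequence, pick ~alpha*T frames to corrupt; within
    each chosen frame, mask each part of the (left hand, right hand, body)
    triplet independently with 50% probability. In the real model, a masked
    part's embedding is replaced by a learnable e_mask vector."""
    rng = random.Random(seed)
    frames = rng.sample(range(T), int(alpha * T))
    plan = {}
    for t in frames:
        while True:
            mask = {p: rng.random() < 0.5 for p in ("left_hand", "right_hand", "body")}
            if any(mask.values()):  # assumption: re-sample so the frame is actually corrupted
                break
        plan[t] = mask  # e.g. only hands, only body, or hands + body
    return plan

plan = choose_masking(T=10, alpha=0.5)
```

The transformer then has to predict the d-VAE token of each masked part from the surrounding, uncorrupted context.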
OK, in my own words now, real informally:
So the thing to know about BEST is that they wanted to do BERT-style masked language modeling, but, you know, BERT assumes you've already got discrete, semantically meaningful tokens. So they were, like, well, we've got these triplets of left hand, right hand, body (no face keypoints, mind you!), let's make those into triplets and couple them together. OK, so then what? Well, van den Oord and Vinyals wrote "Neural discrete representation learning", which lets you take continuous signals and turn them into discrete codes, with like a codebook and stuff. So they use one of those things to make a tokenizer: they put the coupled triplets into it and get discrete tokens out. OK, then what? Well, then you've got the discrete tokens, and you mask hand, body, or both, and the transformer has to reconstruct the correct hand position, or body position, or whatever, from the surrounding context. "You shall know a ~word~ handshape by the company it keeps," I suppose. (And you use positional encodings to inform the model about temporal stuff; SignBERT+ does that too.) And then they tried it on isolated SLR and it seemed to work pretty well.
Also let's do a quick compare/contrast...
- SignBERT+ results vs. BEST results (table screenshots) → WINNER: SignBERT+
- SignBERT+ vs. BEST (table screenshots) → WINNER: SignBERT+ again
- SignBERT+ vs. BEST (table screenshots) → WINNER: TIE
OK, here's the transcript of the ChatGPT-assisted rewrite process: https://chatgpt.com/share/f42199e8-bd03-4ec3-9913-e5e832bb2885
My original summary was flawed in a number of ways.
@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST, a pretraining method based on masked modeling of pose sequences using a coupled tokenization scheme.
The method takes in pose triplet units (left hand, right hand, and upper-body with arms) as inputs.
The pose for each part of the triplet is tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Then masked modeling is employed: any or all of the three parts may be masked, e.g. left hand, or right hand, or body+hand, or all of them...
Unlike @hu2023SignBertPlus, they do not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints).
They validate their pretraining method on isolated sign recognition (ISR): MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR] and NMFs-CSL [@hu2021NMFAwareSLR].
They experiment with both pose-to-gloss and video-to-gloss via fusion with I3D [@carreira2017quo].
Results on these datasets are SOTA compared to previous methods, and quite similar to those of SignBERT+ [@hu2023SignBertPlus].
Here's the first rewrite suggestions:
@Zhao2023BESTPretrainingSignLanguageRecognition introduced BEST, a pre-training method for sign language recognition based on masked modeling of pose sequences using a coupled tokenization scheme.
The method uses pose triplet units, consisting of the left hand, right hand, and upper-body with arms, as inputs.
Each part of the pose triplet is tokenized into discrete codes to facilitate the modeling process [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Masked modeling is then employed, where any or all of the three parts may be masked, such as the left hand, right hand, or upper-body with hands.
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints).
They validated their pre-training method on isolated sign recognition (ISR) datasets: MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
They experimented with both pose-to-gloss and video-to-gloss tasks, incorporating fusion with I3D [@carreira2017quo].
The results on these datasets demonstrated state-of-the-art (SOTA) performance compared to previous methods and were similar to those achieved by SignBERT+ [@hu2023SignBertPlus].
Here's the second:
Zhao et al. (2023) introduce BEST, a pretraining framework designed to enhance sign language recognition (SLR) by leveraging BERT's masked modeling approach and a novel coupled tokenization scheme.
The method uses pose triplet units, which include the left hand, right hand, and upper-body (including arms), as input data.
Each component of the pose triplet unit is tokenized into discrete codes, following the approach by van den Oord et al. (2017).
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked to learn the hierarchical correlation among them.
Unlike SignBERT+ by Hu et al. (2023), the BEST framework does not mask multi-frame sequences ("clips") or sub-frame portions (joints) of a pose unit.
The authors validate their pretraining method on isolated sign recognition (ISR) tasks using the MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR] datasets.
The authors conduct experiments on both pose-to-gloss and video-to-gloss tasks by integrating their method with the Inflated 3D ConvNet (I3D) [@carreira2017quo].
The results demonstrate state-of-the-art (SOTA) performance on all evaluated datasets, showing notable improvements over previous methods and comparable results to SignBERT+ [@hu2023SignBertPlus].
In addition the PR had various suggestions, e.g. https://github.com/sign-language-processing/sign-language-processing.github.io/pull/61#discussion_r1634494728
@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme.
This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them.
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame pose sequences or individual joints.
The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D [@carreira2017quo].
Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ [@hu2023SignBertPlus].
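About that I3D fusion: I'm guessing the video-to-gloss setup amounts to some form of late fusion of per-class scores from the pose branch and the RGB (I3D) branch; the paper may combine them differently, and the weighting here is a placeholder. A toy sketch:

```python
import numpy as np

def late_fusion(pose_logits, rgb_logits, w=0.5):
    """Hypothetical late fusion: softmax each branch's class scores, then take
    a weighted average. `w` balances the pose branch against the I3D branch."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return w * softmax(pose_logits) + (1 - w) * softmax(rgb_logits)

pose = np.array([[2.0, 0.5, 0.1]])  # toy per-class scores from the pose model
rgb = np.array([[0.2, 1.5, 0.3]])   # toy per-class scores from I3D on RGB video
fused = late_fusion(pose, rgb)       # still a probability distribution per sample
pred = fused.argmax(axis=-1)         # fused class prediction
```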
Merged
https://ojs.aaai.org/index.php/AAAI/article/view/25470
- dataset: in the bibtex. Exclude wordy abstracts. (Better BibTeX extension to Zotero can exclude keys)
- PR: git merge master on branch
- Writing/style: