cleong110 / sign-language-processing.github.io

Documentation and background of sign language processing

BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization #19

Closed cleong110 closed 1 month ago

cleong110 commented 1 month ago

https://ojs.aaai.org/index.php/AAAI/article/view/25470

PR:

Writing/style:

cleong110 commented 1 month ago

Official Citation:

@article{Zhao_Hu_Zhou_Shi_Li_2023, title={BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization}, volume={37}, url={https://ojs.aaai.org/index.php/AAAI/article/view/25470}, DOI={10.1609/aaai.v37i3.25470}, abstractNote={In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.}, number={3}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, author={Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang}, year={2023}, month={Jun.}, pages={3597-3605} }
cleong110 commented 1 month ago

Reformatted:

@article{Zhao2023BESTPretrainingSignLanguageRecognition,
  title        = {BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization},
  volume       = {37},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/25470},
  doi          = {10.1609/aaai.v37i3.25470},
  number       = {3},
  journal      = {Proceedings of the AAAI Conference on Artificial Intelligence},
  author       = {Zhao, Weichao and Hu, Hezhen and Zhou, Wengang and Shi, Jiaxin and Li, Houqiang},
  year         = {2023},
  month        = jun,
  pages        = {3597--3605}
}
cleong110 commented 1 month ago

Abstract:

In this work, we are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition~(SLR) model. Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence, which learns the hierarchical correlation context cues among internal and external triplet units. Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally locating in continuous space, which prevents the direct adoption of the BERT cross entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit. It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture / body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.

cleong110 commented 1 month ago

c.f. #14

[SignBERT+ figure omitted]

SignBERT+ uses ONLY hand poses. They say: "We organize the pre-extracted 2D poses of both hands as the visual token sequence."

Also different: SignBERT+ talks about joint, frame, and clip masking, and it sounds like BEST doesn't do different levels of masking, just the frame-level?

cleong110 commented 1 month ago

Back to BEST:

Our contributions are summarized as follows:

• We propose a self-supervised pre-trainable framework. It leverages the BERT success, jointly with the specific design for the sign language domain.

• We organize the main hand and body movement as the pose triplet unit and propose the masked unit modeling (MUM) pretext task. To utilize the BERT objective, we generate the pseudo label for this task via coupling tokenization on the pose triplet unit.

• Extensive experiments on downstream SLR validate the effectiveness of our proposed method, achieving new state-of-the-art performance on four benchmarks with a notable gain.

cleong110 commented 1 month ago

Datasets:

We conduct experiments on four public sign language datasets, i.e., NMFs-CSL (Hu et al. 2021b), SLR500 (Huang et al. 2018), WLASL (Li et al. 2020a) and MSASL (Joze and Koller 2018).

cleong110 commented 1 month ago

[results table omitted] They compare with SignBERT (but not SignBERT+?). Yeah, they cite the 2021 one.

Global-local enhancement network for NMF-aware sign language recognition is "HMA".

cleong110 commented 1 month ago

Honestly, I think this one should not be merged until #14 is merged

cleong110 commented 1 month ago

Interesting:

Notably, different from the highly semantic word token in BERT, the pose unit is a low-level signal originally located in continuous space, which prevents the direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit.

OK, what's that mean?

cleong110 commented 1 month ago

However, the main obstacle to leverage its success in video SLR is the different characteristics of the input signal. In NLP, the input word token is discrete and pre-defined with high semantics. In contrast, the video signal of sign language is continuous with the spatial and temporal dimensions. This signal is quite low-level, making the original BERT objective not applicable. Besides, since the sign language video is mainly characterized by hand and body movements, the direct adoption of the BERT framework may not be optimal.

cleong110 commented 1 month ago

Basically, our framework contains two stages, i.e., self-supervised pre-training and downstream fine-tuning. During pre-training, we propose the masked unit modeling (MUM) pretext task to capture the context cues. The input hand or body unit embedding is randomly masked, and then the framework reconstructs the masked unit from this corrupted input sequence. Similar to BERT, self-reconstruction is optimized via the cross-entropy objective. To this end, we jointly tokenize the pose triplet unit as the pseudo label, which represents the gesture/body state. After pre-training, the pre-trained Transformer encoder is fine-tuned with the newly added prediction head to perform the SLR task.
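
As a minimal sketch of that two-stage recipe (my own PyTorch illustration, not the authors' code; the class name `PoseEncoder`, the two heads, the mean-pooling, and all shapes are assumptions):

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Transformer encoder with a pre-training head (predict codebook ids
    for masked units) and a fine-tuning head (predict the gloss class)."""
    def __init__(self, dim=256, num_codes=1024, num_glosses=500):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.to_code = nn.Linear(dim, num_codes)     # pre-training: pseudo-label logits
        self.to_gloss = nn.Linear(dim, num_glosses)  # fine-tuning: SLR logits

    def forward(self, x, pretrain=True):
        h = self.backbone(x)                         # (B, T, dim)
        if pretrain:
            return self.to_code(h)                   # per-unit logits over codebook ids
        return self.to_gloss(h.mean(dim=1))          # mean-pooled gloss logits

# Pre-training step: cross-entropy against the d-VAE pseudo-labels,
# evaluated only at the masked positions, as in BERT.
model = PoseEncoder()
x = torch.randn(2, 16, 256)               # embedded, already-masked unit sequence
pseudo = torch.randint(0, 1024, (2, 16))  # tokenizer output (pseudo-labels)
masked = torch.rand(2, 16) < 0.5          # which positions were masked
logits = model(x, pretrain=True)          # (2, 16, 1024)
loss = nn.functional.cross_entropy(logits[masked], pseudo[masked])
```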

cleong110 commented 1 month ago

"Pose Triplet Unit"?

cleong110 commented 1 month ago

Oh this seems important, they're using a d-VAE. c.f. https://github.com/sign-language-processing/sign-language-processing.github.io/pull/37

The tokenization provides pseudo labels for our designed pretext task during pre-training. We utilize a discrete variational autoencoder (d-VAE) to jointly convert the pose triplet unit into the triplet tokens (body, left and right hand), motivated by VQ-VAE (Van Den Oord, Vinyals et al. 2017).
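
The core VQ-VAE mechanism they build on is just nearest-neighbor lookup in a learned codebook. A minimal sketch of that lookup (my illustration; BEST's actual tokenizer couples the whole triplet and is trained jointly, which this does not show):

```python
import torch

def quantize(features, codebook):
    """features: (N, D) continuous pose-unit embeddings; codebook: (K, D)
    learned code vectors. Returns the nearest code index per feature (the
    discrete pseudo-label) and the corresponding snapped vectors."""
    dists = torch.cdist(features, codebook)  # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)            # (N,) discrete token ids
    return indices, codebook[indices]

codebook = torch.randn(1024, 256)            # K=1024 codes of dimension 256
hand_feats = torch.randn(32, 256)            # e.g. right-hand unit features
tokens, _ = quantize(hand_feats, codebook)   # tokens in [0, 1024)
```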

cleong110 commented 1 month ago

image

cleong110 commented 1 month ago

image

cleong110 commented 1 month ago

Our designed pretext task is MUM, which aims to exploit the hierarchical correlation context among internal and external triplet pose units. Given a pose sequence with a triplet pose unit of length $T$, we first randomly choose the $\alpha \cdot T$ frames to process the mask operation. For clarification, we define three parts of the pose triplet unit as $f^l_{\mathrm{sign},t}$, $f^r_{\mathrm{sign},t}$ and $f^b_{\mathrm{sign},t}$, respectively. If a unit is masked, a learnable masked token $e_{\mathrm{mask}} \in \mathbb{R}^{D_{\mathrm{part}}}$ is utilized to replace each part of the triplet unit with 50% probability. Therefore, the masked triplet unit includes three masking cases: only hand masked, only body masked and hand-body masked.
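
In code, my reading of that masking scheme is roughly the following sketch (`alpha`, the function name, and the independent 50% draws per part are my assumptions):

```python
import torch

def mask_triplets(left, right, body, e_mask, alpha=0.3):
    """left/right/body: (T, D) per-part unit embeddings; e_mask: (D,)
    learnable mask token. Picks alpha*T frames, then replaces each part
    of a picked frame with e_mask with 50% probability (in place)."""
    T = left.shape[0]
    frames = torch.randperm(T)[: int(alpha * T)]  # frames selected for masking
    for part in (left, right, body):
        hit = torch.rand(len(frames)) < 0.5       # mask this part w.p. 50%
        part[frames[hit]] = e_mask                # swap in the mask token
    return left, right, body

T, D = 16, 256
e_mask = torch.zeros(D)  # stand-in for the learnable token
l, r, b = (torch.randn(T, D) for _ in range(3))
l, r, b = mask_triplets(l, r, b, e_mask, alpha=0.5)
```

Since the paper lists exactly three masking cases (hand only, body only, hand-body), the real sampling presumably excludes the nothing-masked case; the independent draws above are a simplification.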

cleong110 commented 1 month ago

OK, in my own words now, real informally:

So the thing to know about BEST is that they wanted to do BERT-style masked language modeling, but, you know, BERT assumes you've already got discrete, semantically meaningful tokens. So they were, like, well we've got the left hand, right hand, and body (no face keypoints, mind you!), let's organize those into triplets and couple them together.

OK so then what? Well, van den Oord and Vinyals wrote "Neural discrete representation learning", which lets you take continuous signals and make them into discrete codes, with like a codebook and stuff. So they use one of those things to make a tokenizer: they put the coupled triplets into it and get discrete tokens out.

OK then what? Well then you've got the discrete tokens, and you mask hand, body, or both, and the transformer has to reconstruct the correct hand position, or body position, or whatever from surrounding context. "You shall know a ~word~ handshape from the company it keeps" I suppose. (And you use positional encodings to inform the model about temporal stuff; SignBERT+ does that too.) And then they tried it on isolated SLR and it seemed to work pretty well.

cleong110 commented 1 month ago

Also let's do a quick compare/contrast...

SignBERT+ results: [table omitted]

BEST results: [table omitted]

WINNER: SignBERT+

cleong110 commented 1 month ago

WLASL

SignBERT+: [table omitted]

BEST: [table omitted]

WINNER: SignBERT+ again

cleong110 commented 1 month ago

SLR500:

SignBERT+: [table omitted]

BEST: [table omitted]

WINNER: TIE

cleong110 commented 1 month ago

OK, here's the transcript of the ChatGPT-assisted rewrite process: https://chatgpt.com/share/f42199e8-bd03-4ec3-9913-e5e832bb2885

My original summary was flawed in a number of ways.

@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST, a pretraining method based on masked modeling of pose sequences using a coupled tokenization scheme.
The method takes in pose triplet units (left hand, right hand, and upper-body with arms) as inputs.
The pose for each part of the triplet is tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Then masked modeling is employed: any or all of the three parts may be masked, e.g. left hand, or right hand, or body+hand, or all of them...
Unlike @hu2023SignBertPlus, they do not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints).
They validate their pretraining method on isolated sign recognition (ISR) (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR] and NMFs-CSL [@hu2021NMFAwareSLR]).
They experiment with both pose-to-gloss and video-to-gloss via fusion with I3D [@carreira2017quo].
Results on these datasets are SOTA compared to previous methods, and quite similar to those of SignBERT+ [@hu2023SignBertPlus].

Here's the first rewrite:

@Zhao2023BESTPretrainingSignLanguageRecognition introduced BEST, a pre-training method for sign language recognition based on masked modeling of pose sequences using a coupled tokenization scheme. 
The method uses pose triplet units, consisting of the left hand, right hand, and upper-body with arms, as inputs. 
Each part of the pose triplet is tokenized into discrete codes to facilitate the modeling process [@van_den_Oord_2017NeuralDiscreteRepresentationLearning]. 
Masked modeling is then employed, where any or all of the three parts may be masked, such as the left hand, right hand, or upper-body with hands. 
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame sequences ("clips") or sub-frame portions of a pose unit (joints). 
They validated their pre-training method on isolated sign recognition (ISR) datasets: MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
They experimented with both pose-to-gloss and video-to-gloss tasks, incorporating fusion with I3D [@carreira2017quo]. 
The results on these datasets demonstrated state-of-the-art (SOTA) performance compared to previous methods and were similar to those achieved by SignBERT+ [@hu2023SignBertPlus].

Here's the second:

Zhao et al. (2023) introduce BEST, a pretraining framework designed to enhance sign language recognition (SLR) by leveraging BERT's masked modeling approach and a novel coupled tokenization scheme. 
The method uses pose triplet units, which include the left hand, right hand, and upper-body (including arms), as input data. 
Each component of the pose triplet unit is tokenized into discrete codes, following the approach by van den Oord et al. (2017). 
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked to learn the hierarchical correlation among them. 
Unlike SignBERT+ by Hu et al. (2023), the BEST framework does not mask multi-frame sequences ("clips") or sub-frame portions (joints) of a pose unit. 
The authors validate their pretraining method on isolated sign recognition (ISR) tasks using the MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR] datasets.
The authors conduct experiments on both pose-to-gloss and video-to-gloss tasks by integrating their method with the Inflated 3D ConvNet (I3D) [@carreira2017quo]. 
The results demonstrate state-of-the-art (SOTA) performance on all evaluated datasets, showing notable improvements over previous methods and comparable results to SignBERT+ [@hu2023SignBertPlus].
cleong110 commented 1 month ago

In addition the PR had various suggestions, e.g. https://github.com/sign-language-processing/sign-language-processing.github.io/pull/61#discussion_r1634494728

cleong110 commented 1 month ago
@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme.
This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning].
Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them.
Unlike @hu2023SignBertPlus, BEST does not mask multi-frame pose sequences or individual joints. 
The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D [@carreira2017quo].
Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ [@hu2023SignBertPlus].
cleong110 commented 1 month ago

Updated: https://chatgpt.com/share/f42199e8-bd03-4ec3-9913-e5e832bb2885

cleong110 commented 1 month ago

Merged