cleong110 / sign-language-processing.github.io

Documentation and background of sign language processing

SignBERT+ (and SignBERT) #14

Closed: cleong110 closed this issue 3 weeks ago.

cleong110 commented 1 month ago


Given that SignBERT+ is a direct "sequel" to SignBERT, I think it could be good to cover them in one PR.

- SignBERT+: https://ieeexplore.ieee.org/document/10109128
- SignBERT: https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

Checklist:

- PR:
- Writing/style:
- Additional:

cleong110 commented 1 month ago

Here's what the authors have to say about the difference between SignBERT and SignBERT+:

"This work is an extension of the conference paper [5] with improvement in a number of aspects. 1) Considering the characteristics of sign language, we further introduce spatial-temporal global position encoding into embedding, along with the masked clip modeling for modeling temporal dynamics. Those new techniques further bring a notable performance gain. 2) We extend the original framework to two more downstream tasks in video-based sign language understanding, i.e., continuous SLR and SLT." (Hu et al., 2023, p. 2)

cleong110 commented 1 month ago

OK, what is the key thing to know about these two papers? Self-supervised pretraining with a sign-language-specific prior, basically. They are incorporating domain knowledge.

tl;dr: Self-supervised pose-sequence pretraining designed specifically for sign language processing (SLP). You can then take that pretrained encoder and finetune it on downstream tasks like isolated SLR, continuous SLR, or SLT.

They do try it out on all three, including Sign2Text.

They attribute "S2T setting" to N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7784–7793.
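To make that pretrain-then-finetune recipe concrete, here is a minimal sketch (my own illustration, not the authors' released code): a generic transformer encoder stands in for the pretrained SignBERT+ encoder, and a hypothetical classification head is attached for isolated SLR. All names and sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained pose encoder (not the real SignBERT+ weights).
embed_dim = 256
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=4,
)

class IsolatedSLRHead(nn.Module):
    """Finetuning setup: reuse the pretrained encoder, add a simple prediction head."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        features = self.encoder(tokens)   # (batch, frames, embed_dim)
        pooled = features.mean(dim=1)     # temporal average pooling
        return self.classifier(pooled)    # gloss logits, (batch, num_classes)

# e.g., a 2000-class isolated SLR vocabulary (WLASL-2000-sized, as an example).
model = IsolatedSLRHead(pretrained_encoder, embed_dim, num_classes=2000)
logits = model(torch.randn(2, 30, embed_dim))  # 2 clips of 30 pose tokens each
```

For continuous SLR or SLT you would presumably swap the pooling and classifier for a CTC or sequence-to-sequence head, in line with the paper's "simple yet effective prediction heads" for downstream tasks.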

cleong110 commented 1 month ago

Inputs: 2D pose sequences from MMPose, 133 keypoints per frame. Outputs: embeddings, basically.
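A quick sketch of the shapes that implies (my own illustration; the exact keypoint layout, token construction, and embedding size are assumptions, not taken from the paper's code):

```python
import torch
import torch.nn as nn

# Assumed shapes only; the keypoint layout and embedding size are illustrative.
T, K, C = 30, 133, 2                    # frames, whole-body keypoints, (x, y)
pose_sequence = torch.randn(T, K, C)    # what the pose estimator hands over

# One frame's keypoints -> one "visual token" fed to the encoder.
embed_dim = 256
to_token = nn.Linear(K * C, embed_dim)
tokens = to_token(pose_sequence.flatten(1))  # (T, embed_dim)

# The pretrained encoder then maps these tokens to contextual embeddings of the
# same length: (T, embed_dim) in, (T, embed_dim) out.
```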

cleong110 commented 1 month ago

Datasets:

"During the pre-training stage, the utilized data includes the training data from all aforementioned sign datasets, along with other collected data from [84], [85]. In total, the pre-training data volume is 230,246 videos."

[84] H. Hu, W. Zhou, J. Pu, and H. Li, “Global-local enhancement network for NMFs-aware sign language recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 3, pp. 1–18, 2021.

[85] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, and X. Giro-i Nieto, “How2sign: a large-scale multimodal dataset for continuous american sign language,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2735–2744.

S. Yuan, Q. Ye, G. Garcia-Hernando, and T.-K. Kim, “The 2017 hands in the million challenge on 3D hand pose estimation,” arXiv, pp. 1–7, 2017.

cleong110 commented 1 month ago

Official citation from IEEE; I am using `hu2023SignBertPlus` as the key.

@ARTICLE{10109128,
  author={Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Li, Houqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding}, 
  year={2023},
  volume={45},
  number={9},
  pages={11221-11239},
  keywords={Task analysis;Assistive technologies;Gesture recognition;Visualization;Bit error rate;Transformers;Hidden Markov models;Self-supervised pre-training;masked modeling strategies;model-aware hand prior;sign language understanding},
  doi={10.1109/TPAMI.2023.3269220}}

cleong110 commented 1 month ago

Official citation for the NMFs-CSL dataset, but using our normal key style:

@article{hu2021NMFAwareSLR,
    author = {Hu, Hezhen and Zhou, Wengang and Pu, Junfu and Li, Houqiang},
    title = {Global-Local Enhancement Network for NMF-Aware Sign Language Recognition},
    year = {2021},
    issue_date = {August 2021},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    volume = {17},
    number = {3},
    issn = {1551-6857},
    url = {https://doi.org/10.1145/3436754},
    doi = {10.1145/3436754},
    journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
    month = {jul},
    articleno = {80},
    numpages = {19}
}

cleong110 commented 1 month ago

Looking for the official citation for HANDS17:

Also, HANDS2019 is a thing.

cleong110 commented 1 month ago

Oh, and here's the official citation for SignBERT, taken from https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

@InProceedings{Hu_2021_ICCV,
    author    = {Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Wang, Yuechen and Li, Houqiang},
    title     = {SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {11087-11096}
}

I will use `hu2021SignBert` as the key.

cleong110 commented 1 month ago


The pretraining strategy is in section 3.2. They randomly pick some portion of the pose tokens and apply one of the multi-level masked modeling strategies: joint masking, frame masking, or clip masking.

What's a "token"?

cleong110 commented 1 month ago

Apparently they use the MANO hand model in the decoder? A "hand-model-aware decoder", they say, and they cite https://dl.acm.org/doi/abs/10.1145/3130800.3130883

cleong110 commented 1 month ago

As far as I can tell, they treat each pose in the sequence as a token. So if the pose estimation gives them 30 poses for 30 frames, that is 30 tokens.
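To make the token idea and the joint / frame / clip masking concrete, here is a rough sketch (my own illustration; the masking probabilities and clip length are made up, and the real method handles hand joints with more care):

```python
import torch

def mask_pose_tokens(poses, p_joint=0.1, p_frame=0.05, p_clip=0.05, clip_len=4):
    """Apply joint-, frame-, and clip-level masking to a pose-token sequence.

    poses: (T, K, C) tensor; each of the T frames is one token with K keypoints.
    Returns a masked copy plus a boolean mask of what was hidden.
    """
    T, K, _ = poses.shape
    masked = poses.clone()
    hidden = torch.zeros(T, K, dtype=torch.bool)

    # Joint-level masking: drop individual keypoints (mimics missed detections).
    hidden |= torch.rand(T, K) < p_joint

    # Frame-level masking: hide entire frames.
    hidden |= (torch.rand(T) < p_frame).unsqueeze(1)

    # Clip-level masking: hide short contiguous windows of frames.
    for start in range(0, T, clip_len):
        if torch.rand(1).item() < p_clip:
            hidden[start:start + clip_len] = True

    masked[hidden] = 0.0  # zero out hidden joints; the model must reconstruct them
    return masked, hidden

poses = torch.randn(30, 133, 2)      # 30 frames of poses -> 30 tokens
masked, hidden = mask_pose_tokens(poses)
print(hidden.float().mean().item())  # rough fraction of masked keypoints
```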

cleong110 commented 1 month ago

OK, I think it's time. Let's build our initial summary and prompt ChatGPT for help.

Here's my initial version, which I add to the "pose-to-text" section.

@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT). 
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder. 
They finetune and validate on a number of downstream tasks including:
- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR]
- Continuous SLR and SLT using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]

Results show state-of-the-art performance.

cleong110 commented 1 month ago

Building my prompt:

I am writing a summary of an academic paper. Based on what I have provided below, can you rewrite my first version of the summary to be more concise and professional? Please provide three alternative rewrites, and explain your suggested changes, as well as any issues with writing quality or inaccuracy in my original summary. Be sure the summaries you provide are accurate to the figure and the abstract. If I have missed a key contribution from the paper, please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets. Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTeX bibliography; the citations begin with "@". Please use the same citation style.

In addition, please follow the following style guide:

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword").
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).

All right, here is information about the paper I am trying to summarize:

Paper Title: "SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding"

Abstract: 
"Hand gesture serves as a crucial role during the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource and suffer limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resource, we first perform self-supervised learning to model its statistics. To this end, we design multi-level masked modeling strategies (joint, frame and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate model-aware hand prior to better capture hierarchical context over the sequence. After the pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR), and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain."

Full Text: see attached PDF

My Summary:  
"@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT).
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
When embedding pose sequences they use temporal positional encodings.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder.
They finetune and validate on a number of downstream tasks:

- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR].
- Continuous SLR using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].
- SLT using RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].

Results show state-of-the-art performance on these tasks."

All right, remember my initial instructions, please go ahead and provide me the requested concise, professional rewrite suggestions for my summary, with the requested explanations, citations, and following the style guide. In particular I feel that my summary lacks clarity on the "Hand-model-aware decoder".
cleong110 commented 1 month ago

Resulting ChatGPT conversation: https://chatgpt.com/share/1cf76e17-b778-49c4-9887-d12770fa922a. Main gist of the suggestions is to

cleong110 commented 1 month ago

Metrics:

cleong110 commented 1 month ago

OK, I think I get it about MANO and how that helps, based on Figure 2.

So basically it guides / hints the model during pretraining to reconstruct the masked poses more accurately: "Hey, I dropped some joints. But here's a statistical hand model of how human hands are in real life. Knowing that, can you reconstruct them properly?"
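Here is how I picture that, as a sketch only: instead of regressing joint coordinates freely, the decoder predicts low-dimensional parameters of a parametric hand model, and the joints are recovered through that model, so reconstructions stay anatomically plausible. `ParametricHand` below is a made-up stand-in, not the real MANO layer or the authors' decoder.

```python
import torch
import torch.nn as nn

class ParametricHand(nn.Module):
    """Made-up stand-in for a MANO-style hand model: maps low-dimensional pose /
    shape parameters to 21 joint positions through a fixed basis (the 'prior')."""

    def __init__(self, n_pose=45, n_shape=10, n_joints=21):
        super().__init__()
        # Frozen random basis standing in for the learned MANO blend shapes.
        self.register_buffer("basis", torch.randn(n_pose + n_shape, n_joints * 3) * 0.01)
        self.n_joints = n_joints

    def forward(self, params):                    # params: (batch, n_pose + n_shape)
        joints = params @ self.basis              # joints constrained by the hand prior
        return joints.view(-1, self.n_joints, 3)  # (batch, 21, 3)

class HandAwareDecoder(nn.Module):
    """Decoder head: predict hand-model parameters from a token embedding, then
    recover the joints through the hand model instead of regressing them freely."""

    def __init__(self, embed_dim=256, n_params=55):
        super().__init__()
        self.to_params = nn.Linear(embed_dim, n_params)
        self.hand = ParametricHand()

    def forward(self, token_embedding):           # (batch, embed_dim)
        params = self.to_params(token_embedding)
        return self.hand(params)                  # anatomically constrained joints

decoder = HandAwareDecoder()
reconstructed = decoder(torch.randn(8, 256))      # (8, 21, 3)
```

During pretraining, the reconstruction loss would then be computed between these constrained joints and the original (unmasked) detections.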

cleong110 commented 1 month ago

OK, rewriting/synthesizing...

@hu2023SignBertPlus introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) incorporating a hand-model-aware approach. They extract pose sequences from over 230k videos using MMPose [@mmpose2020], treating these as visual tokens embedded with temporal positional encodings. They pretrain using multi-level masked modeling (joints, frames, clips) and integrate a statistical hand model [@romero2017MANOHandModel] to enhance the decoder's accuracy and constrain its predictions for anatomical realism. Validation on isolated SLR (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR]), continuous SLR (RWTH-PHOENIX-Weather [@koller2015ContinuousSLR]), and SLT (RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]) demonstrates state-of-the-art performance.

cleong110 commented 3 weeks ago

merged