cleong110 / sign-language-processing.github.io

Documentation and background of sign language processing

SignBERT+ (and SignBERT) #14

Closed: cleong110 closed this issue 3 weeks ago.

cleong110 commented 1 month ago


Given that SignBERT+ is a direct "sequel" to SignBERT, I think it could be good to cover them in one PR.

- SignBERT+: https://ieeexplore.ieee.org/document/10109128
- SignBERT: https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

Checklist:

- PR:
- Writing/style:
- Additional:

cleong110 commented 1 month ago

Here's what the authors have to say about the difference between SignBERT and SignBERT+:

"This work is an extension of the conference paper [5] with improvement in a number of aspects. 1) Considering the characteristics of sign language, we further introduce spatial-temporal global position encoding into embedding, along with the masked clip modeling for modeling temporal dynamics. Those new techniques further bring a notable performance gain. 2) We extend the original framework to two more downstream tasks in video-based sign language understanding, i.e., continuous SLR and SLT." (Hu et al., 2023, p. 2)

cleong110 commented 1 month ago

OK, what is the key thing to know about these two papers? Self-supervised pretraining with a sign-language-specific prior, basically. They are incorporating domain knowledge.

tl;dr: Self-supervised pose-sequence pretraining designed specifically for sign language processing (SLP). You can then take that pretrained encoder and finetune it on downstream tasks like isolated SLR, continuous SLR, or SLT.

They do try it out on all three, including Sign2Text.

They attribute "S2T setting" to N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7784–7793.
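To make that pretrain-then-finetune recipe concrete, here is a minimal sketch (my own illustration, not the authors' released code): a generic transformer encoder stands in for the pretrained SignBERT+ encoder, and a hypothetical classification head is attached for isolated SLR. All names and sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained pose encoder (not the real SignBERT+ weights).
embed_dim = 256
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=4,
)

class IsolatedSLRHead(nn.Module):
    """Finetuning setup: reuse the pretrained encoder, add a simple prediction head."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        features = self.encoder(tokens)   # (batch, frames, embed_dim)
        pooled = features.mean(dim=1)     # temporal average pooling
        return self.classifier(pooled)    # gloss logits, (batch, num_classes)

# e.g., a 2000-class isolated SLR vocabulary (WLASL-2000-sized, as an example).
model = IsolatedSLRHead(pretrained_encoder, embed_dim, num_classes=2000)
logits = model(torch.randn(2, 30, embed_dim))  # 2 clips of 30 pose tokens each
```

For continuous SLR or SLT you would presumably swap the pooling and classifier for a CTC or sequence-to-sequence head, in line with the paper's "simple yet effective prediction heads" for downstream tasks.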

cleong110 commented 1 month ago

Inputs: 2D pose sequences from MMPose, 133 keypoints per frame. Outputs: embeddings, basically.
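A quick sketch of the shapes that implies (my own illustration; the exact keypoint layout, token construction, and embedding size are assumptions, not taken from the paper's code):

```python
import torch
import torch.nn as nn

# Assumed shapes only; the keypoint layout and embedding size are illustrative.
T, K, C = 30, 133, 2                    # frames, whole-body keypoints, (x, y)
pose_sequence = torch.randn(T, K, C)    # what the pose estimator hands over

# One frame's keypoints -> one "visual token" fed to the encoder.
embed_dim = 256
to_token = nn.Linear(K * C, embed_dim)
tokens = to_token(pose_sequence.flatten(1))  # (T, embed_dim)

# The pretrained encoder then maps these tokens to contextual embeddings of the
# same length: (T, embed_dim) in, (T, embed_dim) out.
```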

cleong110 commented 1 month ago

Datasets:

"During the pre-training stage, the utilized data includes the training data from all aforementioned sign datasets, along with other collected data from [84], [85]. In total, the pre-training data volume is 230,246 videos."

[84] H. Hu, W. Zhou, J. Pu, and H. Li, “Global-local enhancement network for NMFs-aware sign language recognition,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 17, no. 3, pp. 1–18, 2021.

[85] A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, and X. Giro-i Nieto, “How2sign: a large-scale multimodal dataset for continuous american sign language,” in IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2735–2744.

S. Yuan, Q. Ye, G. Garcia-Hernando, and T.-K. Kim, “The 2017 hands in the million challenge on 3D hand pose estimation,” arXiv, pp. 1–7, 2017.

cleong110 commented 1 month ago

Official citation from IEEE; I am using `hu2023SignBertPlus` as the key.

@ARTICLE{10109128,
  author={Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Li, Houqiang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding}, 
  year={2023},
  volume={45},
  number={9},
  pages={11221-11239},
  keywords={Task analysis;Assistive technologies;Gesture recognition;Visualization;Bit error rate;Transformers;Hidden Markov models;Self-supervised pre-training;masked modeling strategies;model-aware hand prior;sign language understanding},
  doi={10.1109/TPAMI.2023.3269220}}

cleong110 commented 1 month ago

Official citation for the NMFs-CSL dataset, but using our normal key style:

@article{hu2021NMFAwareSLR,
    author = {Hu, Hezhen and Zhou, Wengang and Pu, Junfu and Li, Houqiang},
    title = {Global-Local Enhancement Network for NMF-Aware Sign Language Recognition},
    year = {2021},
    issue_date = {August 2021},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    volume = {17},
    number = {3},
    issn = {1551-6857},
    url = {https://doi.org/10.1145/3436754},
    doi = {10.1145/3436754},
    journal = {ACM Trans. Multimedia Comput. Commun. Appl.},
    month = {jul},
    articleno = {80},
    numpages = {19}
}

cleong110 commented 1 month ago

Looking for the official citation for HANDS17:

Also, HANDS2019 is a thing.

cleong110 commented 1 month ago

Oh, and here's the official citation for SignBERT, taken from https://openaccess.thecvf.com/content/ICCV2021/html/Hu_SignBERT_Pre-Training_of_Hand-Model-Aware_Representation_for_Sign_Language_Recognition_ICCV_2021_paper.html

@InProceedings{Hu_2021_ICCV,
    author    = {Hu, Hezhen and Zhao, Weichao and Zhou, Wengang and Wang, Yuechen and Li, Houqiang},
    title     = {SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {11087-11096}
}

I will use `hu2021SignBert` as the key.

cleong110 commented 1 month ago


The pretraining strategy is in section 3.2. They randomly pick some portion of the pose tokens and apply one of the multi-level masked modeling strategies: joint masking, frame masking, or clip masking.

What's a "token"?

cleong110 commented 1 month ago

Apparently they use the MANO hand model in the decoder? A "hand-model-aware decoder", they say, and they cite https://dl.acm.org/doi/abs/10.1145/3130800.3130883

cleong110 commented 1 month ago

As far as I can tell, they treat each pose in the sequence as a token. So if the pose estimation gives them 30 poses for 30 frames, that is 30 tokens.
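To make the token idea and the joint / frame / clip masking concrete, here is a rough sketch (my own illustration; the masking probabilities and clip length are made up, and the real method handles hand joints with more care):

```python
import torch

def mask_pose_tokens(poses, p_joint=0.1, p_frame=0.05, p_clip=0.05, clip_len=4):
    """Apply joint-, frame-, and clip-level masking to a pose-token sequence.

    poses: (T, K, C) tensor; each of the T frames is one token with K keypoints.
    Returns a masked copy plus a boolean mask of what was hidden.
    """
    T, K, _ = poses.shape
    masked = poses.clone()
    hidden = torch.zeros(T, K, dtype=torch.bool)

    # Joint-level masking: drop individual keypoints (mimics missed detections).
    hidden |= torch.rand(T, K) < p_joint

    # Frame-level masking: hide entire frames.
    hidden |= (torch.rand(T) < p_frame).unsqueeze(1)

    # Clip-level masking: hide short contiguous windows of frames.
    for start in range(0, T, clip_len):
        if torch.rand(1).item() < p_clip:
            hidden[start:start + clip_len] = True

    masked[hidden] = 0.0  # zero out hidden joints; the model must reconstruct them
    return masked, hidden

poses = torch.randn(30, 133, 2)      # 30 frames of poses -> 30 tokens
masked, hidden = mask_pose_tokens(poses)
print(hidden.float().mean().item())  # rough fraction of masked keypoints
```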

cleong110 commented 1 month ago

OK, I think it's time. Let's build our initial summary and prompt ChatGPT for help.

Here's my initial version, which I add to the "pose-to-text" section.

@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT). 
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder. 
They finetune and validate on a number of downstream tasks including:
- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR]
- Continuous SLR and SLT using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]

Results show state-of-the-art performance.

cleong110 commented 1 month ago

Building my prompt:

I am writing a summary of an academic paper. Based on what I have provided below, can you rewrite my first version of the summary to be more concise and professional? Please provide three alternative rewrites, and explain your suggested changes, as well as any issues with writing quality or inaccuracy in my original summary. Be sure the summaries you provide are accurate to the figure and the abstract. If I have missed a key contribution from the paper, please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets. Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTeX bibliography; the citations begin with "@". Please use the same citation style.

In addition, please follow the following style guide:

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword").
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).

All right, here is information about the paper I am trying to summarize:

Paper Title: "SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding"

Abstract: 
"Hand gesture serves as a crucial role during the expression of sign language. Current deep learning based methods for sign language understanding (SLU) are prone to over-fitting due to insufficient sign data resource and suffer limited interpretability. In this paper, we propose the first self-supervised pre-trainable SignBERT+ framework with model-aware hand prior incorporated. In our framework, the hand pose is regarded as a visual token, which is derived from an off-the-shelf detector. Each visual token is embedded with gesture state and spatial-temporal position encoding. To take full advantage of current sign data resource, we first perform self-supervised learning to model its statistics. To this end, we design multi-level masked modeling strategies (joint, frame and clip) to mimic common failure detection cases. Jointly with these masked modeling strategies, we incorporate model-aware hand prior to better capture hierarchical context over the sequence. After the pre-training, we carefully design simple yet effective prediction heads for downstream tasks. To validate the effectiveness of our framework, we perform extensive experiments on three main SLU tasks, involving isolated and continuous sign language recognition (SLR), and sign language translation (SLT). Experimental results demonstrate the effectiveness of our method, achieving new state-of-the-art performance with a notable gain."

Full Text: see attached PDF

My Summary:  
"@hu2023SignBertPlus introduce SignBERT+, a hand-model-aware self-supervised pretraining method which they validate on sign language recognition (SLR) and sign language translation (SLT).
Collecting over 230k videos from a number of datasets, they extract pose sequences using MMPose [@mmpose2020].
They then treat these pose sequences as sequences of visual tokens, and pretrain their encoder through masking of joints, frames, and short clips.
When embedding pose sequences they use temporal positional encodings.
They incorporate a statistical hand model [@romero2017MANOHandModel] to constrain their decoder.
They finetune and validate on a number of downstream tasks:

- isolated SLR on MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], and SLR500 [@huang2019attention3DCNNsSLR].
- Continuous SLR using RWTH-PHOENIX-Weather [@koller2015ContinuousSLR] and RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].
- SLT using RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural].

Results show state-of-the-art performance on these tasks."

All right, remember my initial instructions, please go ahead and provide me the requested concise, professional rewrite suggestions for my summary, with the requested explanations, citations, and following the style guide. In particular I feel that my summary lacks clarity on the "Hand-model-aware decoder".
cleong110 commented 1 month ago

Resulting ChatGPT conversation: https://chatgpt.com/share/1cf76e17-b778-49c4-9887-d12770fa922a. Main gist of the suggestions is to

cleong110 commented 1 month ago

Metrics:

cleong110 commented 1 month ago

OK, I think I get it about MANO and how that helps, based on Figure 2.

So basically it guides / hints the model during pretraining to reconstruct the masked poses more accurately: "Hey, I dropped some joints. But here's a statistical hand model of how human hands are in real life. Knowing that, can you reconstruct them properly?"
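Here is how I picture that, as a sketch only: instead of regressing joint coordinates freely, the decoder predicts low-dimensional parameters of a parametric hand model, and the joints are recovered through that model, so reconstructions stay anatomically plausible. `ParametricHand` below is a made-up stand-in, not the real MANO layer or the authors' decoder.

```python
import torch
import torch.nn as nn

class ParametricHand(nn.Module):
    """Made-up stand-in for a MANO-style hand model: maps low-dimensional pose /
    shape parameters to 21 joint positions through a fixed basis (the 'prior')."""

    def __init__(self, n_pose=45, n_shape=10, n_joints=21):
        super().__init__()
        # Frozen random basis standing in for the learned MANO blend shapes.
        self.register_buffer("basis", torch.randn(n_pose + n_shape, n_joints * 3) * 0.01)
        self.n_joints = n_joints

    def forward(self, params):                    # params: (batch, n_pose + n_shape)
        joints = params @ self.basis              # joints constrained by the hand prior
        return joints.view(-1, self.n_joints, 3)  # (batch, 21, 3)

class HandAwareDecoder(nn.Module):
    """Decoder head: predict hand-model parameters from a token embedding, then
    recover the joints through the hand model instead of regressing them freely."""

    def __init__(self, embed_dim=256, n_params=55):
        super().__init__()
        self.to_params = nn.Linear(embed_dim, n_params)
        self.hand = ParametricHand()

    def forward(self, token_embedding):           # (batch, embed_dim)
        params = self.to_params(token_embedding)
        return self.hand(params)                  # anatomically constrained joints

decoder = HandAwareDecoder()
reconstructed = decoder(torch.randn(8, 256))      # (8, 21, 3)
```

During pretraining, the reconstruction loss would then be computed between these constrained joints and the original (unmasked) detections.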

cleong110 commented 1 month ago

OK, rewriting/synthesizing...

@hu2023SignBertPlus introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) incorporating a hand-model-aware approach. They extract pose sequences from over 230k videos using MMPose [@mmpose2020], treating these as visual tokens embedded with temporal positional encodings. They pretrain using multi-level masked modeling (joints, frames, clips) and integrate a statistical hand model [@romero2017MANOHandModel] to enhance the decoder's accuracy and constrain its predictions for anatomical realism. Validation on isolated SLR (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR]), continuous SLR (RWTH-PHOENIX-Weather [@koller2015ContinuousSLR]), and SLT (RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]) demonstrates state-of-the-art performance.

cleong110 commented 3 weeks ago

merged