Got started on "LLMs are Good Sign Language Translators"
[x] Pull and merge master first!
[x] Make a new branch ("You should always branch out from master"): paper/gongLLMsAreGood2024
[x] Search for the correct citation on Semantic Scholar
[x] Add citation to references.bib. If it is a dataset, prepend the key with dataset:. Exclude wordy abstracts (the Better BibTeX extension for Zotero can omit fields like the abstract).
[x] Write a summary and add it to the appropriate section in index.md.
[x] Make sure the citation keys in index.md match those in references.bib (see the sketch after this list).
[x] ChatGPT 3.5 can suggest rewrites and improve writing.
[x] Add a newline after each sentence in a paragraph. It still renders as one paragraph, but it makes diffs and merges easier.
[x] Check if acronyms are explained
[x] Make a PR from the branch on my fork to master on the source repo
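For the key-matching step above, here is a quick sanity-check sketch. It is not part of the repo, and it assumes pandoc-style @key citations in index.md, which may not match the repo's actual citation syntax:

```python
# check_citation_keys.py -- illustrative helper, not part of the repo.
# Assumes pandoc-style citations (@key or [@key]) in index.md; adjust the
# regexes if the repo uses a different citation or BibTeX key convention.
import re
from pathlib import Path

def bib_keys(path="references.bib"):
    """Citation keys from entries such as @inproceedings{gongLLMsAreGood2024, ...}."""
    text = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"@\w+\{([^,\s]+)\s*,", text))

def cited_keys(path="index.md"):
    """Keys cited in the markdown, e.g. @gongLLMsAreGood2024 (rough regex, may over-match)."""
    text = Path(path).read_text(encoding="utf-8")
    return set(re.findall(r"@([A-Za-z][\w:.-]*\w)", text))

if __name__ == "__main__":
    missing = cited_keys() - bib_keys()
    unused = bib_keys() - cited_keys()
    print("Cited in index.md but missing from references.bib:", sorted(missing))
    print("In references.bib but never cited:", sorted(unused))
```

Run from the repo root; it prints keys cited in index.md that have no .bib entry, and .bib entries that are never cited.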
Summary rewrite process:
I prompted ChatGPT with the following prompt:
I am writing a summary of an academic paper. Based on what I have provided, that is the abstract, figure 1, figure 1's caption, a snippet from the paper, and my summary, can you rewrite the summary to be more concise and professional? Please provide 3 alternative rewrites, and explain your suggested changes, as well as any issues with writing quality or inaccuracy in my original summary. Be sure the summaries you provide are accurate to the figure and the abstract. If something is not clear, request clarification and I can provide additional snippets.
I also provided a screenshot of Figure 1, as well as a snippet regarding "language-like representations".
Then, I also provided the entire text of the Introduction and Conclusion sections and asked for another rewrite:
Thank you! These are much improved. I now also provide the Introduction and Conclusion sections. Please rewrite versions 1, 2, and 3 again in light of this information, explaining for each of them how the summaries are improved.
ChatGPT provided me with these three options:
Version 1:
This paper addresses the challenge of Sign Language Translation (SLT) by leveraging large language models (LLMs). The proposed SignLLM framework translates sign videos into a format compatible with LLMs by imparting language-like characteristics to the sign videos. SignLLM comprises two key modules: (1) The Vector-Quantized Visual Sign (VQ-Sign) module, which converts sign videos into discrete character-level tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which reorganizes these tokens into word-level representations using optimal transport techniques. A sign-text alignment loss further refines the semantic compatibility between sign and text tokens. Evaluations on RWTH-PHOENIX-2014T and CSL-Daily datasets show that SignLLM achieves state-of-the-art performance without relying on gloss annotations.
Improvements:
Contextual Introduction: Including the challenge of SLT and the role of sign languages provides a better foundation.
Detailed Methodology: Clarifying how SignLLM transforms sign videos into LLM-friendly formats enhances understanding.
Results Emphasis: Highlighting the empirical results underscores the framework's effectiveness.
Version 2:
The SignLLM framework is proposed to enhance Sign Language Translation (SLT) by utilizing the capabilities of large language models (LLMs). It transforms sign videos into discrete, language-like representations to improve compatibility with LLMs. The framework consists of two modules: (1) The Vector-Quantized Visual Sign (VQ-Sign) module, which quantizes sign videos into character-level tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which converts these tokens into word-level representations via optimal transport. A sign-text alignment loss is used to narrow the semantic gap between sign and text tokens. Experiments on RWTH-PHOENIX-2014T and CSL-Daily benchmarks demonstrate that SignLLM achieves state-of-the-art results without gloss annotations.
Improvements:
Clear Motivation: Outlining the motivation from the introduction emphasizes the importance of the task.
Module Functions: Detailing the function of each module provides clearer insight into the framework's operation.
Performance Highlight: Stressing the state-of-the-art results validates the effectiveness of the approach.
Version 3:
The paper introduces SignLLM, a novel framework for Sign Language Translation (SLT) that leverages the strengths of large language models (LLMs). SignLLM converts sign videos into language-like representations compatible with LLMs through two modules: (1) The Vector-Quantized Visual Sign (VQ-Sign) module, which translates sign videos into discrete character-level tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which restructures these tokens into word-level representations using optimal transport techniques. A sign-text alignment loss further enhances the semantic alignment between sign and text tokens. The framework achieves state-of-the-art results on RWTH-PHOENIX-2014T and CSL-Daily datasets without relying on gloss annotations.
Improvements:
Comprehensive Background: Introducing the problem and motivation from the introduction sets a solid stage for the summary.
Methodology Clarity: Explaining the transformation process in detail makes the methodology more understandable.
Empirical Validation: Emphasizing the empirical success on benchmarks demonstrates the practical impact of the framework.
After which I wondered: OK, we've basically restated the paper / re-summarized it a couple of times. What is the value-add of this over just copy-pasting the abstract? What can I do to improve the overall lit review? Perhaps the "connective tissue": making connections with other works, comparing and contrasting between them, tracing the lineage and chain of thought from work to work. For this paper, the key related works would probably be SLTUNET, "Gloss Attention for Gloss-free Sign Language Translation" (2023), and "Gloss-Free Sign Language Translation: Improving from Visual-Language Pretraining" (2023), aka GFSLT-VLP. "TS-SLT" is "Two-Stream Network for Sign Language Recognition and Translation", which they cite as inspiration. The concept of taking nonlinear sign language and discretizing it into tokens that just "play nice" with LLMs, letting you use all of their pre-existing knowledge, is intriguing. I'd be curious to learn more about the limitations.
I think I will go with Version 3 above as a start.
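To pin down for myself what "converts sign videos into discrete character-level tokens" means mechanically, here is a toy vector-quantization sketch. This is purely illustrative and not the paper's VQ-Sign implementation; the frame features, codebook size, and shapes are all made up.

```python
# Toy illustration of vector quantization (the general idea behind VQ-Sign),
# NOT the paper's implementation: features, codebook size, and shapes are invented.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-frame visual features extracted from a sign video.
num_frames, feat_dim = 16, 64
frame_features = rng.normal(size=(num_frames, feat_dim))

# A learned codebook would come from training; here it is random.
codebook_size = 32  # the "character-level" vocabulary size
codebook = rng.normal(size=(codebook_size, feat_dim))

# Quantize: each frame feature is replaced by the index of its nearest codebook
# entry, yielding a discrete token sequence an LLM-style model could consume.
distances = np.linalg.norm(frame_features[:, None, :] - codebook[None, :, :], axis=-1)
tokens = distances.argmin(axis=1)

print("Discrete 'character-level' token sequence:", tokens.tolist())
```

In the actual framework, the CRA module would then regroup these character-level tokens into word-level units via optimal transport, which is where the "language-like" structure comes from; the toy above stops at the quantization step.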
Also added the metric results on Papers with Code. No code available though, sorry.
PR: https://github.com/sign-language-processing/sign-language-processing.github.io/pull/39