cleong110 / sign-language-processing.github.io

Documentation and background of sign language processing

Towards Privacy-Aware Sign Language Translation at Scale #11

Closed cleong110 closed 5 months ago

cleong110 commented 5 months ago

Very interesting paper which does pretrain-then-finetune, with all the benefits that provides: essentially, less need for data/annotations in the target language/task.

Writing/style:

cleong110 commented 5 months ago

Previously I wrote this summary:

In Rust et al.'s 2024 work \cite{rustPrivacyAwareSignLanguage2024}, they propose a self-supervised method based on Masked Auto-encoding, as well as a new Linguistic-Supervised Pretraining, which makes no assumptions about model architecture. They use this in conjunction with a hierarchical transformer, pretrained on a number of large-scale sign language datasets including YouTube-ASL \cite{uthusYouTubeASLLargeScaleOpenDomain2023}, How2Sign \cite{duarteHow2SignLargeScaleMultimodal2021}, and a new dataset they release known as DailyMoth-70h. Results on How2Sign significantly improved over previous SOTA such as \cite{tarres_sign_2023}, \cite{uthusYouTubeASLLargeScaleOpenDomain2023}, and \cite{linGlossFreeEndtoEndSign2023}.

cleong110 commented 5 months ago

Citation: since it's not published yet, arXiv is the way to go.

@misc{rust2024PrivacyAwareSign,
      title={Towards Privacy-Aware Sign Language Translation at Scale}, 
      author={Phillip Rust and Bowen Shi and Skyler Wang and Necati Cihan Camgöz and Jean Maillard},
      year={2024},
      eprint={2402.09611},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
cleong110 commented 5 months ago

Let's build our prompt!


I am writing a summary of an academic paper. 

STYLE GUIDE
- **Citations**: Use the format `@authorYearKeyword` for inline citations, and `[@authorYearKeyword]` for citations wrapped in parentheses. To include multiple citations, use a semicolon (;) to separate them (e.g., "@authorYearKeyword;@authorYearKeyword"). See the example sentence after this list.
- **Background & Related Work**: Use simple past tense to describe previous work (e.g., "@authorYearKeyword used...").
- **Abbreviations**: Define abbreviations in parentheses after the full term (e.g., Langue des Signes Française (LSF)).
- **Percentages**: Use the percent sign (%) with no space between the number and the sign (e.g., 95%).
- **Spacing**: Use a single space after periods and commas.
- **Hyphenation**: Use hyphens (-) for compound adjectives (e.g., video-to-pose).
- **Lists**: Use "-" for list items, followed by a space.
- **Code**: Use backticks (`) for inline code, and triple backticks (```) for code blocks.
- **Numbers**: Spell out numbers less than 10, and use numerals for 10 and greater.
- **Contractions**: Avoid contractions (e.g., use "do not" instead of "don't").
- **Compound Words**: Use a forward slash (/) to separate alternative compound words (e.g., 2D / 3D).
- **Phrasing**: Prefer active voice over passive voice (e.g., "The authors used..." instead of "The work was used by the authors...").
- **Structure**: Present information in a logical order.
- **Capitalization**: Capitalize the first word of a sentence, and proper nouns.
- **Emphasis**: Use italics for emphasis by wrapping a word with asterisks (e.g., *emphasis*).
- **Quote marks**: Use double quotes (").
- **Paragraphs**: When a subsection header starts with ######, add "{-}" to the end of the subsection title to indicate a new paragraph. If it starts with #, ##, ###, ####, or ##### do not add the "{-}".
- **Mathematics**: Use LaTeX math notation (e.g., $x^2$) wrapped in dollar signs ($).
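For illustration, a sentence following these rules (invented for this example, not taken from the paper) might be formatted as:

```markdown
@authorYearKeyword used a video-to-pose pipeline on Langue des Signes Française (LSF) data and reported 95% accuracy on two of the 12 test conditions [@authorYearKeyword;@otherAuthorYearKeyword].
```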

All right, here is information about the paper I am trying to summarize:

Paper Title: "Towards Privacy-Aware Sign Language Translation at Scale"

Abstract: 
"A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development of SLT technologies should account for. In this work, we propose a two-stage framework for privacy-aware SLT at scale that addresses both of these issues. We introduce SSVP-SLT, which leverages self-supervised video pretraining on anonymized and unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance on the How2Sign dataset, outperforming the strongest respective baselines by over 3 BLEU-4. Based on controlled experiments, we further discuss the advantages and limitations of self-supervised pretraining and anonymization via facial obfuscation for SLT."

Full Text: see attached PDF

My first version of the summary:  
"@rust2024PrivacyAwareSign introduce a privacy-aware method for sign language translation at scale which they call Self Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). 
SSVP-SLT is a two-stage method: they first pretrain a vision transformer [@ryali2023HieraVisionTransformer] with a self-supervised task on large unannotated video datasets [@dataset:uthus2023YoutubeASL;@dataset:duarte2020how2sign]. 
In the second stage they freeze their vision model and project its outputs into a multilingual LLM (T5; @raffel2020T5Transformer), which they finetune for translation on the How2Sign dataset [@dataset:duarte2020how2sign]. 
They address privacy concerns by face-blurring during training. 
They release their pretrained vision model, SignHiera, based on a Hiera vision transformer [@ryali2023HieraVisionTransformer]. 
In addition they release a new dataset they call DailyMoth-70h, containing video data from the Daily Moth, a Deaf News site. 
The model achieves state-of-the-art results on the How2Sign dataset [@dataset:duarte2020how2sign]."

Based on what I have provided, can you rewrite my first version of the summary to be more concise and professional? Please provide 3 alternative rewrites, and explain your suggested changes, as well as any issues with writing quality or inaccuracy in my original summary. Be sure the summaries you provide are accurate to the figure and the abstract. If I have missed a key contribution from the paper please note that and suggest additions. If something is not clear, request clarification and I can provide additional snippets. Please cite your sources for important details, e.g. "from the abstract" or "based on the full text". My summary is in markdown syntax and contains citations to a BibTeX bibliography; the citations begin with "@". Please use the same citation style.

In addition, please follow the style guide I provided in each of the 3 rewrites.
cleong110 commented 5 months ago

Colin's Commentary: big kudos for mentioning how they calculate BLEU (SacreBLEU). Isn't blurring reversible? https://news.ycombinator.com/item?id=23371351
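As a side note on the BLEU point, here is a minimal sketch (toy data, not the paper's evaluation code) of how BLEU-4 can be computed with the `sacrebleu` library:

```python
import sacrebleu

# Toy data; in practice these would be the model's translations and the
# How2Sign reference captions.
hypotheses = ["the weather is nice today", "she goes to work by bus"]
references = ["the weather is nice today", "she takes the bus to work"]

# corpus_bleu takes a list of hypotheses and a list of reference streams
# (here a single stream, i.e. one reference per hypothesis).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU-4: {bleu.score:.2f}")  # sacrebleu reports 4-gram BLEU by default
```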

cleong110 commented 5 months ago

Branch: https://github.com/cleong110/sign-language-processing.github.io/tree/paper/rust2024PrivacyAwareSign

cleong110 commented 5 months ago

My second version of the summary, without any ChatGPT input: "@rust2024PrivacyAwareSign introduce a privacy-aware method for sign language translation at scale which they call Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). SSVP-SLT is a two-stage method: they first pretrain a vision transformer [@ryali2023HieraVisionTransformer] with a self-supervised task on large unannotated video datasets [@dataset:uthus2023YoutubeASL;@dataset:duarte2020how2sign]. In the second stage they freeze their vision model and project its outputs into a multilingual LLM (T5; @raffel2020T5Transformer), which they finetune for translation on the How2Sign dataset [@dataset:duarte2020how2sign]. They address privacy concerns by face-blurring during training. They release their pretrained vision model, SignHiera, based on a Hiera vision transformer [@ryali2023HieraVisionTransformer]. In addition they release a new dataset they call DailyMoth-70h, containing video data from the Daily Moth, a Deaf news site. The model achieves state-of-the-art results on the How2Sign dataset [@dataset:duarte2020how2sign]."
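Since the summary mentions face-blurring during training, here is a minimal sketch (not the authors' code; the paper's actual anonymization pipeline may differ) of frame-level facial obfuscation using OpenCV's Haar cascade detector and a Gaussian blur:

```python
import cv2

# Hypothetical helper, not from the paper: detect faces in a frame and blur them.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_faces(frame_bgr):
    """Return a copy of the frame with every detected face Gaussian-blurred."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    out = frame_bgr.copy()
    for (x, y, w, h) in faces:
        out[y:y + h, x:x + w] = cv2.GaussianBlur(out[y:y + h, x:x + w], (51, 51), 0)
    return out
```

Whether Gaussian blurring is a sufficiently strong anonymization is exactly the reversibility question raised above.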

cleong110 commented 5 months ago

Conversation with ChatGPT: https://chatgpt.com/share/48910d3d-458a-4602-9bd2-25ea559818c9. It provided some suggestions.

cleong110 commented 5 months ago

Fixing a few issues (I'm actually not sure they will release models) and synthesizing a bit, we get:

@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). The first stage involves self-supervised pretraining of a Hiera vision transformer on large unannotated video datasets [@ryali2023HieraVisionTransformer; @dataset:uthus2023YoutubeASL]. In the second stage, the vision model's outputs are fed into a multilingual language model (T5) for finetuning on the How2Sign dataset [@raffel2020T5Transformer; @dataset:duarte2020how2sign]. To mitigate privacy risks, the framework employs facial obfuscation. Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from The Daily Moth. SSVP-SLT achieves state-of-the-art performance on How2Sign [@dataset:duarte2020how2sign].
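To make the two-stage structure concrete, here is a heavily simplified sketch of the training steps described above. All module names and dimensions are placeholders (the real SignHiera encoder and T5 model are large pretrained networks), so treat this as illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

FEAT_DIM, LM_DIM = 512, 768  # illustrative sizes, not the paper's


class VideoEncoder(nn.Module):
    """Toy stand-in for the Hiera-style video encoder (SignHiera)."""

    def __init__(self, frame_dim=1024):
        super().__init__()
        self.proj = nn.Linear(frame_dim, FEAT_DIM)

    def forward(self, frames):            # frames: (batch, time, frame_dim)
        return self.proj(frames)          # (batch, time, FEAT_DIM)


def mae_pretrain_step(encoder, decoder, frames, mask_ratio=0.75):
    """Stage 1 sketch: mask most frames, reconstruct them, score masked positions."""
    mask = torch.rand(frames.shape[:2], device=frames.device) < mask_ratio
    visible = frames * (~mask).unsqueeze(-1).float()
    recon = decoder(encoder(visible))            # decoder maps FEAT_DIM back to frame_dim
    return ((recon - frames) ** 2)[mask].mean()  # loss only on masked frames


def slt_finetune_step(encoder, projector, lm, frames, target_ids):
    """Stage 2 sketch: freeze the video encoder, project into the LM, train on captions."""
    with torch.no_grad():                        # the pretrained encoder stays frozen
        feats = encoder(frames)
    lm_inputs = projector(feats)                 # (batch, time, LM_DIM)
    # `lm` is assumed to be a seq2seq model such as HuggingFace's
    # T5ForConditionalGeneration, whose forward accepts inputs_embeds and labels.
    return lm(inputs_embeds=lm_inputs, labels=target_ids).loss
```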

cleong110 commented 5 months ago

Merged!