Bump transformers from 4.33.1 to 4.37.1

Bumps transformers from 4.33.1 to 4.37.1.
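To confirm that an environment actually picked up this bump, a version check along the following lines may help. This is a minimal sketch: the helper names `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical, and the parser assumes plain `X.Y.Z` release strings without pre-release or dev suffixes.

```python
from importlib import metadata

MINIMUM = "4.37.1"  # target version of this bump


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum=MINIMUM):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_transformers_version():
    """Return the installed transformers version, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

Calling `meets_minimum(installed_transformers_version())` after a `None` check would then tell you whether the environment is on 4.37.1 or later.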

Release notes

Patch release: v4.37.1

A patch release to resolve import errors from removed custom types in generation utils

Add back in generation types #28681

v4.37 Qwen2, Phi-2, SigLIP, ViP-LLaVA, FastSpeech2Conformer, 4-bit serialization, Whisper longform generation

Model releases

Qwen2

Qwen2 is the new series of large language models from the Qwen team. It follows the earlier Qwen series, which included Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, and others.

Qwen2 is a series of decoder language models in several sizes. For each size, we release the base language model and the aligned chat model. The models are based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, and a mixture of sliding window attention and full attention. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Add qwen2 by @JustinLin610 in #28436
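The difference between sliding window attention and full attention mentioned above can be illustrated with attention masks. The following is an illustrative sketch only, not Qwen2's implementation: the function names are hypothetical, and the exact window convention (whether the current position counts toward the window) is an assumption.

```python
def sliding_window_causal_mask(seq_len, window):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed convention: a position attends to itself and to the
    `window - 1` positions immediately before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def full_causal_mask(seq_len):
    """Full causal attention is the special case where the window spans the sequence."""
    return sliding_window_causal_mask(seq_len, window=seq_len)
```

With `window=2` and four tokens, the last query can see only the last two keys, whereas the full causal mask lets it see all four.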

Phi-2

Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.

[Phi2] Add support for phi2 models by @susnato in #28211

[Phi] Extend implementation to use GQA/MQA. by @gugarosa in #28163

update docs to add the phi-2 example by @susnato in #28392

Fixes default value of softmax_scale in PhiFlashAttention2. by @gugarosa in #28537
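Since Phi-2 no longer needs custom code, it should be loadable through the standard auto classes. The sketch below assumes the Hub checkpoint id `microsoft/phi-2` and a PyTorch install; the wrapper name `generate_with_phi2` is hypothetical, and the function downloads the model weights when called.

```python
def generate_with_phi2(prompt, max_new_tokens=32, model_id="microsoft/phi-2"):
    """Generate a continuation of `prompt` with Phi-2 using the built-in classes."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # No trust_remote_code flag is passed: the built-in implementation is assumed.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_with_phi2("def fibonacci(n):")` would return the prompt followed by the model's continuation.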

SigLIP

The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.

Add SigLIP by @NielsRogge in #26522

[SigLIP] Don't pad by default by @NielsRogge in #28578
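The pairwise sigmoid loss described above treats every image-text pair as an independent binary classification, rather than normalizing over the batch with a softmax as CLIP does. The following is a minimal pure-Python sketch, not the library's implementation: the function names are hypothetical, the embeddings are assumed to be L2-normalized, and the default `temperature` and `bias` values are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Average over images of the summed binary log-losses across all pairs.

    Matching pairs (i == j) get label +1; every other pair gets label -1.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / n
```

Because each pair contributes its own term, the loss does not require a batch-wide normalization, which is the main contrast with the softmax-based contrastive loss.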

ViP-LLaVA

The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.

VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.

Adds VIP-llava to transformers by @younesbelkada in #27932

Fix Vip-llava docs by @younesbelkada in #28085

FastSpeech2Conformer

The FastSpeech2Conformer model was proposed in the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.

FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds on FastSpeech, improving training speed, inference speed, and voice quality. It consists of a variance adapter; duration, energy, and pitch predictors; and waveform and mel-spectrogram decoders.

Add FastSpeech2Conformer by @connor-henderson in #23439
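One way a non-autoregressive TTS model can use a duration predictor is length regulation: each input position is repeated according to its predicted duration, so the whole output sequence can be produced in one pass. The sketch below illustrates that general idea under stated assumptions; `length_regulate` is a hypothetical helper, not the FastSpeech2Conformer code in transformers, and durations are assumed to be non-negative frame counts.

```python
def length_regulate(hidden_states, durations):
    """Repeat each input frame `durations[i]` times along the time axis."""
    if len(hidden_states) != len(durations):
        raise ValueError("one duration is required per input position")
    expanded = []
    for state, duration in zip(hidden_states, durations):
        # Negative predictions are clamped to zero repetitions.
        expanded.extend([state] * max(int(duration), 0))
    return expanded
```

For example, three phoneme states with durations `[2, 0, 3]` expand to five output frames, with the second phoneme dropped.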

Wav2Vec2-BERT

... (truncated)

Commits

d02d006 Release: v4.37.1
b102ab2 Add back in generation types (#28681)
8e3e145 [GPTNeoX] Fix BC issue with 4.36 (#28602)
344943b Fix _speculative_sampling implementation (#28508)
5fc3e60 [SigLIP] Don't pad by default (#28578)
5ee9fcb Fix wrong xpu device in DistributedType.MULTI_XPU mode (#28386)
e156abd [Whisper] Finalize batched SOTA long-form generation (#27658)
a485e46 Add w2v2bert to pipeline (#28585)
d381d85 Release: v4.37.0
db9a7e9 Don't save processor_config.json if a processor has no extra attribute (#2...
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

String-sg / ai-starter-kit

Bump transformers from 4.33.1 to 4.37.1 #71
