Update transformers requirement from <4.44.0,>=4.38.0 to >=4.38.0,<4.46.0

Updates the requirements on transformers to permit the latest version.

Release notes

Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker

Qwen2-VL

The Qwen2-VL is a major update from the previous Qwen-VL by the Qwen team.

An extract from the Qwen2-VL blogpost available here is as follows:

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

support qwen2-vl by @simonJJJ in #32318

Qwen2-Audio

The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

They introduce two distinct audio interaction modes:

voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input

audio analysis: users could provide audio and text instructions for analysis during the interaction

Add Qwen2-Audio by @faychu in #32137

OLMoE

OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

Add OLMoE by @Muennighoff in #32406

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of SigLIP vision encoder and a Qwen2 language backbone. The images are processed with anyres-9 technique where the image is split into 9 patches to better process high resolution images and capture as much details as possible. However, videos are pooled to a total sequence length of 196 tokens each frame for more memory efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B and achieves remarkable performance on benchmark evaluations.

... (truncated)

Commits

2ef31de Release: v4.45.0
19d58d3 Add MLLama (#33703)
94f18cf Add OmDet-Turbo (#31843)
ade9e0f Corrected max number for bf16 in transformer/docs (#33658)
196d35c Add AdEMAMix optimizer (#33682)
61e98cb Add SDPA support for M2M100 (#33309)
68049b1 Fix Megatron-LM tokenizer path (#33344)
574a9e1 HFQuantizer implementation for compressed-tensors library (#31704)
7e638ef fix code quality after merge
06e27e3 [Pixtral] Improve docs, rename model (#33491)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

caikit / caikit-nlp

Update transformers requirement from <4.44.0,>=4.38.0 to >=4.38.0,<4.46.0 #391

Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

Qwen2-VL

Qwen2-Audio

OLMoE

Llava Onevision