bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0
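
For context on what the project exposes, the sketch below shows how a served model is typically queried through the OpenAI-compatible endpoint using the official `openai` client. The base URL, port, and model id are illustrative assumptions, not details taken from this PR.

```python
# Minimal sketch: querying an OpenLLM deployment through its OpenAI-compatible API.
# The base_url, port, and model id below are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")  # local OpenLLM server
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```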

chore(deps): bump vllm from 0.4.0 to 0.4.1 in /openllm-python #969

Closed · dependabot[bot] closed this 4 months ago

dependabot[bot] commented 4 months ago

Bumps vllm from 0.4.0 to 0.4.1.
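
The exact file and pin style used in /openllm-python are not shown in this excerpt, so the following is only an illustrative sketch of what the bump amounts to in a requirements-style pin:

```diff
-vllm==0.4.0
+vllm==0.4.1
```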

Release notes

Sourced from vllm's releases.

v0.4.1

Highlights

Features

  • Support and enhance CommandR+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
  • Support private model registration, and update our support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option for using LM Format Enforcer for guided decoding (#3868); see the sketch after this list
  • Add option to optionally initialize the tokenizer and detokenizer (#3748)
  • Add option to load models using tensorizer (#3476)
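
As referenced in the guided-decoding bullet above, here is a hedged sketch of how the LM Format Enforcer backend is typically selected per request when vLLM serves the OpenAI-compatible API. The `extra_body` keys (`guided_json`, `guided_decoding_backend`) and the backend name follow vLLM's documented options and are assumptions, not details confirmed against the 0.4.1 changelog.

```python
# Hedged sketch: JSON-schema constrained decoding via vLLM's OpenAI-compatible server.
# Parameter names below are drawn from vLLM documentation and may differ in 0.4.1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="na")  # assumed vLLM server
schema = {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name a city, answering in JSON."}],
    extra_body={"guided_json": schema, "guided_decoding_backend": "lm-format-enforcer"},
)
print(response.choices[0].message.content)
```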

Enhancements

Hardware

  • Intel CPU inference backend is added (#3993, #3634)
  • AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

What's Changed

... (truncated)

Commits
  • 468d761 [Misc] Reduce supported Punica dtypes (#4304)
  • e4bf860 [CI][Build] change pynvml to nvidia-ml-py (#4302)
  • 91f50a6 [Core][Distributed] use cpu/gloo to initialize pynccl (#4248)
  • 79a268c [BUG] fixed fp8 conflict with aqlm (#4307)
  • eace8bf [Kernel] FP8 support for MoE kernel / Mixtral (#4244)
  • 1e8f425 [Bugfix][Frontend] Raise exception when file-like chat template fails to be o...
  • 2b7949c AQLM CUDA support (#3287)
  • 62b5166 [CI] Add ccache for wheel builds job (#4281)
  • d86285a [Core][Logging] Add last frame information for better debugging (#4278)
  • d87f39e [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286)
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
dependabot[bot] commented 4 months ago

Superseded by #974.