zhyncs opened 3 days ago
nit: We may also need to upgrade flash attention when we use torch 2.3.0
https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.9.post1
@grimoire said that torch 2.3.0 + triton 2.3.0 degrades performance compared to torch 2.2.2 + triton 2.2.0. PR #1499 shows throughput drops by about 8% when torch is upgraded to 2.3.0. That's why I decided not to upgrade it.
> torch 2.3.0 + triton 2.3.0 degrades performance compared to torch 2.2.2 + triton 2.2.0
Okay, I'll take a look and verify whether this issue still exists on the latest main branch.
triton 2.3.0 takes more time on the kernel launch path (checking the device/stream, generating the cache key, etc.). Models with more GPU computation per launch might suffer less from it.
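For a rough sense of that launch-path cost, one can time a deliberately tiny kernel in a tight loop so the Python-side launch overhead dominates. This is a minimal sketch (the kernel, sizes, and iteration count are arbitrary, not from this thread); running it under triton 2.2.0 and 2.3.0 gives a crude comparison:

```python
# Micro-benchmark for Triton's per-launch overhead (device/stream checks,
# cache-key generation, etc.). The kernel does almost no GPU work, so
# wall time per iteration is dominated by the launch path.
import time

import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)


x = torch.randn(1024, device='cuda')
y = torch.empty_like(x)
copy_kernel[(1,)](x, y, x.numel(), BLOCK=1024)  # warm-up / JIT compile
torch.cuda.synchronize()

iters = 10_000
start = time.perf_counter()
for _ in range(iters):
    copy_kernel[(1,)](x, y, x.numel(), BLOCK=1024)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'{elapsed / iters * 1e6:.1f} us per launch')
```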
Motivation
Current range: https://github.com/InternLM/lmdeploy/blob/a06174f836882d853d4eb18519c2245c2a7eae8c/requirements/runtime.txt#L16
vLLM's latest requirement: https://github.com/vllm-project/vllm/blob/515080ad2fd93cc8e363ff43b90a9df18cfd71ff/requirements-cuda.txt#L7
To install vLLM and LMDeploy in the same image, I upgraded torch to 2.3.0 and used the `--no-deps` flag when installing LMDeploy. To verify the impact of the torch upgrade on the performance of the LMDeploy PyTorch Engine, I conducted a simple benchmark.
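For reference, a minimal throughput probe along these lines can be written with lmdeploy's `pipeline` API. This is only a sketch, not the exact benchmark script used here; the model name, prompt set, and request count are placeholders:

```python
# Minimal throughput probe for the PyTorch Engine (sketch only; the model
# and workload below are placeholders, not the actual benchmark
# configuration behind the numbers in this issue).
import time

from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline

pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig())

prompts = ['Hi, please introduce yourself.'] * 64
gen_config = GenerationConfig(max_new_tokens=128)

start = time.perf_counter()
responses = pipe(prompts, gen_config=gen_config)
elapsed = time.perf_counter() - start

total_tokens = sum(r.generate_token_len for r in responses)
print(f'{total_tokens / elapsed:.1f} output tok/s')
```

Running the same script in two environments (torch 2.2.2 + triton 2.2.0 vs. torch 2.3.0 + triton 2.3.0) gives a rough apples-to-apples comparison.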
From the results, it can be seen that after updating to torch 2.3.0, the performance of PyTorch Engine is still within a reasonable range.
Could we consider expanding the supported torch version range in LMDeploy to include 2.3.0? @grimoire @lvhan028