josephrocca opened 4 months ago
Hi. I currently do not have a 40G A100 on hand. Can I check if this issue can also be reproduced with an 80G one instead?
I also ran into the same problem when prefill n_token > 2048. After modifying the following file, it now works normally for me:
src/turbomind/utils/allocator.h
cc @irexyc
I forgot to mention that I am not running on an NVIDIA GPU. It's not convenient to disclose the specific hardware. I hope this information is helpful.
@zhyncs I just tested this, and yes, surprisingly* the bug is reproducible on an A100 80G too. I used the exact same commands as I mentioned above, and hit the same OOM error.
* I guessed that it would only be reproducible on ~48GB GPUs like the A40, L40, etc., due to 70B Llama 2 being a "tight fit". But it seems it's not related to the fraction of the GPU's VRAM taken by the model params.
Ok. I'll take a look today.
Hi @josephrocca, could you try this: https://github.com/zhyncs/lmdeploy-build/releases/tag/aa07f92
Hi, for some reason I got an immediate OOM without any requests being served. But I'm not sure if I did something wrong when installing the whl - I'm not a Python dev, and had to talk to Claude/ChatGPT about changing the filenames so that pip would accept it as a valid whl file.
Ideally there would be a Docker tag that I could test, since then I can just paste it in Runpod for several different machines within a few seconds and easily test them all. I think you didn't appreciate how inexperienced I am here - apologies :sweat_smile: I am a humble web developer.
I'm sorry for any inconvenience. If it's convenient, could you please let me know your CUDA version and Python version? Thanks.
I used openmmlab/lmdeploy:v0.5.0 (which has Python version 3.8 IIRC) on a 2x3090 Runpod machine, and I think nvidia-smi said CUDA version 12.4. But I downloaded the 11.8 nightly whl from the page you linked because IIRC this is the one used in openmmlab/lmdeploy:v0.5.0, which should work due to forwards-compatibility? And Claude AI told me to use pip with --force-reinstall to install the nightly whl over the original.
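Concretely, the rename-and-reinstall step was roughly this (a sketch from memory; the actual wheel filename comes from the release page, and the "+cu118+<hash>" part of it is what pip rejects):

# hypothetical filename following the release's naming pattern - rename it so
# pip accepts the version string, then force-reinstall over the bundled lmdeploy
mv lmdeploy-0.5.0+cu118+aa07f92-cp38-cp38-manylinux2014_x86_64.whl \
   lmdeploy-0.5.0-cp38-cp38-manylinux2014_x86_64.whl
pip install ./lmdeploy-0.5.0-cp38-cp38-manylinux2014_x86_64.whl --force-reinstall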
If it works fine for you, then I likely made some sort of mistake during install, so this issue can likely be safely closed, and I'll re-open if needed when testing the next released version.
pip3 install https://github.com/zhyncs/lmdeploy-build/releases/download/49208aa/lmdeploy-0.5.0+cu121+49208aa-cp38-cp38-manylinux2014_x86_64.whl --force-reinstall --no-deps
Could you try this? Thanks. And to eliminate environmental issues, you may consider starting a new Docker container.
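For example, something like this (just a sketch - adjust the GPU flags for your machine):

# start a clean container from the official v0.5.0 image
docker run --rm -it --gpus all --ipc=host openmmlab/lmdeploy:v0.5.0 bash
# then, inside the container, run the pip3 install command above to
# force-reinstall the nightly wheel over the bundled lmdeploy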
pip doesn't like that URL due to the +cu121+49208aa part, but I removed that like I did previously (following Claude AI's advice). I tried using openmmlab/lmdeploy:v0.5.0 first, but that failed, saying something like "turbomind was not installed correctly, falling back to pytorch backend". Then I tried nvcr.io/nvidia/tritonserver:22.12-py3, since it is the base image used in the Dockerfile in this repo. It errored with No module named 'transformers'. So then I tried installing some of the stuff that the Dockerfile installs:
rm /etc/apt/sources.list.d/cuda*.list && apt-get update && apt-get install -y --no-install-recommends \
rapidjson-dev libgoogle-glog-dev gdb python3.8-venv \
&& rm -rf /var/lib/apt/lists/* && cd /opt && python3 -m venv py38
python3 -m pip install --no-cache-dir --upgrade pip setuptools==69.5.1 &&\
python3 -m pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118 &&\
python3 -m pip install --no-cache-dir cmake packaging wheel
But I still got No module named 'transformers'. So I tried removing --no-deps from the command you gave (maybe I should have done this at the start), and then it installed and ran correctly. But then when I ran it and did the 30 concurrent requests as mentioned in my original post, I got the same error:
what(): [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
Again, I'm not sure if this is because I did something wrong during install.
Honestly I think I am not the person best suited to trying to set this up because I don't really understand it - it takes me a long time (more than an hour to do this) and I end up mostly confused :sweat_smile: I'm not in a hurry to upgrade from 0.4.2 so I will wait for the next release, and give feedback on that. Please feel free to close this issue if you do not personally observe the problem after trying the steps that I have reported in the original post.
Thank you for your attempt and response. I'll try to replicate it again.
Checklist

Describe the bug

The v0.4.2 official docker image could handle many concurrent requests without crashing, but v0.5.0 cannot. It crashes with CUDA runtime error: out of memory. Right at the moment that requests start failing, the end of the DEBUG logs looks like the excerpt in the Error traceback section below, and the last 10k lines of the DEBUG logs at the moment the CUDA runtime error: out of memory line appears are linked there as well.

Reproduction

I reproduced this on a 1xA40 machine and a 2x4090 machine. Both work fine with openmmlab/lmdeploy:v0.4.2, and both fail with openmmlab/lmdeploy:v0.5.0.

Command:
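Roughly, I start the server like this (an illustrative sketch with <...> placeholders, not my exact command):

# launch the OpenAI-compatible api_server from the official image (sketch)
docker run --rm --gpus all --ipc=host -p 23333:23333 \
  -v <host-model-dir>:/models \
  openmmlab/lmdeploy:v0.5.0 \
  lmdeploy serve api_server /models/<model> --tp <num-gpus> --log-level DEBUG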
Send 30 concurrent requests - note that in the code below I add i at the start of the prompt to prevent prefix caching. If you don't do that, then it doesn't crash.
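The request loop looks roughly like this (a sketch rather than my exact script; <model> is a placeholder for the served model name, and the endpoint is the api_server's OpenAI-compatible one on its default port):

# fire 30 requests concurrently; prepending the loop index i to each prompt
# gives every request a unique prefix, which defeats prefix caching
for i in $(seq 1 30); do
  curl -s http://localhost:23333/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "<model>", "messages": [{"role": "user", "content": "'"$i"' Tell me a very long story."}], "max_tokens": 1024}' \
    > /dev/null &
done
wait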
Environment

Error traceback
Last 10k lines: https://gist.github.com/josephrocca/3686c80f508a939dcf14c598b55db2b3
Last 300 lines: