Open · cody-moveworks opened this issue 1 year ago
Me too. So you need to use Triton to deploy it, not the Python runtime.
@cody-moveworks
Thanks for summarizing the concrete steps to reproduce your issue.
As @Tlntin said, you can try the C++ runtime first to see whether the OOM issue still exists.
In the meantime, we will also start investigating this issue.
Thanks,
June
Any progress here? BTW, when will 0.6.0 be released?
> Me too. So you need to use Triton to deploy it, not the Python runtime.
Could you please elaborate on why Triton doesn't have this issue? Does Triton deploy the C++ runtime? Thanks!
> Could you please elaborate on why Triton doesn't have this issue?

I don't know why.

> Does Triton deploy the C++ runtime?

Yes.
Any update?
Still seeing the issue with the Python benchmark: with `paged_kv_cache` disabled, llama2-7b runs with bs=32, in=128, out=2048, while enabling `paged_kv_cache` gives OOM.
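For context, a hypothetical invocation of the Python benchmark along the lines described above. The script path and flag names follow the `benchmarks/python/benchmark.py` interface as commonly documented and may differ in your TensorRT-LLM release, so treat them as assumptions rather than the commenter's exact command:

```bash
# Hypothetical benchmark run (flags are assumptions, not copied from the thread).
python3 benchmarks/python/benchmark.py \
    -m llama_7b \
    --mode plugin \
    --dtype float16 \
    --batch_size "32" \
    --input_output_len "128,2048"
# Reported behaviour: the run completes with the paged KV cache disabled, but
# OOMs when the engine is built/benchmarked with the paged KV cache enabled
# (the exact flag for toggling this differs between releases).
```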
Opening a new issue as #237 was closed prematurely.
It seems that engines built using the `--paged_kv_cache` flag leak GPU memory. Below is a minimal reproducible example that can be used to trigger a GPU out-of-memory error. The `ENGINE_DIR` and `TOKENIZER_DIR` variables should be changed accordingly.
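The repro script itself did not survive extraction. The following is only a sketch of what such a script could look like, assuming a TensorRT-LLM release that ships the Python `ModelRunner` API; the paths, prompt, and loop bounds are placeholders, not the author's code.

```python
# Hedged sketch of a paged-KV-cache repro loop; NOT the author's original
# script. Assumes tensorrt_llm's ModelRunner API and a Hugging Face tokenizer.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

ENGINE_DIR = "/tmp/CodeLlama/7B/trt_engines"  # placeholder: built engine dir
TOKENIZER_DIR = "/tmp/CodeLlama/7B/hf"        # placeholder: HF tokenizer dir

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
runner = ModelRunner.from_dir(engine_dir=ENGINE_DIR)

prompt = "Write a Python function that computes the factorial of a number."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

# Generate repeatedly with the same engine. With --paged_kv_cache the reported
# behaviour is that GPU memory keeps growing until an OOM error is raised.
for step in range(1000):
    with torch.no_grad():
        runner.generate(
            batch_input_ids=[input_ids],
            max_new_tokens=256,
            end_id=tokenizer.eos_token_id,
            pad_id=tokenizer.eos_token_id,
        )
    # Only reflects PyTorch-side allocations; runtime-managed buffers are not
    # visible here, but the trend is still a useful signal.
    print(f"step {step}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated")
```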
I am running my tests in a Docker container running an image built using the Dockerfile provided by this GitHub repo (i.e. by running `make -C docker release_build`). I am using an NVIDIA A100 40G GPU for my tests. I used Meta's pre-trained CodeLlama-7b-instruct model for the tests. `/tmp/CodeLlama/7B/hf` is a directory containing both the Hugging Face PyTorch model and tokenizer files for the model.

We first convert the Hugging Face PyTorch model checkpoint to a FasterTransformer model checkpoint to prepare it for engine building with SmoothQuant and int8 KV cache features enabled:
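The conversion and build commands themselves were lost in extraction. The sketch below approximates the `examples/llama` workflow (`hf_llama_convert.py` followed by `build.py`); the exact flag names vary between TensorRT-LLM releases and are assumptions here, not the author's verbatim commands.

```bash
# Hedged reconstruction: flag names are assumptions and may differ per release.
# Convert the HF checkpoint, calibrating SmoothQuant and int8 KV cache scales:
python3 examples/llama/hf_llama_convert.py \
    -i /tmp/CodeLlama/7B/hf \
    -o /tmp/CodeLlama/7B/ft \
    -sq 0.8 \
    --calibrate-kv-cache \
    -t fp16

# Build the engine from the converted checkpoint. Adding --paged_kv_cache to
# this command is what reportedly triggers the OOM described below.
python3 examples/llama/build.py \
    --ft_model_dir /tmp/CodeLlama/7B/ft/1-gpu \
    --use_smooth_quant --per_token --per_channel \
    --int8_kv_cache \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir /tmp/CodeLlama/7B/trt_engines
```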
If the engine is built with this command, then running the script above does not result in GPU OOM. If you add the `--paged_kv_cache` flag when building the engine, then running the script above leads to GPU OOM.