anaivebird opened 1 month ago
gpu memory leak when max_tokens = 1
Can you try it without gather_all_token_logits? For the case with gather_all_token_logits, we need to investigate it.
Thanks. To the best of my memory, it works well without gather_all_token_logits.
Got it, it may be an issue with gather_all_token_logits. We will reproduce and investigate it. Thanks for reporting the issue.
Hi @anaivebird, thanks for reporting this issue. I can reproduce it on my side.

The logits tensor takes a lot of memory. In your case, say the context length is 300 and the vocab size is 151851; then each context logits tensor takes 300 * 151851 * 4 = 182,221,200 bytes ≈ 0.17 GB. Note that the logits are stored in float32.

The model is 7B, so its weights take 7 * 2 = 14 GB. Say the activation memory and runtime buffers take an additional 1 GB. free_gpu_memory_fraction is 0.8 by default in openai_server.py, so the KV cache pool takes (80 - 14 - 1) * 0.8 = 52 GB. The remaining memory is 80 - 14 - 1 - 52 = 13 GB, which can hold 13 / 0.17 ≈ 76.5 requests' logits tensors.
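For reference, here is a minimal back-of-the-envelope sketch of that arithmetic. The numbers (80 GB GPU, 7B fp16 model, ~1 GB of activation/runtime buffers, free_gpu_memory_fraction = 0.8) are just the assumptions stated above, not values queried from any TensorRT-LLM API:

```python
# Back-of-the-envelope sketch of the memory budget above (sizes in GiB).
# Assumed: 80 GB GPU, 7B fp16 model, ~1 GB activations/runtime buffers,
# free_gpu_memory_fraction = 0.8 (default in openai_server.py).

GIB = 1024 ** 3

context_len = 300
vocab_size = 151851
logits_bytes = context_len * vocab_size * 4            # context logits are float32
logits_gib = logits_bytes / GIB                        # ~0.17 GiB per request

total = 80.0                                           # total GPU memory
weights = 7 * 2.0                                      # 7B params * 2 bytes (fp16) = 14
activations = 1.0                                      # rough activation/runtime buffers
kv_cache = (total - weights - activations) * 0.8       # 52 GiB for the KV cache pool
remaining = total - weights - activations - kv_cache   # 13 GiB left for logits

print(f"context logits per request: {logits_gib:.2f} GiB")
print(f"requests whose logits fit:  {remaining / logits_gib:.1f}")  # ~76, matching the estimate above
```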
In my experiments, it works if I set --max_batch_size 64 when calling trtllm-build. Could you please try that? Alternatively, you may use a smaller value for free_gpu_memory_fraction in openai_server.py, which allows a larger max_batch_size.
cc @yweng0828 for viz.
Thanks!
It seems necessary to reserve the corresponding memory based on the vocabulary size, prompt token length, and max_batch_size, right?
Yes. But currently we don't reserve the maximum size of the logits the way we do for activation memory and other runtime buffers, because the logits would typically take too much memory.
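To give a sense of why pre-reserving the worst case is impractical, here is a rough sketch: the upper bound grows with max_batch_size, max_input_len, and the vocab size. The build settings used in the example call below are purely illustrative and not taken from this issue:

```python
# Sketch: worst-case memory that pre-reserving context logits would need,
# if it were handled like activation memory / runtime buffers.
# The max_batch_size / max_input_len values below are illustrative only.

def worst_case_context_logits_gib(max_batch_size: int,
                                  max_input_len: int,
                                  vocab_size: int = 151851) -> float:
    """Upper bound: every request at max length, gathering float32 logits."""
    return max_batch_size * max_input_len * vocab_size * 4 / 1024 ** 3

# e.g. a build with max_batch_size=256 and max_input_len=2048 would need
# roughly 297 GiB just for context logits -- far more than the whole GPU.
print(f"{worst_case_context_logits_gib(256, 2048):.0f} GiB")
```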
As shown in https://github.com/NVIDIA/TensorRT-LLM/issues/2350, changing free_gpu_memory_fraction does not increase free GPU memory; even changing it to 0.1 makes no difference.
System Info
Who can help?
@byshiue @juney-nvidia @ncomly-nvidia
Information
Tasks
Reproduction
Expected behavior
1000 requests should finish normally.
actual behavior
additional notes
no