0xymoro opened 10 months ago
It seems like stop words also trigger this, nondeterministically.
When end_id = 2 is used, or when \ is used as a stop word, it doesn't trigger the memory issue above, but rather freezes silently.
It's looking like there's an underlying issue with the stopping logic at a low level for the larger batch sizes (15-20) I'm stress testing with.
Can you reproduce this issue on a smaller model to help with debugging?
@byshiue if you want, I can privately share the fp8 engine we built for it (~70 GB) and you can run it directly along with the TRT backend settings we use. You can email me at zmeng90@alumni.stanford.edu and I'll send it over, thanks.
Hi @0xymoro. I think it would be simpler to find a smaller model that reproduces the issue first; otherwise we have to find a multi-GPU node to reproduce it and debug the multi-GPU case, which is more complicated.
From #448 it seems like it may be an issue with tensor parallelism. Could you try a small llama with TP 4 and see if it runs into the same thing? If not, it may be specific to 70b.
@0xymoro can you verify if the issue persists with Llama-v2-7B and TP=2 or TP=4, please? Also, are you using the main branch?
@jdemouth-nvidia @byshiue I am using the main branch, still the latest as of now, built the TRTLLM & Triton backend images from source last week.
Right now I'm crunched for time (I'm full-time on my startup), and we're using all available GPUs for inference, so I'll need to find a gap there. I can't promise a timeline for building the many 7b engines and running these tests; sorry I can't be more helpful, but I have very limited bandwidth for all this building and testing. Hopefully your engineers have more time to devote to debugging this.
Also, if you give me a setup where it does NOT replicate for your engineers, I can cross-reference it, see which parts are different, and potentially narrow down the problem a lot.
I do think this should be a top priority for TRTLLM to fix, as it renders 70b llama (and potentially all llama sizes, if it replicates on 7b at those TPs) unusable for anything but a small demo. We were planning to use TRTLLM in production, but this issue is preventing it, so we're currently back on our old HF TGI setup.
If it's helpful, here is the text file I used to curl the endpoint. With bad words and stop words set, it's bound to either (1) freeze or (2) show the illegal memory access.
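For reference, the payload looks roughly like this. This is a hedged sketch: the field names (text_input, max_tokens, bad_words, stop_words, end_id) assume the default tensorrt_llm ensemble model and Triton's generate endpoint, and may differ in your model config.

```shell
# Hypothetical request payload; field names assume the default tensorrt_llm
# ensemble config and Triton's HTTP generate endpoint. Adjust to your setup.
cat > payload.json <<'EOF'
{
  "text_input": "<~3000-token prompt here>",
  "max_tokens": 256,
  "bad_words": "",
  "stop_words": "",
  "end_id": 2
}
EOF
# Then, with <host> replaced by the real endpoint:
# curl -kX POST https://<host>/v2/models/ensemble/generate -d @payload.json
```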
This is the command I used to stress test it with 20 in parallel:
printf '%s\n' {1..20} | xargs -I % -P 20 curl -kX POST https://<<
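The fan-out pattern itself can be dry-run without the server; a minimal sketch, with echo standing in for the real curl call (which is deployment-specific):

```shell
# Fire 20 jobs with up to 20 in flight, mirroring the stress command above.
# Swap the echo for the real `curl -kX POST https://<host>/... -d @input.txt`.
printf '%s\n' {1..20} | xargs -I % -P 20 echo "request %"
```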
I have tried to reproduce this on llama 7b but failed; I will try on 70b when I have the time and resources.
Update on this. I tried TP2 and no longer get any "illegal memory access" errors. However, when end_id = 2 is supplied from the client side, the system freezes indefinitely at high context + large batch sizes (20).
Removing end_id made it work: I tried 500 requests at 20 in parallel and none of them froze.
I can open a separate issue for end_id; it may also have to do with the older version I built the engine with. But I can confirm that TP2 doesn't show the same behavior as TP4 in terms of memory access errors (at least not explicitly).
Maybe I faced the same problem when I used my fp8 model.
Building on main and still seeing the same issue: medium batch sizes (20) freeze the entire engine. This time I got some logs. This is a llama 2 70b model; it's a strange issue that pops up when bad_words is used.
It was built with pretty standard settings after the standard fp8 quantization for llama 2:

python build.py --model_dir /mnt/nvme/v2.1 \
    --quantized_fp8_model_path /mnt/nvme/v2.1-fp8-npz/llama_tp1_rank0.npz \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir /mnt/nvme/v2.1-trtllm-fp8-tp4 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --enable_context_fmha \
    --max_input_len 4096 \
    --max_batch_size 128 \
    --strongly_typed \
    --enable_fp8 \
    --fp8_kv_cache \
    --world_size 4 \
    --tp_size 4 \
    --parallel_build
Running on 4x H100s, CUDA 12.2. Built TRTLLM & the Triton backend from the latest main branch as of now.
It was run with a string of around 3000 tokens, curl'd with 20 of them in parallel. Bad words was set to \
as there's an issue with llama not recognizing the end token or the like.

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(tgt, src, sizeof(T) * size, cudaMemcpyDefault, stream): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/common/memoryUtils.cu:211)
1  0x7fb397c90a46 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15aa46) [0x7fb397c90a46]
2  0x7fb397c79ff0 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x143ff0) [0x7fb397c79ff0]
3  0x7fb397c5e0df /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1280df) [0x7fb397c5e0df]
4  0x7fb397c43c62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x10dc62) [0x7fb397c43c62]
5  0x7fb397bdf0a2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa90a2) [0x7fb397bdf0a2]
6  0x7fb397b9c020 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66020) [0x7fb397b9c020]
7  0x7fb397b9cbfb /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66bfb) [0x7fb397b9cbfb]
8  0x7fb397ba00bd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6a0bd) [0x7fb397ba00bd]
9  0x7fb397b8df11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57f11) [0x7fb397b8df11]
10 0x7fb397b909b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5a9b2) [0x7fb397b909b2]
11 0x7fb682e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb682e64253]
12 0x7fb682bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb682bf4ac3]
13 0x7fb682c86a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fb682c86a40]

[TensorRT-LLM][ERROR] Encountered error for requestId 1315634023: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(tgt, src, sizeof(T) * size, cudaMemcpyDefault, stream): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/common/memoryUtils.cu:211)
(same stack trace as above, frames 1-13)

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:140)
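For faults like this, CUDA's compute-sanitizer can sometimes localize the kernel that performs the illegal access. This is only a hedged sketch: the tritonserver invocation and model-repository path are placeholders for whatever your deployment actually runs, and the slowdown under the sanitizer is substantial.

```shell
# Hypothetical debugging run: launch the server under compute-sanitizer
# (ships with recent CUDA toolkits) to catch the first illegal access.
# The tritonserver flags and path below are placeholders.
#
#   compute-sanitizer --tool memcheck --log-file sanitizer.%p.log \
#       tritonserver --model-repository=/path/to/model_repo
#
# The resulting log usually names the offending kernel and access address,
# which would help tie the crash to the bad_words/stop_words handling.
```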