NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Illegal memory access at medium batch sizes when using bad_words #427

Open 0xymoro opened 10 months ago

0xymoro commented 10 months ago

Building on main, I still hit the same issue: medium batch sizes (20) freeze the entire engine. This time I got some logs. This is a Llama 2 70B model. The strange part is that the issue only pops up when bad_words is used.

It was built with pretty standard settings after the standard fp8 quantization for Llama 2:

python build.py --model_dir /mnt/nvme/v2.1 \
    --quantized_fp8_model_path /mnt/nvme/v2.1-fp8-npz/llama_tp1_rank0.npz \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir /mnt/nvme/v2.1-trtllm-fp8-tp4 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --enable_context_fmha \
    --max_input_len 4096 \
    --max_batch_size 128 \
    --strongly_typed \
    --enable_fp8 \
    --fp8_kv_cache \
    --world_size 4 \
    --tp_size 4 \
    --parallel_build

Running on 4x H100s, CUDA 12.2. Built TRTLLM & the Triton backend from the latest main branch as of now.

It was run with a string of around 3000 tokens, curl'd with 20 of them in parallel. bad_words was set to \ because there's an issue with Llama not recognizing the end token or something along those lines.

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(tgt, src, sizeof(T) * size, cudaMemcpyDefault, stream): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/common/memoryUtils.cu:211)
1  0x7fb397c90a46 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15aa46) [0x7fb397c90a46]
2  0x7fb397c79ff0 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x143ff0) [0x7fb397c79ff0]
3  0x7fb397c5e0df /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1280df) [0x7fb397c5e0df]
4  0x7fb397c43c62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x10dc62) [0x7fb397c43c62]
5  0x7fb397bdf0a2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa90a2) [0x7fb397bdf0a2]
6  0x7fb397b9c020 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66020) [0x7fb397b9c020]
7  0x7fb397b9cbfb /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66bfb) [0x7fb397b9cbfb]
8  0x7fb397ba00bd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6a0bd) [0x7fb397ba00bd]
9  0x7fb397b8df11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57f11) [0x7fb397b8df11]
10 0x7fb397b909b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5a9b2) [0x7fb397b909b2]
11 0x7fb682e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb682e64253]
12 0x7fb682bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb682bf4ac3]
13 0x7fb682c86a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fb682c86a40]
[TensorRT-LLM][ERROR] Encountered error for requestId 1315634023: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in cudaMemcpyAsync(tgt, src, sizeof(T) * size, cudaMemcpyDefault, stream): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/common/memoryUtils.cu:211)
1  0x7fb397c90a46 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x15aa46) [0x7fb397c90a46]
2  0x7fb397c79ff0 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x143ff0) [0x7fb397c79ff0]
3  0x7fb397c5e0df /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1280df) [0x7fb397c5e0df]
4  0x7fb397c43c62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x10dc62) [0x7fb397c43c62]
5  0x7fb397bdf0a2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xa90a2) [0x7fb397bdf0a2]
6  0x7fb397b9c020 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66020) [0x7fb397b9c020]
7  0x7fb397b9cbfb /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66bfb) [0x7fb397b9cbfb]
8  0x7fb397ba00bd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6a0bd) [0x7fb397ba00bd]
9  0x7fb397b8df11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x57f11) [0x7fb397b8df11]
10 0x7fb397b909b2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5a9b2) [0x7fb397b909b2]
11 0x7fb682e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fb682e64253]
12 0x7fb682bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb682bf4ac3]
13 0x7fb682c86a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7fb682c86a40]
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): an illegal memory access was encountered (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:140)

0xymoro commented 10 months ago

It seems like stop words also trigger this, nondeterministically.

When end_id = 2 is used, or when \ is used as a stop word, it doesn't trigger the memory issue above, but rather freezes silently.

It looks like there's an underlying low-level issue with the stopping logic at the larger batch sizes (15-20) I'm stress testing with.

byshiue commented 9 months ago

Can you reproduce this issue on a smaller model to help with debugging?

0xymoro commented 9 months ago

@byshiue if you want, I can privately share the fp8 engine we built for it (~70 GB) and you can run it directly, along with the Triton backend settings we used. You can email me at zmeng90@alumni.stanford.edu and I'll send it over, thanks.

byshiue commented 9 months ago

Hi @0xymoro. I think it would be simpler to find a smaller model that reproduces the issue first, so we don't have to track down a multi-GPU node and debug the multi-GPU case, which is more complicated.

0xymoro commented 9 months ago

From #448 it seems like it may be an issue with tensor parallelism. Could you try a small Llama with TP=4 and see if it runs into the same thing? If not, it may be specific to 70B.

jdemouth-nvidia commented 9 months ago

@0xymoro can you verify if the issue persists with Llama-v2-7B and TP=2 or TP=4, please? Also, are you using the main branch?
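
For reference, a repro build along these lines should be enough (paths are placeholders and fp8 is dropped to rule quantization out; the remaining flags just mirror the 70B command above, so treat this as a sketch rather than a verified config):

python build.py --model_dir /path/to/llama-2-7b-hf \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir /path/to/llama-2-7b-trtllm-tp2 \
    --remove_input_padding \
    --use_inflight_batching \
    --paged_kv_cache \
    --enable_context_fmha \
    --max_input_len 4096 \
    --max_batch_size 128 \
    --world_size 2 \
    --tp_size 2 \
    --parallel_build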

0xymoro commented 9 months ago

@jdemouth-nvidia @byshiue I am using the main branch, still the latest as of now; I built the TRTLLM & Triton backend images from source last week.

Right now I'm crunched for time (I'm full-time on my startup), and we're using all available GPUs for inference, so I'll need to find a gap there. I can't promise a timeline for building the various 7B engines and running the tests; sorry I can't be more helpful, but I have very limited bandwidth for all this building and testing. Hopefully your engineers have more time to devote to debugging this.

Also, if you share the setup where it does NOT replicate for your engineers, I can cross-reference it, see which parts are different, and potentially isolate the problem a lot.

I do think this should be a top priority for TRTLLM to fix, as it renders 70B Llama (and potentially all Llama models, if it replicates on 7B at those TP sizes) unusable for anything but a small demo. We were planning on using TRTLLM in production, but this issue is preventing that, so for now we're back on our old HF TGI setup.

0xymoro commented 9 months ago

temp5.txt

If it's helpful, here is the text file I used to curl the endpoint. With bad words and stop words set, it's bound to either 1) freeze or 2) show the illegal memory access.

This is the command I used to stress test it with 20 in parallel:

printf '%s\n' {1..20} | xargs -I % -P 20 curl -kX POST https://<<>>/v2/models/ensemble/generate -d @temp5.txt
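
For anyone trying to reproduce: temp5.txt is attached rather than inlined, so as an assumption about its shape (field names follow the default tensorrtllm_backend ensemble inputs; the real file and values may differ), a payload along these lines exercises the same path — a ~3000-token prompt with bad_words/stop_words set:

# Hypothetical sketch only; the actual contents are in the temp5.txt attachment above.
cat > temp5.txt <<'EOF'
{
  "text_input": "<roughly 3000-token prompt goes here>",
  "max_tokens": 512,
  "bad_words": "\\",
  "stop_words": "\\"
}
EOF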

byshiue commented 9 months ago

I have tried to reproduce this on Llama 7B but failed. I will try 70B when I have the time and resources.

0xymoro commented 8 months ago

Update on this. I tried TP=2 and no longer get any "illegal memory access" errors. However, when end_id = 2 is supplied from the client side, the system freezes indefinitely at high context + large batch sizes (20).

Removing end_id made it work: I tried 500 requests at 20 in parallel and none of them froze.

I can open a separate issue for end_id, though it may also have to do with the older version I built the engine with. In any case, I can confirm TP=2 doesn't show the same behavior as TP=4 in terms of memory access errors (at least not explicitly).
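
For concreteness, the only client-side difference between the freezing and non-freezing runs was whether end_id was present in the request body (again assuming the default ensemble field names; values are illustrative):

# Freezes indefinitely at high context + batch size 20 on the TP=2 engine:
#   {"text_input": "<long prompt>", "max_tokens": 512, "end_id": 2}
# Completes normally (500 requests, 20 in parallel, no freeze):
#   {"text_input": "<long prompt>", "max_tokens": 512}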

focusunsink commented 2 months ago

I may have faced the same problem when using my fp8 model.