NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Tactic running out of memory during Code Llama 34B build #29

Closed · michaelroyzen closed this issue 11 months ago

michaelroyzen commented 11 months ago

On machines with either 8x A100-80GB or 8x H100, I'm getting many tactic out of memory issues during the build.

The tactic reports that it is requesting 530000 MB while the GPU has 80 GB, yet I only observe ~10 GB of GPU memory utilization during the build.

Here is my script:

python build.py --model_dir ./Phind-CodeLlama-34B-v2 \
                --dtype bfloat16 \
                --remove_input_padding \
                --use_gpt_attention_plugin bfloat16 \
                --enable_context_fmha \
                --use_gemm_plugin bfloat16 \
                --paged_kv_cache \
                --use_parallel_embedding \
                --use_inflight_batching \
                --max_input_len 14848 \
                --max_output_len 1536 \
                --vocab_size 32000 \
                --rotary_base 1000000 \
                --output_dir ./Phind/Phind-CodeLlama-34B-v2/trt-engines/bf16/8-gpu \
                --world_size 8 \
                --tp_size 8 \
                --parallel_build

The same issue happens with much smaller input and output lengths as well, which suggests that the sequence lengths aren't the cause.

Phind-CodeLlama-34B is a standard 34B Code Llama that has been fine-tuned but is architecturally identical and is available here: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.

  1. Are these tactic errors resulting in a less optimized model? The model is still usable, but it's slower than I expected.
  2. I also tried running with --builder_opt=5 for maximum optimization, but that engine completely fails to load in the Triton backend.

The documentation here could be improved -- I'd love to know what I can do to get the most optimized model possible @byshiue.

jdemouth commented 11 months ago

Hi @michaelroyzen , thanks for reporting this. In the build phase, TensorRT will try many different tactics and it’s ok if some of them fail due to OOM. It does not mean that the engine will be slower.

If the performance is not where you’d like it to be, I encourage you to share a command to reproduce the perf issue. You can use this GitHub issue or reach out directly and we’ll investigate together.

Regarding the documentation, if you have a concrete suggestion for how to improve it, it's more than welcome. ;)

Thanks, Julien

michaelroyzen commented 11 months ago

Thank you, Julien. I've confirmed that the engine is not slower when the tactics don't succeed.

However, I've noticed that decoding speed is significantly slower when a long context is submitted. I get 75 tokens per second of throughput on 8x A100-80GB with a 500-token input, but only 50 tokens per second with a 5000-token input. The discrepancy is even worse when running with lower TP.

Shouldn't the sampling time be similar for both sequences? Is this a paged-attention limitation? What optimizations can I make to improve long-context decoding speed? @jdemouth

Best,

Michael

Also, what's the difference between max_batch_size in the engine builder and max_num_sequences in TrtGptModelOptionalParams?

jdemouth-nvidia commented 11 months ago

Hi @michaelroyzen ,

Correct me if I'm wrong, but I think the attention has more work to do when the input is longer. During the generation/decoding phase, instead of computing the dot products with 500 past K/V vectors, you have to compute them with 5,000. That's 10x more memory traffic, so the impact of attention increases and affects performance. Do you agree with that?

Regarding your other question and the difference between max_batch_size and max_num_sequences, I've posted an answer in issue 65.

Thanks, Julien

michaelroyzen commented 11 months ago

Thank you, Julien.

I believe that longer context lengths result in a compute-bound workload as opposed to a memory-bound workload from a KV cache perspective. However, only ~160/400 W per GPU are being utilized when running TP=8 on 8x A100s that have all-to-all NVLink. I suspect that there are additional compute/memory optimizations that could be made for workloads with long context lengths.

Best,

Michael

jdemouth-nvidia commented 11 months ago

The more input tokens, the more likely the context/prompt phase will be compute bound (as we have bigger GEMMs). However, the generation/decode phase remains memory-bound as you process only one token at a time (for batch size 1) and a lot of operations are actually matrix-vector products.
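
To make the memory-bound point concrete, here is a rough back-of-envelope sketch (my own illustration; the layer/head/head-dim numbers are placeholder assumptions, not the actual Phind-CodeLlama-34B configuration):

# Rough estimate of KV-cache bytes streamed per generated token during decode.
# The model dimensions below are illustrative placeholders, not the exact
# Phind-CodeLlama-34B config.
def kv_bytes_per_step(seq_len, n_layers=48, n_kv_heads=8, head_dim=128,
                      bytes_per_elem=2):  # bfloat16 -> 2 bytes
    # Every decode step re-reads all cached K and V vectors in every layer.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

for ctx in (500, 5000):
    gb = kv_bytes_per_step(ctx) / 1e9
    print(f"context {ctx:>5}: ~{gb:.2f} GB of KV cache read per generated token")

The absolute numbers depend on the real model config, but the 10x ratio between the two context lengths is what drives the observed slowdown.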

Of course, if you are interested in a case where you have 5,000 input tokens and 1 output token, you are trending toward being entirely compute bound. Is that the scenario you're interested in?

michaelroyzen commented 11 months ago

I'm trying to generate 1000+ tokens with 5000+ token context inputs. There's been an interesting update to FlashAttention (Flash-Decoding) that parallelizes attention across the KV cache: https://crfm.stanford.edu/2023/10/12/flashdecoding.html. Having this feature in TRT-LLM would be very helpful.
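
For readers following along, the core idea of the linked Flash-Decoding post can be sketched in a few lines of NumPy: split the KV cache into chunks, compute partial attention per chunk (in parallel on the GPU), then combine the partial results using their softmax statistics. This is only an illustration of the algorithm, not TRT-LLM code:

import numpy as np

def split_kv_decode_attention(q, K, V, n_chunks=4):
    """Single-query attention over a long KV cache, computed chunk by chunk.

    q: (d,) query for the current decode step
    K, V: (seq_len, d) cached keys/values
    Each chunk produces a partial output plus its softmax statistics
    (running max and sum), which are combined at the end -- the same trick
    flash-decoding uses to parallelize over the sequence dimension.
    """
    d = q.shape[0]
    partials = []
    for Kc, Vc in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks)):
        s = Kc @ q / np.sqrt(d)          # chunk scores
        m = s.max()                      # chunk max for numerical stability
        p = np.exp(s - m)
        partials.append((m, p.sum(), p @ Vc))
    # Combine the chunks with a log-sum-exp style reduction.
    m_all = max(m for m, _, _ in partials)
    num = sum(np.exp(m - m_all) * o for m, _, o in partials)
    den = sum(np.exp(m - m_all) * z for m, z, _ in partials)
    return num / den

# Sanity check against single-pass softmax attention.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=128), rng.normal(size=(5000, 128)), rng.normal(size=(5000, 128))
ref = (lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum())(K @ q / np.sqrt(128)) @ V
assert np.allclose(split_kv_decode_attention(q, K, V), ref)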

Is the embedding_sharding_dim in TRT-LLM similar in nature?

On a related note, there's an inconsistency in build.py for llama:

parser.add_argument(
        '--embedding_sharding_dim',
        type=int,
        default=1,  # Meta does TP on hidden dim
        choices=[0, 1],
        help=
        'By default the embedding lookup table is sharded along vocab dimension (embedding_sharding_dim=0). '
        'To shard it along hidden dimension, set embedding_sharding_dim=1'
        'Note: embedding sharing is only enabled when embedding_sharding_dim = 0'
    )

The help text claims that the default is 0, but it is actually 1, and it states that embedding sharing is only enabled on dim 0. Would switching to dim 0 achieve a speedup?

Update: I tried adding --embedding_sharding_dim 0 to my build command from above and the engine build failed completely:

[10/23/2023-07:41:21] [TRT-LLM] [I] HF LLaMA loaded. Total time: 00:00:40
[10/23/2023-07:41:21] [TRT-LLM] [I] Loading weights from HF LLaMA...
[10/23/2023-07:41:23] [TRT-LLM] [I] Weights loaded. Total time: 00:00:02
[10/23/2023-07:41:23] [TRT-LLM] [I] Context FMHA Enabled
[10/23/2023-07:41:23] [TRT-LLM] [I] Remove Padding Enabled
[10/23/2023-07:41:23] [TRT-LLM] [I] Paged KV Cache Enabled
Traceback (most recent call last):
  File "/home/ubuntu/TensorRT-LLM/examples/llama/build.py", line 714, in <module>
    mp.spawn(build, nprocs=args.world_size, args=(args, ))
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/TensorRT-LLM/examples/llama/build.py", line 689, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/home/ubuntu/TensorRT-LLM/examples/llama/build.py", line 623, in build_rank_engine
    tensorrt_llm_llama(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 328, in forward
    hidden_states = super().forward(input_ids, position_ids, use_cache,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 213, in forward
    hidden_states = self.vocab_embedding(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/embedding.py", line 62, in forward
    return embedding(x,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 1884, in embedding
    x = allreduce(x, tp_group)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2762, in allreduce
    plug_inputs.append(workspace.trt_tensor)
AttributeError: 'NoneType' object has no attribute 'trt_tensor'

jdemouth-nvidia commented 11 months ago

You may want to take a look at the multi_block_mode in our Attention plugin. It's not too far from flash decoding.

The embedding sharding allows you to distribute the embedding table across the different GPUs along one of the two dimensions (rows or columns).
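
To make the two choices concrete, here is a toy NumPy illustration (my own sketch, not TRT-LLM code). With embedding_sharding_dim=0, each rank owns a slice of the vocabulary rows, looks up only the token IDs it owns, and an all-reduce sums the partial results (which is consistent with the failing allreduce in the traceback above). With embedding_sharding_dim=1, each rank owns a slice of the hidden columns and the per-rank lookups are concatenated (all-gather):

import numpy as np

# Toy illustration of the two embedding-sharding schemes (not TRT-LLM code).
vocab, hidden, tp = 8, 6, 2
table = np.arange(vocab * hidden, dtype=float).reshape(vocab, hidden)
token_ids = np.array([1, 5, 7])

# embedding_sharding_dim = 0: split rows (vocab dimension) across ranks.
# Each rank zeroes out tokens it doesn't own; an all-reduce (sum) restores
# the full embeddings.
rows = np.array_split(np.arange(vocab), tp)
partial = []
for owned in rows:
    local = np.zeros((len(token_ids), hidden))
    mask = np.isin(token_ids, owned)
    local[mask] = table[token_ids[mask]]
    partial.append(local)
vocab_sharded = sum(partial)                        # stands in for the all-reduce

# embedding_sharding_dim = 1: split columns (hidden dimension) across ranks.
# Each rank looks up its slice of every row; an all-gather concatenates them.
cols = np.array_split(np.arange(hidden), tp)
hidden_sharded = np.concatenate(
    [table[:, c][token_ids] for c in cols], axis=1)  # stands in for the all-gather

assert np.allclose(vocab_sharded, table[token_ids])
assert np.allclose(hidden_sharded, table[token_ids])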

The build fails when you use:

python build.py --model_dir ./Phind-CodeLlama-34B-v2 \
                --dtype bfloat16 \
                --remove_input_padding \
                --use_gpt_attention_plugin bfloat16 \
                --enable_context_fmha \
                --use_gemm_plugin bfloat16 \
                --paged_kv_cache \
                --use_parallel_embedding \
                --use_inflight_batching \
                --max_input_len 14848 \
                --max_output_len 1536 \
                --vocab_size 32000 \
                --rotary_base 1000000 \
                --output_dir ./Phind/Phind-CodeLlama-34B-v2/trt-engines/bf16/8-gpu \
                --world_size 8 \
                --tp_size 8 \
                --parallel_build \
                --embedding_sharding_dim 0

Is that correct? If so, I'm going to ask the engineer who implemented that feature to take a look at this issue.

michaelroyzen commented 11 months ago

Thanks, Julien. How do I enable the multi_block_mode for Llama? It doesn't seem to be directly supported by the Llama build script. I'd be happy to make any necessary modifications, but I'd appreciate some pointers. For context, I’m running the model using the Triton backend.

As for the embedding_sharding build failure, it only happens when --use_custom_all_reduce is also enabled. The build worked without it.

jdemouth-nvidia commented 11 months ago

Thanks, Michael. Let me ask the engineer who implemented multi_block_mode about an example, and let me talk to the engineer who worked on custom_all_reduce about the crash ;)

jdemouth-nvidia commented 11 months ago

Hi @michaelroyzen , we’ve been able to reproduce the issue with custom_all_reduce and the parallel embedding. We will work on a fix and update the main branch with that fix (and a couple of other ones) when it’s ready. Sorry about that.

byshiue commented 11 months ago

Hi @michaelroyzen, a quick way to enable multi_block_mode in Llama is to add

multi_block_mode=True

here: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/llama/model.py#L81. Please give it a try.

The argument is not exposed in the current builder args, so it cannot be enabled directly from the command line. We will add it soon. Sorry for the inconvenience.

michaelroyzen commented 11 months ago

Thank you @byshiue — will this work when running with the Triton backend? If not, can you please provide guidance for running multi_block_mode with Triton as well?

byshiue commented 11 months ago

Thank you @byshiue — will this work when running with the Triton backend? If not, can you please provide guidance for running multi_block_mode with Triton as well?

Yes. Model building is independent of Triton serving, so you can build the model with multi_block_mode=True first, copy the engine to the Triton backend model folder, and launch serving.

byshiue commented 11 months ago

I have checked to make sure the process enters https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/models/llama/model.py#L81. Did you reinstall tensorrt-llm after changing the code?

shangz-ai commented 11 months ago

We also have an implicit heuristic: even when multi_block_mode=True is passed in, the multi-block parallelization is only "truly" turned on when the number of tokens (input + generated) exceeds 1024. This is because multi-block mode is an optimization designed only for long sequence lengths in the generation phase.
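
In other words, the effective switch behaves roughly like the sketch below (a paraphrase of the heuristic described above, not the actual kernel-selection code; the 1024-token threshold is the only number taken from the comment):

def multi_block_effectively_on(multi_block_mode: bool,
                               input_tokens: int,
                               generated_tokens: int) -> bool:
    # Paraphrase of the described heuristic: the flag only takes effect once
    # the total sequence length exceeds 1024 tokens.
    return multi_block_mode and (input_tokens + generated_tokens) > 1024

print(multi_block_effectively_on(True, 500, 200))    # False: sequence too short
print(multi_block_effectively_on(True, 5000, 1000))  # True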

michaelroyzen commented 11 months ago

Thank you, I am seeing a massive ~30% speedup for longer context lengths now. I'm going to close the issue, but I'm curious why multi_block_mode=True isn't enabled by default. Given the existing heuristic that gates it even when the flag is set, it seems there aren't any tradeoffs to making it the default.

jdemouth-nvidia commented 11 months ago

Thanks a lot for all your efforts and the great feedback @michaelroyzen! I truly appreciate it. Long story short, @shangz-ai and I started the work on that feature a few months ago (for FasterTransformer) and we never had time to properly evaluate its impact on a sufficient number of workloads. Now that we have a first release of TensorRT-LLM (phew ;)), we will do the work needed to better characterise how performance changes with the feature and improve our heuristic for it. If we do not find cases that regress, we will probably enable it by default.

littletomatodonkey commented 5 months ago

Thank you, I am seeing a massive ~30% speedup for longer context lengths now. I'm going to close the issue, but I'm curious why multi_block_mode=True isn't enabled by default. Given the existing heuristic for it to be enabled even if the flag is set, it seems that there aren't any tradeoffs to having it be the default.

Hi @michaelroyzen, how did you enable multi-block mode? I use the GptManager interface with a 4k input and 1k output, and no speedup is observed.