NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Unable to build Llama-2-13B-Chat on RTX 4070Ti #1045

Open kaalen opened 9 months ago

kaalen commented 9 months ago

System Info

Package versions:

Who can help?

No response

Information

Tasks

Reproduction

  1. Download Llama-2-13b-chat-hf model from https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

  2. Download AWQ weights for building the TensorRT engine model.pt from https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.2

  3. Initiate build of the model using a single GPU: python build.py --model_dir .\tmp --quant_ckpt_path .\tmp\model.pt --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir .\tmp\out

Expected behavior

Build a TRT engine for an RTX 4070 Ti

Actual behavior

Build terminates after reporting a memory allocation issue: Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

Output:

[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/lm_head/CONSTANT_2_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Missing scale and zero-point for tensor logits, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[02/04/2024-20:42:02] [TRT] [W] Detected layernorm nodes in FP16.
[02/04/2024-20:42:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[02/04/2024-20:42:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 29820, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +14, GPU +0, now: CPU 29839, GPU 12281 (MiB)
[02/04/2024-20:42:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.8.1
[02/04/2024-20:42:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[02/04/2024-20:42:02] [TRT] [E] 1: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::127] Error Code 1: Cuda Driver (invalid argument)
[02/04/2024-20:42:02] [TRT] [W] Requested amount of GPU memory (1024 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.

Additional notes

I have a somewhat limited understanding of what I'm doing here. I'm trying to run the developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM. Following these instructions: https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0?tab=readme-ov-file#building-trt-engine

I'm not sure whether I'm pushing the limits of my hardware here or whether there are parameters I can tweak to process this in smaller chunks and avoid the memory issue. I tried playing with different --max_input_len and --max_output_len values, reducing them down to 512, but that doesn't seem to make any difference.

AreDubya commented 9 months ago

I have the same issue, with a 4090, 13900k, Win11, 32GB RAM.

I don't think the RAG application is workable at the moment.

Pip Error

PS C:\Users\rw\inference\TensorRT> pip install "tensorrt_llm==0.5.0" --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com, https://download.pytorch.org/whl/cu121
Collecting tensorrt_llm==0.5.0
  Using cached https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.5.0-0-cp310-cp310-win_amd64.whl (431.5 MB)
Collecting build (from tensorrt_llm==0.5.0)
  Using cached build-1.0.3-py3-none-any.whl.metadata (4.2 kB)
INFO: pip is looking at multiple versions of tensorrt-llm to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch==2.1.0.dev20230828+cu121 (from tensorrt-llm) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.0+cu121, 2.1.1, 2.1.1+cu121, 2.1.2, 2.1.2+cu121, 2.2.0, 2.2.0+cu121)
ERROR: No matching distribution found for torch==2.1.0.dev20230828+cu121

CMake Error

CMake Error at tensorrt_llm/plugins/CMakeLists.txt:106 (set_target_properties):
  set_target_properties called with incorrect number of arguments.

So far, none of the examples in the Get Started blog post or the next steps listed in the windows/README.md are usable. The inclusion of examples/llama as a showcase seems fairly short-sighted: to generate the required quantized weights file you apparently need Triton, judging by the errors when running the recommended GPTQ weight quantization.

I'd like to see this fixed on general principle, but the amount of time it takes to discover that even the documented workflows don't work really makes me question the value in this context.

hanikhatib commented 9 months ago

Good to know I'm not the only one running into this exact same issue.

bormanst commented 8 months ago

I had the same issue trying to build Code Llama 13b using the provided model.pt file and, after many hours, discovered that I just hadn't read carefully enough. Upon further scrutiny, it works as advertised when you use the .npz file like it says:

from https://github.com/NVIDIA/trt-llm-as-openai-windows

... --quant_ckpt_path <path to CodeLlama .npz file> ...

Instead of model.pt, use the .npz and .json files from https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.4 (or version=1.3).

For Code Llama 13b: I downloaded them separately instead of as a zipped package. Not that it should matter, but I was having the memory issue and many comments suggested corrupted files as the cause; that turned out not to be it.
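If you want to rule out a corrupted download yourself, a quick sanity check (the path is just my layout, adjust to yours) is to open the archive and count the tensors; a truncated .npz will normally fail to load at all:

python -c "import numpy as np; ckpt = np.load(r'E:\CodeLlamaCheckPoint\llama_tp1_rank0.npz'); print(len(ckpt.files), 'tensors')"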

Of course, adjust accordingly for Llama-2-13b-chat, but this worked for Code Llama 13b (note that --quant_ckpt_path points to the .npz file, not a directory):

python build.py --model_dir E:\CodeLlamaInstruct\CodeLlama-13b-Instruct-hf --quant_ckpt_path E:\CodeLlamaCheckPoint\llama_tp1_rank0.npz --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --max_batch_size 1 --max_input_len 15360 --max_output_len 1024 --output_dir E:\CodeLlamaEngine --rotary_base 1000000 --vocab_size 32064

jonny2027 commented 8 months ago

@bormanst I tried your solution but I get the following errors. What version of TensorRT/TensorRT-LLM did you build was it 0.5 or higher? Did you manage to build the engine?

[03/06/2024-11:45:50] [TRT-LLM] [I] Loading weights from groupwise AWQ LLaMA safetensors...
Traceback (most recent call last):
  File "C:\Development\llm-models\trt\TensorRT-LLM\examples\llama\build.py", line 718, in <module>
    build(0, args)
  File "C:\Development\llm-models\trt\TensorRT-LLM\examples\llama\build.py", line 689, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "C:\Development\llm-models\trt\TensorRT-LLM\examples\llama\build.py", line 543, in build_rank_engine
    load_func(tensorrt_llm_llama=tensorrt_llm_llama,
  File "C:\Development\llm-models\trt\TensorRT-LLM\examples\llama\weight.py", line 1063, in load_from_awq_llama
    assert False, "Quantized checkpoint format not supported!"
AssertionError: Quantized checkpoint format not supported!

bormanst commented 8 months ago

@jonny2027

I'm pretty sure it was TensorRT-LLM 0.7.0; 0.7.1 was causing issues.

I'm using an RTX 4070, but I ran across a matrix on NVIDIA's website (can't find it now for some reason) that showed whether the GPU architecture supported int4_awq (all Ada cards did). So hopefully that is not the issue; I'll assume not.

It's pretty much a mad rush right now and packages are not stable, which causes plenty of confusion. A default git clone pulls the 'main' branch, which is the up-to-date experimental branch. I used 'git clone -b rel' to get the release version, which was 0.7.0 at the time; I just noticed that the 'rel' branch changed to 0.8.0, so...

I was just messing around trying to build on win10 so I used the provided https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.0/windows#quick-start to build my TensorRT-LLM.

Then proceeded to follow https://github.com/NVIDIA/trt-llm-as-openai-windows 'very closely' and it worked for CodeLlama-13b-Instruct-hf and Llama-2-13b-chat-hf.

If I remember correctly, I kept getting that error until I explicitly set --quant_ckpt_path to point to the actual file llama_tp1_rank0.npz and not its parent directory or the base dir of the model.pt.

Sorry, I can't be more helpful, but it was a blur of 'pip package version hell' that I wasn't expecting; should have conda'd or docker'd it to start.

Been using Win10 because it's easier to develop UE5, but I just activated Ubuntu on WSL2 and built a local tritonserver Docker image as well as tensorrt and tensorrt-main Docker images; I plan on building one of the StarCoder2 LLMs using the new Ubuntu setup today. The NVIDIA GPU passthrough tech works as advertised on Win10, so far. Can't wait to build another Linux box though, as AI appears to be a Linux-based frontier.

Update: Got another successful build of CodeLlama-13b-Instruct-hf (I know this thread is for the Chat version, but...), this time using the convoluted Win10>>>WSL2>>>Ubuntu>>>Docker>>>TensorRT-LLM process.

Ran into issues trying to build with anything other than v0.7.0 and got bitten by pip again there too. Modify the two instructions at https://github.com/NVIDIA/trt-llm-as-openai-windows/blob/release/1.0/README.md as follows:

1) tensorrt_llm: pip install tensorrt_llm==v0.7.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121

2) TensorRT-LLM.git: git clone --depth 1 --branch v0.7.0 https://github.com/NVIDIA/TensorRT-LLM.git

Note: the two versions just happened to be synced at 'v0.7.0' but I'm not sure if that is coincidence or not.

I'm sure it's obvious to some, but it was torturous for the uninitiated like me: the TensorRT-LLM checkout and the tensorrt_llm wheel have to be kept in sync (nothing in the error messages pointed to this). Try to touch the hot burner only once, though...
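A quick way to compare the two (just a sketch, assuming the wheel exposes __version__ and you cloned into a TensorRT-LLM directory with tags fetched):

python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
git -C TensorRT-LLM describe --tags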

Update 2: Had the chance, so I tried and succeeded at building the Llama-2-13b-chat-hf engine using the https://catalog.ngc.nvidia.com/orgs/nvidia/models/llama2-13b/files?version=1.4 checkpoint and TensorRT-LLM/tensorrt_llm v0.7.0, again using the convoluted building process. Don't forget to rename the accompanying NVIDIA checkpoint JSON 'llama_tp1.json' to 'config.json'.

python3 build.py --model_dir /model-build/Llama-2-13b-chat-hf --quant_ckpt_path /model-build/Llama-2-13b-chat-hf/trt_ckpt/int4_awq/1-gpu/llama_tp1_rank0.npz --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3500 --max_output_len 1024 --output_dir /model-build/Llama-2-13b-chat-hf/trt_engine/int4_awq/1-gpu
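For reference, the checkpoint directory implied by that command ends up looking roughly like this (layout inferred from my paths, not from any official doc):

/model-build/Llama-2-13b-chat-hf/trt_ckpt/int4_awq/1-gpu/
    config.json (renamed from llama_tp1.json)
    llama_tp1_rank0.npz (downloaded from the NGC llama2-13b page)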

FWIW

bormanst commented 8 months ago

Just an FYI that should be obvious but wasn't:

I'm sure it's in the docs somewhere, but there is ZERO chance of converting these large 13B model checkpoints on a 4070; you have to wait for NVIDIA to post them for lower-VRAM builds. The checkpoints posted by NVIDIA are about 25GB, so...

The convert_checkpoint.py script always runs out of memory on my 4070 (watch nvidia-smi while it runs). Maybe the new SuperTi 16GB cards will have better luck, but I wouldn't hold my breath.
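If you want to watch the ceiling being hit during conversion, polling nvidia-smi once a second makes it obvious (standard flags, nothing exotic):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1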

jonny2027 commented 8 months ago

Scratch that. I had the wrong path for the model directory and I am back to getting assert False, "Quantized checkpoint format not supported!"

jonny2027 commented 8 months ago

I think I will have to try to get 0.7.0 to compile again, but I am getting linker issues because the batch manager library doesn't exist in that branch and I had to use the one from 0.5.0. Did you have this problem @bormanst, or did you compile on WSL2?

cmd.exe /C "cmd.exe /C ""C:\Program Files\CMake\bin\cmake.exe" -E __create_def C:\Development\llm-models\trt\TensorRT-LLM-0.7.1\cpp\build\tensorrt_llm\CMakeFiles\tensorrt_llm.dir\.\exports.def C:\Development\llm-models\trt\TensorRT-LLM-0.7.1\cpp\build\tensorrt_llm\CMakeFiles\tensorrt_llm.dir\.\exports.def.objs && cd C:\Development\llm-models\trt\TensorRT-LLM-0.7.1\cpp\build" && "C:\Program Files\CMake\bin\cmake.exe" -E vs_link_dll --intdir=tensorrt_llm\CMakeFiles\tensorrt_llm.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100226~1.0\x64\mt.exe --manifests -- C:\PROGRA~1\MICROS~1\2022\PROFES~1\VC\Tools\MSVC\1438~1.331\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\tensorrt_llm.rsp /out:tensorrt_llm\tensorrt_llm.dll /implib:tensorrt_llm\tensorrt_llm.lib /pdb:tensorrt_llm\tensorrt_llm.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /DEF:tensorrt_llm\CMakeFiles\tensorrt_llm.dir\.\exports.def && cd ."
LINK: command "C:\PROGRA~1\MICROS~1\2022\PROFES~1\VC\Tools\MSVC\1438~1.331\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\tensorrt_llm.rsp /out:tensorrt_llm\tensorrt_llm.dll /implib:tensorrt_llm\tensorrt_llm.lib /pdb:tensorrt_llm\tensorrt_llm.pdb /dll /version:0.0 /machine:x64 /INCREMENTAL:NO /DEF:tensorrt_llm\CMakeFiles\tensorrt_llm.dir\.\exports.def /MANIFEST:EMBED,ID=2" failed (exit code 1120) with the following output:
tensorrt_llm_batch_manager_static.lib(trtGptModelInflightBatching.obj) : error LNK2005: "public: static class tensorrt_llm::common::Logger * __cdecl tensorrt_llm::common::Logger::getLogger(void)" (?getLogger@Logger@common@tensorrt_llm@@SAPEAV123@XZ) already defined in logger.cpp.obj
tensorrt_llm_batch_manager_static.lib(microBatchScheduler.obj) : error LNK2005: "public: static class tensorrt_llm::common::Logger * __cdecl tensorrt_llm::common::Logger::getLogger(void)" (?getLogger@Logger@common@tensorrt_llm@@SAPEAV123@XZ) already defined in logger.cpp.obj
tensorrt_llm_batch_manager_static.lib(batchScheduler.obj) : error LNK2005: "public: static class tensorrt_llm::common::Logger * __cdecl tensorrt_llm::common::Logger::getLogger(void)" (?getLogger@Logger@common@tensorrt_llm@@SAPEAV123@XZ) already defined in logger.cpp.obj

bormanst commented 8 months ago

@jonny2027

It looks like the headers are not synced correctly with 0.7.1; there's some serious refactoring happening with TensorRT at the moment so this should be expected outside the 'rel' branches.

I couldn't get 0.7.1 to work, just 0.7.0. It's worth noting that the 0.7.0 git checkout doesn't come with the tensorrt_llm libs that it requires, so by default it errors out; I had to manually copy (symlinks didn't work) the libs directory from the pip install location of the tensorrt_llm package. Depending on your setup, it should be located in the Python site-packages dir under tensorrt_llm. Just manually copy the libs dir into the TensorRT-LLM/tensorrt_llm dir and it should resolve that error.
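A rough sketch of that copy step on Win10 (paths are placeholders for wherever pip and your clone actually live): the first command prints where the wheel got installed, the second copies its libs folder into the checkout.

python -c "import tensorrt_llm, os; print(os.path.dirname(tensorrt_llm.__file__))"
xcopy "<site-packages>\tensorrt_llm\libs" "C:\path\to\TensorRT-LLM\tensorrt_llm\libs" /E /I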

I did successfully build blip-2, whisper, CodeLlama-13b-Instruct-hf, and Llama-2-13b-chat-hf with v0.7.0 using the Win10 build process. I built Santacoder, CodeLlama-13b-Instruct-hf and Llama-2-13b-chat-hf using the WSL2/Ubuntu process yesterday with v0.7.0, after trying all the different combinations.

It just seemed futile to keep having to compile custom Win10 pip packages to combat the frenetic pace of development going on. The Win10>>>WSL2>>>Ubuntu>>>Docker>>>TensorRT-LLM process sounds difficult but it's not; the only real downside is that the Docker images take more space than a monolithic solution. The upside is that the newest, shiniest, smallest models can be swapped in/out of the stack without much effort; as you can see, all the models are improving monthly, so I figured it would be best to be more flexible at the expense of storage space.

This guy had the goods for setting up Win10>>>WSL2>>>Ubuntu>>>Docker>>>TensorFlow, and the TensorFlow build worked. The only tricky part is making sure the mappings are correct for proper piping between the environments, i.e. there are plenty of examples of how to map a Win10 folder to a directory in WSL2/Ubuntu, as well as how to map WSL2/Ubuntu dirs to Docker dirs. Most of the commands, like running the Docker images, allow mappings to be passed on the command line. Any other mappings are usually 'one-and-done' types, like repo locations, that you only have to set up as needed, so not that big a deal. I only had to use the command-line mappings and didn't need to map anything special to get the build process working.
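As an example of those command-line mappings (image name and paths are placeholders, not something from this thread): Win10 drives show up under /mnt inside WSL2, and a WSL2 directory can be bind-mounted into the container when you start it.

docker run --gpus all -it -v /mnt/c/model-build:/model-build <your tensorrt-llm image>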

The TensorRT-LLM build process looks like it is currently being refactored (v0.9.0-) into a single build method instead of all the fractured per-model methods. At their current pace, that should be finished in a month and will make it much easier to build.

I did manage to get Starcoder2-3B almost built (WSL2/Ubuntu) using the latest 'main' branch but ran out of VRAM just as it was copying the model out; there were errors in downstream pip packages that had to be resolved, and most of the model builds have not been ported to the new process yet.

jonny2027 commented 8 months ago

@bormanst Thanks for all the info. I’ll work with 0.7.0 on windows and keep trying. Did you build with cpp bindings? I’m guessing that’s why I am getting the issue with the BatchManager.

Did you have an X86/X64 version of the BatchManager?

bormanst commented 8 months ago

@jonny2027

I haven't done anything special regarding the Batch Manager, so I'm not sure if you included that separately or if it's just part of the base build like I would assume. Anyway, I don't recall seeing any mention of 'Batch Manager' errors scrolling by. If you're building everything from source and choosing the Batch Manager as an option, then I would try without it first. C++ has a couple of implementation models that can cause undesirable side effects for developers, especially when it comes to headers and inheritance; that is what your errors look like. There are many cooks in the kitchen and no standards yet, so I would expect this stuff to continue for some time.

From https://github.com/NVIDIA/TensorRT-LLM/tree/v0.7.0/windows#quick-start :

If you are using the pre-built TensorRT-LLM release wheel (recommended unless you need to directly invoke the C++ runtime), skip to [Installation](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/windows#installation). If you are building your own wheel from source, proceed to [Building from Source](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.6.1/windows#building-from-source).

I used the prebuilt wheel as recommended (just be wary of wheel version being 'synced' with the TensorRT-LLM build as mentioned earlier):

pip install tensorrt_llm==v0.7.0 --extra-index-url https://pypi.nvidia.com/ --extra-index-url https://download.pytorch.org/whl/cu121

I already had the Visual Studio C++ dev stack installed on my system for UE5, so I skipped that part of the process; it's possible they are missing a step to install a particular C++ component, but I would rate that as a small possibility. I also used the prebuilt wheel for the WSL2/Ubuntu builds without issue.

jonny2027 commented 8 months ago

Ah ok. That makes sense that you are using the prebuilt wheel. I was trying to compile the wheel from source. I will give it a go tomorrow with the prebuilt wheel.

Thanks

bormanst commented 8 months ago

The pain of ignorance: the 'sync' issue I was having between the tensorrt_llm wheel and TensorRT-LLM was a direct result of manually copying the libs into the TensorRT-LLM/tensorrt_llm directory and then forgetting to do it again after trying a different wheel version.

They are refactoring the build process now which will render all this moot, so it's just an FYI.

I know v0.7.0 works, but I'm not sure if it was just me messing up the others.

bormanst commented 7 months ago

@maazaahmed

They are refactoring the build process to use a unified trtllm-build command instead of the segmented build.py scripts; this does not affect the v0.7.0 branch, which still uses the build.py method. The build.py scripts referenced stuff in the local TensorRT-LLM checkout while also using the tensorrt_llm pip module, which can lead to 'sync' issues between the two environments. The trtllm-build command appears to use just the tensorrt_llm pip module, since you no longer need the TensorRT-LLM directory to do a build. I saw a note somewhere that the trtllm-build script only works for a few models right now (don't remember which) until they finish converting the others to the new process.
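For anyone curious, this is a rough sketch of the new two-step flow as I understand it (script location and flags vary by version and model, so treat it as illustrative only):

python convert_checkpoint.py --model_dir ./Llama-2-13b-chat-hf --output_dir ./trt_ckpt --dtype float16
trtllm-build --checkpoint_dir ./trt_ckpt --output_dir ./trt_engine --gemm_plugin float16 --max_batch_size 1 --max_input_len 3000 --max_output_len 1024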

This is from three days ago: Update TensorRT-LLM #1233

[BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
Support in LLM() API to accept engines built by trtllm-build command

Try using the v0.7.0 links above; they worked on both Win10 and Ubuntu. Note the 'version' in the URLs for the instructions, as those vary too, and I would use the prebuilt wheel as recommended.

bormanst commented 7 months ago

Using Win10/WSL2/Ubuntu, I managed to get v0.8.0 to build Llama-2-13b-chat-hf using the new trtllm-build process and the new NVIDIA checkpoint format v1.5.

There are some obvious missing steps in the install docs, like needing to 'apt-get install git-lfs', but it is the easiest process so far and produced a viable build on the first run. The new checkpoints are already in safetensors format and much, much smaller.
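The missing git-lfs step boils down to something like this (the Hugging Face repo is the one from the original post and requires accepting the Llama 2 license first):

sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-13b-chat-hf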

Note: Not all models have been incorporated into the new trtllm-build process, nor does NVIDIA have the new checkpoint formats for all models. Don't expect to convert the checkpoints of larger models on 12GB of VRAM; waiting for NVIDIA's pre-converted checkpoints is probably the only solution, since they are custom-built for each specific configuration. The good news, though, is that NVIDIA has been pretty quick with release updates.