NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Unable to build engine for llama on windows 10 #1046

Open teis-e opened 5 months ago

teis-e commented 5 months ago

System Info

Who can help?

@Tracin @juney-nvidia

Information

Tasks

Reproduction

First of all, I was trying to run trt-llm-rag-windows with the pre-built .engine for the RTX 4090, but this led me to this error:

(venv) PS C:\Users\TeisWin10\clear\trt-llm-rag-windows> python3 test.py
[02/04/2024-20:03:18] [TRT] [E] 1: [stdArchiveReader.cpp::nvinfer1::rt::StdArchiveReader::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 228, Serialized Engine Version: 226)
Traceback (most recent call last):
  File "C:\Users\TeisWin10\clear\trt-llm-rag-windows\test.py", line 16, in <module>
    llm = LocalTensorRTLLM(
  File "C:\Users\TeisWin10\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\llms\nvidia_tensorrt.py", line 173, in __init__
    decoder = tensorrt_llm.runtime.GenerationSession(
  File "C:\Users\TeisWin10\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorrt_llm\runtime\generation.py", line 457, in __init__
    self.runtime = _Runtime(engine_buffer, mapping)
  File "C:\Users\TeisWin10\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorrt_llm\runtime\generation.py", line 150, in __init__
    self.__prepare(mapping, engine_buffer)
  File "C:\Users\TeisWin10\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorrt_llm\runtime\generation.py", line 168, in __prepare
    assert self.engine is not None
AssertionError
Exception ignored in: <function _Runtime.__del__ at 0x0000028D2AC1C280>
Traceback (most recent call last):
  File "C:\Users\TeisWin10\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorrt_llm\runtime\generation.py", line 266, in __del__
    cudart.cudaFree(self.address)  # FIXME: cudaFree is None??
AttributeError: '_Runtime' object has no attribute 'address'

It has a mismatched engine version. I guess this is because the pre-built engine was built some time ago and needs to be rebuilt.
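To double-check the mismatch, you can print the TensorRT version your venv actually loads (a quick diagnostic added here for illustration; it is not from the original logs):

python -c "import tensorrt as trt; print(trt.__version__)"

An engine can only be deserialized by the same TensorRT version that serialized it, so a pre-built engine goes stale as soon as the runtime is upgraded.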

Then I tried to build it myself, and this happened:

(venv) PS C:\Users\TeisWin10\TensorRT-LLM\examples\llama> python build.py --model_dir C:\Users\TeisWin10\clear\trt-llm-rag-windows\meta-llama\Llama-2-13b-chat --quant_ckpt_path models\ --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir models\Engine
Traceback (most recent call last):
  File "C:\Users\TeisWin10\TensorRT-LLM\examples\llama\build.py", line 22, in <module>
    import torch
  File "C:\Users\TeisWin10\clear\trt-llm-rag-windows\venv\lib\site-packages\torch\__init__.py", line 128, in <module>
    raise err
OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\Users\TeisWin10\clear\trt-llm-rag-windows\venv\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll" or one of its dependencies.
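Note that the failure happens on the bare import, before any TensorRT-LLM code runs, so the problem is the torch/cuDNN installation itself rather than build.py. A minimal check (illustrative, not from the original logs):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"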

Expected behavior

A working pre-built engine for the RTX 4090

Or

A working build script for building the engine

actual behavior

Both fail with an error.

additional notes

I installed all the necessary packages: cuDNN, MPI, CUDA Toolkit.

There seemed to be a fix for: [WinError 127] The specified procedure could not be found. Error loading "C:\Users\TeisWin10\clear\trt-llm-rag-windows\venv\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll" or one of its dependencies.

It is described here, but it is not working for me.

MustaphaU commented 5 months ago

@teis-e have you tried building using the provided Docker image? Building on bare Windows never worked for me. Check out the instructions on the rel branch to build with Docker on Windows.

teis-e commented 5 months ago

No I haven't yet. I will try that.

Will that give me the option to build the TRT engines as well?

MustaphaU commented 5 months ago

> No I haven't yet. I will try that.
>
> Will that give me the option to build the TRT engines as well?

Once you've built and installed the wheel, I believe you have to navigate to the examples\llama directory and run the script to build the engine:

python build.py --model_dir <path to llama13_chat model> --quant_ckpt_path <path to model.pt> --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir <TRT engine folder>
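Your earlier command already matches this shape, so once the torch issue is fixed it should be reusable roughly as-is (paths below are copied from your first post and may need adjusting):

python build.py --model_dir C:\Users\TeisWin10\clear\trt-llm-rag-windows\meta-llama\Llama-2-13b-chat --quant_ckpt_path models\ --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir models\Engine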
teis-e commented 5 months ago

Ah yes, exactly. And this is all done in the Docker container, instead of in PowerShell?

MustaphaU commented 5 months ago

You should build the wheel in the container, then install the built wheel on your PC. First, make sure to mount a folder from your PC in the container. Once the build is complete, copy the generated *.whl file to the mounted folder. This ensures the file is available for installation on your PC. You may exit the docker workspace afterwards. Detailed instructions are in the Docker Build Instructions section here: https://github.com/NVIDIA/TensorRT-LLM/blob/rel/windows/README.md
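A rough sketch of that flow (the mount path, container tag, and wheel filename below are illustrative, not from the README):

# Host: start the build container with a host folder mounted into it
docker run -it --rm -v C:\Users\TeisWin10\dockershare:C:\workspace tensorrt-llm-windows-build:latest

# Inside the container, once the wheel build finishes, copy the wheel to the mounted folder
copy tensorrt_llm-*.whl C:\workspace\

# Back on the host, install the wheel into your venv
pip install C:\Users\TeisWin10\dockershare\tensorrt_llm-<version>-cp310-cp310-win_amd64.whl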

teis-e commented 5 months ago

I understand now.

However, running: docker build --no-cache -t tensorrt-llm-windows-build:latest .

hangs here, and has been for 20 minutes already. Is this normal?:

Sending build context to Docker daemon   1.06MB
Step 1/24 : FROM mcr.microsoft.com/windows/servercore:ltsc2019
 ---> 5f9e1fdbbeba
Step 2/24 : SHELL ["cmd", "/S", "/C"]
 ---> Running in 554be3803cd8
 ---> Removed intermediate container 554be3803cd8
 ---> de26761d1c87
Step 3/24 : RUN powershell -Command     $ErrorActionPreference = 'Stop';     Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_537.13_windows.exe     -OutFile "cuda_installer.exe";     Start-Process cuda_installer.exe -Wait -ArgumentList '-s';     Remove-Item cuda_installer.exe -Force
 ---> Running in b4c22a8124c9
MustaphaU commented 5 months ago

> I understand now.
>
> However, running: docker build --no-cache -t tensorrt-llm-windows-build:latest .
>
> hangs here, and has been for 20 minutes already. Is this normal?:
>
> Sending build context to Docker daemon   1.06MB
> Step 1/24 : FROM mcr.microsoft.com/windows/servercore:ltsc2019
>  ---> 5f9e1fdbbeba
> Step 2/24 : SHELL ["cmd", "/S", "/C"]
>  ---> Running in 554be3803cd8
>  ---> Removed intermediate container 554be3803cd8
>  ---> de26761d1c87
> Step 3/24 : RUN powershell -Command     $ErrorActionPreference = 'Stop';     Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_537.13_windows.exe     -OutFile "cuda_installer.exe";     Start-Process cuda_installer.exe -Wait -ArgumentList '-s';     Remove-Item cuda_installer.exe -Force
>  ---> Running in b4c22a8124c9

Yes, please be patient.
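To check whether that step is actually doing work rather than being stalled (my suggestion, not from the original thread), you can watch the intermediate build container's resource usage from another terminal. The CUDA 12.2 local installer alone is roughly 3 GB, so this step downloads for a long time with little console output:

docker ps
docker stats b4c22a8124c9

(b4c22a8124c9 is the intermediate container id shown in your build output above.)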

teis-e commented 5 months ago

Are you sure? It has already been running for over an hour.

docker build -t tensorrt-llm-windows-build:latest .

Sending build context to Docker daemon   1.06MB
Step 1/24 : FROM mcr.microsoft.com/windows/servercore:ltsc2019
ltsc2019: Pulling from windows/servercore
cb524f6f2215: Pull complete
1581446d2913: Pull complete
Digest: sha256:097949cfe0247fde3f8457a4d68fffee63a2385fb83e3be4f5d0dd9a46e9a3c3
Status: Downloaded newer image for mcr.microsoft.com/windows/servercore:ltsc2019
 ---> 5f9e1fdbbeba
Step 2/24 : SHELL ["cmd", "/S", "/C"]
 ---> Running in 30d77bac1fd3
 ---> Removed intermediate container 30d77bac1fd3
 ---> d81bdbaae75f
Step 3/24 : RUN powershell -Command     $ErrorActionPreference = 'Stop';     Invoke-WebRequest -Uri https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_537.13_windows.exe     -OutFile "cuda_installer.exe";     Start-Process cuda_installer.exe -Wait -ArgumentList '-s';     Remove-Item cuda_installer.exe -Force
 ---> Running in a6a03caf1620

I have a steady gigabit download speed.

There is almost zero activity in the terminal, just some occasional Ethernet receive spikes of around 20 MB every minute or so.

Would building on Ubuntu be an option instead? Does that work with TensorRT-LLM?

MustaphaU commented 5 months ago

On Windows (Docker): following the instructions here, I didn't encounter any issues building the wheel. Make sure to use the rel branch: git clone --branch rel https://github.com/NVIDIA/TensorRT-LLM.git. Let me know if you run into any issues; I can make a short recording and share.
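For reference, the sequence looked roughly like this on my side (exact directory for the Windows Dockerfile is from memory; verify against the README linked earlier):

git clone --branch rel https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
# change into the directory containing the Windows Dockerfile (see the README for the exact path)
docker build --no-cache -t tensorrt-llm-windows-build:latest .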

[screenshot: the wheel build completing without errors]

teis-e commented 5 months ago

I feel like I've tried everything. Now even on Ubuntu I can't get the llama model to work.

Could you accept my request on LinkedIn so you can help me?

jonny2027 commented 4 months ago

Did you ever get this issue resolved? I am getting the same error

teis-e commented 4 months ago

No, I didn't.