NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

When run llama2, Caught signal 11 (Segmentation fault) #752

Open namang301 opened 11 months ago

namang301 commented 11 months ago

Hi, I tried to test llama2 with TensorRT-LLM.

My environment (based on "nvcr.io-nvidia-tritionserver-23.10-trtllm-python-py3"):

cuda 12.2
gpu A100 40G (1)
python 3.10.12
ubuntu 22.04.3
tensorrt 9.2.0.post12.dev5
tensorrt-llm 0.7.0
triton 2.1.0

I followed examples/llama/README to run llama2.

The build succeeded when I ran:

$ python build.py --model_dir meta-llama/llama-2-7b-hf \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir ./tmp/llama/out/

When I then tried to run the engine:

$ python3 ../run.py --max_output_len=50 \
    --tokenizer_dir meta-llama/llama-2-7b-hf \
    --engine_dir=./tmp/llama/out/

I got this error message:

[Instance-1891:140515:0:140515] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x54000009)
==== backtrace (tid: 140515) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000006ce20 PMPI_Comm_set_errhandler() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:81
2 0x000000000006ce20 opal_atomic_add_fetch_32() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/include/opal/sys/atomic_impl.h:384
3 0x000000000006ce20 opal_thread_add_fetch_32() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/threads/thread_usage.h:152
4 0x000000000006ce20 opal_obj_update() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
5 0x000000000006ce20 PMPI_Comm_set_errhandler() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:70
6 0x00000000000a728f __pyx_f_6mpi4py_3MPI_comm_set_eh() /tmp/pip-install-8x8e5fta/mpi4py_ff745ee2b9414fd99054a30ef67df184/src/mpi4py.MPI.c:40330
7 0x00000000000a728f __pyx_f_6mpi4py_3MPI_initialize() /tmp/pip-install-8x8e5fta/mpi4py_ff745ee2b9414fd99054a30ef67df184/src/mpi4py.MPI.c:8406
8 0x0000000000047e7c __pyx_f_6mpi4py_3MPI_initialize() /tmp/pip-install-8x8e5fta/mpi4py_ff745ee2b9414fd99054a30ef67df184/src/mpi4py.MPI.c:8394
9 0x000000000023b2d3 PyModule_ExecDef() ???:0
10 0x000000000023bda0 PyInit__thread() ???:0
11 0x000000000015f854 PyObject_GenericGetAttr() ???:0
12 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
13 0x000000000016070c _PyFunction_Vectorcall() ???:0
14 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
15 0x000000000016070c _PyFunction_Vectorcall() ???:0
16 0x0000000000148f52 _PyEval_EvalFrameDefault() ???:0
17 0x000000000016070c _PyFunction_Vectorcall() ???:0
18 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
19 0x000000000016070c _PyFunction_Vectorcall() ???:0
20 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
21 0x000000000016070c _PyFunction_Vectorcall() ???:0
22 0x000000000015fb24 PyObject_CallFunctionObjArgs() ???:0
23 0x000000000023f4af _PyObject_CallMethodIdObjArgs() ???:0
24 0x00000000001740ca PyImport_ImportModuleLevelObject() ???:0
25 0x0000000000184458 PyImport_Import() ???:0
26 0x000000000015fe0e PyObject_CallFunctionObjArgs() ???:0
27 0x000000000016f12b PyObject_Call() ???:0
28 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
29 0x000000000016070c _PyFunction_Vectorcall() ???:0
30 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
31 0x000000000016070c _PyFunction_Vectorcall() ???:0
32 0x000000000015fb24 PyObject_CallFunctionObjArgs() ???:0
33 0x000000000023f4af _PyObject_CallMethodIdObjArgs() ???:0
34 0x0000000000174cda PyImport_ImportModuleLevelObject() ???:0
35 0x0000000000151216 _PyEval_EvalFrameDefault() ???:0
36 0x000000000016070c _PyFunction_Vectorcall() ???:0
37 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
38 0x000000000016070c _PyFunction_Vectorcall() ???:0
39 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
40 0x000000000016070c _PyFunction_Vectorcall() ???:0
41 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
42 0x0000000000239e56 PyEval_EvalCode() ???:0
43 0x0000000000239cf6 PyEval_EvalCode() ???:0
44 0x00000000002647d8 PyUnicode_Tailmatch() ???:0
45 0x000000000025e0bb PyInit__collections() ???:0
46 0x0000000000264525 PyUnicode_Tailmatch() ???:0
47 0x0000000000263a08 _PyRun_SimpleFileObject() ???:0
48 0x0000000000263653 _PyRun_AnyFileObject() ???:0
49 0x000000000025641e Py_RunMain() ???:0
50 0x000000000022ccad Py_BytesMain() ???:0
51 0x0000000000029d90 __libc_init_first() ???:0
52 0x0000000000029e40 __libc_start_main() ???:0
53 0x000000000022cba5 _start() ???:0

python3:140515 terminated with signal 11 at PC=7f887654de20 SP=7ffdd00247f0. Backtrace:
/usr/local/mpi/lib/libmpi.so.40(PMPI_Comm_set_errhandler+0xb0)[0x7f887654de20]
/usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0xa728f)[0x7f8875d4728f]
/usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x47e7c)[0x7f8875ce7e7c]
python3(PyModule_ExecDef+0x73)[0x55704666c2d3]
python3(+0x23bda0)[0x55704666cda0]
python3(+0x15f854)[0x557046590854]
python3(_PyEval_EvalFrameDefault+0x2b71)[0x55704657c2c1]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6152)[0x55704657f8a2]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x802)[0x557046579f52]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6bd)[0x557046579e0d]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6bd)[0x557046579e0d]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(+0x15fb24)[0x557046590b24]
python3(_PyObject_CallMethodIdObjArgs+0xff)[0x5570466704af]
python3(PyImport_ImportModuleLevelObject+0x25a)[0x5570465a50ca]
python3(+0x184458)[0x5570465b5458]
python3(+0x15fe0e)[0x557046590e0e]
python3(PyObject_Call+0xbb)[0x5570465a012b]
python3(_PyEval_EvalFrameDefault+0x2b71)[0x55704657c2c1]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6bd)[0x557046579e0d]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(+0x15fb24)[0x557046590b24]
python3(_PyObject_CallMethodIdObjArgs+0xff)[0x5570466704af]
python3(PyImport_ImportModuleLevelObject+0xe6a)[0x5570465a5cda]
python3(_PyEval_EvalFrameDefault+0x8ac6)[0x557046582216]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6bd)[0x557046579e0d]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6152)[0x55704657f8a2]
python3(_PyFunction_Vectorcall+0x7c)[0x55704659170c]
python3(_PyEval_EvalFrameDefault+0x6bd)[0x557046579e0d]
python3(+0x239e56)[0x55704666ae56]
python3(PyEval_EvalCode+0x86)[0x55704666acf6]
python3(+0x2647d8)[0x5570466957d8]
python3(+0x25e0bb)[0x55704668f0bb]
python3(+0x264525)[0x557046695525]
python3(_PyRun_SimpleFileObject+0x1a8)[0x557046694a08]
python3(_PyRun_AnyFileObject+0x43)[0x557046694653]
python3(Py_RunMain+0x2be)[0x55704668741e]
python3(Py_BytesMain+0x2d)[0x55704665dcad]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8aa41e1d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8aa41e1e40]
python3(_start+0x25)[0x55704665dba5]

I hit the same issue when testing the GPT-NeoX example (using Polyglot models): build.py works, but run.py fails with the same error message.

I suspect MPI is the cause.

Can anybody help me solve this problem?
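
Since the backtrace above ends inside mpi4py's MPI initialization, one way to narrow this down is to import the suspect modules in a child interpreter, so that a crash does not take down the main process. The following is a minimal sketch using only the Python standard library; the helper name probe_import is hypothetical and not part of TensorRT-LLM or mpi4py.

```python
import subprocess
import sys


def probe_import(module_name: str) -> int:
    """Import `module_name` in a child interpreter and return its exit code.

    On POSIX, a negative value -N means the child was killed by signal N,
    so -11 corresponds to SIGSEGV.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
    )
    return result.returncode


if __name__ == "__main__":
    # Importing the top-level package does not initialize MPI;
    # importing mpi4py.MPI does, which is where the backtrace points.
    for name in ("mpi4py", "mpi4py.MPI"):
        code = probe_import(name)
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(f"{name}: {status}")
```

If mpi4py imports cleanly but mpi4py.MPI fails with a negative exit code, that would be consistent with mpi4py having been built against a different MPI library than the one loaded at runtime.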

jdemouth-nvidia commented 11 months ago

How did you build TensorRT-LLM?

namang301 commented 11 months ago

Thanks for the reply.

I installed TensorRT-LLM using:

$ pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com/ --extra-index-url https://download.pytorch.org/whl/cu122

I did that because building with $ python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt gave me this output:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com/ Requirement already satisfied: accelerate==0.20.3 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 1)) (0.20.3) Requirement already satisfied: build in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 2)) (1.0.3) Requirement already satisfied: colored in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 3)) (2.2.4) Requirement already satisfied: cuda-python==12.2.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 4)) (12.2.0) Requirement already satisfied: diffusers==0.15.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 5)) (0.15.0) Requirement already satisfied: lark in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 6)) (1.1.8) Requirement already satisfied: mpi4py in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 7)) (3.1.5) Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 8)) (1.26.1) Requirement already satisfied: onnx>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 9)) (1.15.0) Requirement already satisfied: polygraphy in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 10)) (0.49.0) Requirement already satisfied: sentencepiece>=0.1.99 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 11)) (0.1.99) Requirement already satisfied: tensorrt>=8.6.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 12)) (9.1.0.post12.dev4) Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 13)) (2.1.0) Requirement already satisfied: transformers==4.33.1 in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 14)) (4.33.1) Requirement already satisfied: wheel in 
/usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 15)) (0.41.2) Requirement already satisfied: optimum in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 16)) (1.16.1) Requirement already satisfied: evaluate in /usr/local/lib/python3.10/dist-packages (from -r requirements.txt (line 17)) (0.4.1) Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 2)) (2.16.0) Requirement already satisfied: einops in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 3)) (0.7.0) Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 5)) (0.20.1) Requirement already satisfied: mypy in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 6)) (1.8.0) Requirement already satisfied: parameterized in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 7)) (0.9.0) Requirement already satisfied: pre-commit in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 8)) (3.6.0) Requirement already satisfied: pybind11-stubgen in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 9)) (2.4.2) Requirement already satisfied: pynvml>=11.5.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 10)) (11.5.0) Requirement already satisfied: pytest-cov in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 11)) (4.1.0) Requirement already satisfied: pytest-forked in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 12)) (1.6.0) Requirement already satisfied: pytest-xdist in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 13)) (3.5.0) Requirement already satisfied: rouge_score in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 14)) (0.1.2) Requirement already satisfied: 
typing-extensions==4.8.0 in /usr/local/lib/python3.10/dist-packages (from -r requirements-dev.txt (line 16)) (4.8.0) Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate==0.20.3->-r requirements.txt (line 1)) (23.2) Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate==0.20.3->-r requirements.txt (line 1)) (5.9.7) Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate==0.20.3->-r requirements.txt (line 1)) (6.0.1) Requirement already satisfied: cython in /usr/local/lib/python3.10/dist-packages (from cuda-python==12.2.0->-r requirements.txt (line 4)) (3.0.7) Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (10.1.0) Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (3.12.4) Requirement already satisfied: huggingface-hub>=0.13.2 in /usr/local/lib/python3.10/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (0.20.1) Requirement already satisfied: importlib-metadata in /usr/lib/python3/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (4.6.4) Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (2023.10.3) Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from diffusers==0.15.0->-r requirements.txt (line 5)) (2.31.0) Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.33.1->-r requirements.txt (line 14)) (0.13.3) Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.33.1->-r requirements.txt (line 14)) (0.4.0) Requirement already satisfied: tqdm>=4.27 in 
/usr/local/lib/python3.10/dist-packages (from transformers==4.33.1->-r requirements.txt (line 14)) (4.66.1) Requirement already satisfied: pyproject_hooks in /usr/local/lib/python3.10/dist-packages (from build->-r requirements.txt (line 2)) (1.0.0) Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from build->-r requirements.txt (line 2)) (2.0.1) Requirement already satisfied: protobuf>=3.20.2 in /usr/local/lib/python3.10/dist-packages (from onnx>=1.12.0->-r requirements.txt (line 9)) (4.25.1) Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (1.12) Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (3.2) Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (3.1.2) Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (2023.9.2) Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.105) Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.105) Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.105) Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (8.9.2.26) Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.3.1) Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch->-r 
requirements.txt (line 13)) (11.0.2.54) Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (10.3.2.106) Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (11.4.5.107) Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.0.106) Requirement already satisfied: nvidia-nccl-cu12==2.18.1 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (2.18.1) Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (12.1.105) Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch->-r requirements.txt (line 13)) (2.1.0) Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->-r requirements.txt (line 13)) (12.3.101) Requirement already satisfied: coloredlogs in /usr/local/lib/python3.10/dist-packages (from optimum->-r requirements.txt (line 16)) (15.0.1) Requirement already satisfied: dill in /usr/local/lib/python3.10/dist-packages (from evaluate->-r requirements.txt (line 17)) (0.3.7) Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from evaluate->-r requirements.txt (line 17)) (2.1.4) Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from evaluate->-r requirements.txt (line 17)) (3.4.1) Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from evaluate->-r requirements.txt (line 17)) (0.70.15) Requirement already satisfied: responses<0.19 in /usr/local/lib/python3.10/dist-packages (from evaluate->-r requirements.txt (line 17)) (0.18.0) 
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets->-r requirements-dev.txt (line 2)) (14.0.2) Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets->-r requirements-dev.txt (line 2)) (0.6) Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets->-r requirements-dev.txt (line 2)) (3.9.1) Requirement already satisfied: mypy-extensions>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from mypy->-r requirements-dev.txt (line 6)) (1.0.0) Requirement already satisfied: cfgv>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->-r requirements-dev.txt (line 8)) (3.4.0) Requirement already satisfied: identify>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->-r requirements-dev.txt (line 8)) (2.5.33) Requirement already satisfied: nodeenv>=0.11.1 in /usr/local/lib/python3.10/dist-packages (from pre-commit->-r requirements-dev.txt (line 8)) (1.8.0) Requirement already satisfied: virtualenv>=20.10.0 in /usr/local/lib/python3.10/dist-packages (from pre-commit->-r requirements-dev.txt (line 8)) (20.25.0) Requirement already satisfied: pytest>=4.6 in /usr/local/lib/python3.10/dist-packages (from pytest-cov->-r requirements-dev.txt (line 11)) (7.4.3) Requirement already satisfied: coverage>=5.2.1 in /usr/local/lib/python3.10/dist-packages (from coverage[toml]>=5.2.1->pytest-cov->-r requirements-dev.txt (line 11)) (7.4.0) Requirement already satisfied: py in /usr/local/lib/python3.10/dist-packages (from pytest-forked->-r requirements-dev.txt (line 12)) (1.11.0) Requirement already satisfied: execnet>=1.1 in /usr/local/lib/python3.10/dist-packages (from pytest-xdist->-r requirements-dev.txt (line 13)) (2.0.2) Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge_score->-r requirements-dev.txt (line 14)) (2.0.0) Requirement already satisfied: nltk in 
/usr/local/lib/python3.10/dist-packages (from rouge_score->-r requirements-dev.txt (line 14)) (3.8.1) Requirement already satisfied: six>=1.14.0 in /usr/lib/python3/dist-packages (from rouge_score->-r requirements-dev.txt (line 14)) (1.16.0) Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (23.1.0) Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (6.0.4) Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (1.9.4) Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (1.4.1) Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (1.3.1) Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets->-r requirements-dev.txt (line 2)) (4.0.3) Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from nodeenv>=0.11.1->pre-commit->-r requirements-dev.txt (line 8)) (68.2.2) Requirement already satisfied: iniconfig in /usr/local/lib/python3.10/dist-packages (from pytest>=4.6->pytest-cov->-r requirements-dev.txt (line 11)) (2.0.0) Requirement already satisfied: pluggy<2.0,>=0.12 in /usr/local/lib/python3.10/dist-packages (from pytest>=4.6->pytest-cov->-r requirements-dev.txt (line 11)) (1.3.0) Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /usr/local/lib/python3.10/dist-packages (from pytest>=4.6->pytest-cov->-r requirements-dev.txt (line 11)) (1.2.0) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers==0.15.0->-r 
requirements.txt (line 5)) (3.3.0) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers==0.15.0->-r requirements.txt (line 5)) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers==0.15.0->-r requirements.txt (line 5)) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->diffusers==0.15.0->-r requirements.txt (line 5)) (2023.7.22) Requirement already satisfied: distlib<1,>=0.3.7 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->-r requirements-dev.txt (line 8)) (0.3.8) Requirement already satisfied: platformdirs<5,>=3.9.1 in /usr/local/lib/python3.10/dist-packages (from virtualenv>=20.10.0->pre-commit->-r requirements-dev.txt (line 8)) (4.1.0) Requirement already satisfied: humanfriendly>=9.1 in /usr/local/lib/python3.10/dist-packages (from coloredlogs->optimum->-r requirements.txt (line 16)) (10.0) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->-r requirements.txt (line 13)) (2.1.3) Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge_score->-r requirements-dev.txt (line 14)) (8.1.7) Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->rouge_score->-r requirements-dev.txt (line 14)) (1.3.2) Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->evaluate->-r requirements.txt (line 17)) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->evaluate->-r requirements.txt (line 17)) (2023.3.post1) Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->evaluate->-r requirements.txt (line 17)) (2023.3) Requirement already satisfied: mpmath>=0.19 
in /usr/local/lib/python3.10/dist-packages (from sympy->torch->-r requirements.txt (line 13)) (1.3.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 23.3 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip
-- The CXX compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- NVTX is disabled
-- Importing batch manager
-- Building PyTorch
-- Building Google tests
-- Building benchmarks
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - NOTFOUND
CMake Error at CMakeLists.txt:118 (message):
No CUDA compiler found

-- Configuring incomplete, errors occurred!
See also "{mydir}/TensorRT-LLM/cpp/build/CMakeFiles/CMakeOutput.log".
See also "{mydir}/TensorRT-LLM/cpp/build/CMakeFiles/CMakeError.log".
Traceback (most recent call last):
  File "{mydir}/TensorRT-LLM/./scripts/build_wheel.py", line 306, in <module>
    main(**vars(args))
  File "{mydir}/TensorRT-LLM/./scripts/build_wheel.py", line 160, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake -DCMAKE_BUILD_TYPE="Release" -DBUILD_PYT="ON" -DBUILD_PYBIND="OFF" -DTRT_LIB_DIR=/usr/local/tensorrt/targets/x86_64-linux-gnu/lib -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include -S "{mydir}/TensorRT-LLM/cpp"' returned non-zero exit status 1

Do I need to build TensorRT-LLM again? If so, how can I fix this error?

Shixiaowei02 commented 11 months ago
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - NOTFOUND
CMake Error at CMakeLists.txt:118 (message):
No CUDA compiler found

According to the error message, please verify that CUDA is installed correctly. Thank you!
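
As a rough way to check this from Python, the sketch below looks for nvcc in the places a build is likely to search. It is an approximation and not CMake's actual detection logic: it checks the CUDACXX environment variable, then PATH, then /usr/local/cuda/bin (the last location is an assumption about the container layout). The function name find_nvcc is hypothetical.

```python
import os
import shutil
from typing import Mapping, Optional


def find_nvcc(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a path to nvcc, or None if none is found.

    Checks, in order: the CUDACXX variable, the PATH in `env`, and the
    conventional /usr/local/cuda/bin location (an assumed default).
    """
    candidate = env.get("CUDACXX")
    if candidate and os.path.isfile(candidate):
        return candidate
    on_path = shutil.which("nvcc", path=env.get("PATH", ""))
    if on_path:
        return on_path
    fallback = "/usr/local/cuda/bin/nvcc"
    return fallback if os.path.isfile(fallback) else None


if __name__ == "__main__":
    nvcc = find_nvcc()
    print(nvcc if nvcc else "nvcc not found: check CUDACXX and PATH")
```

If this prints a path under /usr/local/cuda/bin while the build still reports "No CUDA compiler found", adding that directory to PATH before running build_wheel.py may be enough.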

namang301 commented 11 months ago

Thanks for the reply. After checking my CUDA installation, I built TensorRT-LLM again and got this error:

{skip logs...}
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs2Int8b.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int4b.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs4Int8b.cu.o
[100%] Building CUDA object tensorrt_llm/kernels/CMakeFiles/kernels_src.dir/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu.o
[100%] Built target layers_src
[100%] Built target common_src
[100%] Built target runtime_src
[100%] Built target kernels_src
[100%] Linking CXX static library libtensorrt_llm_static.a
[100%] Built target tensorrt_llm_static
[100%] Linking CXX shared library libtensorrt_llm.so
/usr/bin/ld: ../../tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a(kvCacheManager.cpp.o): in function `tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::getMaxNumTokens(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&, nvinfer1::DataType, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::BufferManager const&)':
kvCacheManager.cpp:(.text+0x213e): undefined reference to `ompi_mpi_comm_world'
collect2: error: ld returned 1 exit status
gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:1223: tensorrt_llm/libtensorrt_llm.so] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:727: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:734: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
gmake: *** [Makefile:179: tensorrt_llm] Error 2
Traceback (most recent call last):
  File "{mydir}/TensorRT-LLM/./scripts/build_wheel.py", line 306, in <module>
    main(**vars(args))
  File "{mydir}/TensorRT-LLM/./scripts/build_wheel.py", line 164, in main
    build_run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 255 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.

I already checked #617, but it did not solve my problem. I think MPI is already set up (/usr/local/mpi -> /opt/hpcx/ompi) in "nvcr.io-nvidia-tritionserver-23.10-trtllm-python-py3". However, to build TensorRT-LLM I had to run $ apt install mpich, because the 'mpi4py' installation required it. I also added the following:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mpi/lib
export OPAL_PREFIX=/opt/hpcx/ompi

I guess I need to change how Open MPI is linked, but I don't know how. How can I solve this?
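
To make the link error easier to read, the sketch below pulls the undefined symbols out of linker output and guesses which MPI implementation they belong to. It assumes that symbols prefixed with ompi_, opal_ or orte_ come from Open MPI, which matches the ompi_mpi_comm_world symbol and the opal_* frames in the logs above; the function names are hypothetical. If the missing symbols are Open MPI ones while MPICH was installed via apt, the link step is probably resolving against the wrong MPI.

```python
import re
from typing import List

# Matches GNU ld messages such as:
#   undefined reference to `ompi_mpi_comm_world'
UNDEFINED_REF = re.compile(r"undefined reference to\s*[`']?(\w+)")


def undefined_symbols(linker_output: str) -> List[str]:
    """Return the symbols that ld reported as undefined, in order."""
    return UNDEFINED_REF.findall(linker_output)


def guess_mpi_flavor(symbol: str) -> str:
    """Guess the MPI implementation from a symbol prefix (heuristic only)."""
    if symbol.startswith(("ompi_", "opal_", "orte_")):
        return "Open MPI"
    return "unknown"


if __name__ == "__main__":
    log = (
        "kvCacheManager.cpp:(.text+0x213e): "
        "undefined reference to `ompi_mpi_comm_world'\n"
        "collect2: error: ld returned 1 exit status"
    )
    for sym in undefined_symbols(log):
        print(f"{sym}: {guess_mpi_flavor(sym)}")
```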

namang301 commented 11 months ago

@Shixiaowei02 Is there any way I can solve this problem?

Shixiaowei02 commented 11 months ago

Please try running these commands. Thank you!

# Pull and launch the Docker container
docker run --rm -it --entrypoint bash nvidia/cuda:12.2.2-devel-ubuntu22.04
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
# Install the latest version of TensorRT-LLM
pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
# Check installation
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
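
If the import itself crashes, as in the original report, the installed versions can still be read from package metadata without importing anything. This is a minimal sketch with the standard library; the helper name installed_version is hypothetical, and the package names listed are simply the ones mentioned in this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent.

    Reads package metadata only, so the package's own code (and any
    native library it loads) is never executed.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("tensorrt_llm", "tensorrt", "mpi4py"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```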

Nam-ang commented 10 months ago

@Shixiaowei02 Thanks, I rebuilt my Docker image.

Now I get a new error when I run:

$ python3 ../run.py --max_output_len=50 \
    --tokenizer_dir meta-llama/llama-2-7b-hf \
    --engine_dir=./tmp/llama/out/

The error message is:

... { skip error message } ...

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: d == a + length (/src/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:434)
1 0x7f13c715d229 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x47229) [0x7f13c715d229]
2 0x7f13c715d473 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x47473) [0x7f13c715d473]
3 0x7f13c72268e3 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(void const, unsigned long) + 19
4 0x7f13c7226962 tensorrt_llm::plugins::GPTAttentionPluginCreator::deserializePlugin(char const, void const, unsigned long) + 50
5 0x7f1416cbf8a6 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10d68a6) [0x7f1416cbf8a6]
6 0x7f1416cb766e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10ce66e) [0x7f1416cb766e]
7 0x7f1416c52217 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1069217) [0x7f1416c52217]
8 0x7f1416c5019e /usr/local/tensorrt/lib/libnvinfer.so.9(+0x106719e) [0x7f1416c5019e]
9 0x7f1416c67c2b /usr/local/tensorrt/lib/libnvinfer.so.9(+0x107ec2b) [0x7f1416c67c2b]
10 0x7f1416c6ae32 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1081e32) [0x7f1416c6ae32]
11 0x7f1416c6b20c /usr/local/tensorrt/lib/libnvinfer.so.9(+0x108220c) [0x7f1416c6b20c]
12 0x7f1416c9e9b1 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b59b1) [0x7f1416c9e9b1]
13 0x7f1416c9f777 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10b6777) [0x7f1416c9f777]
14 0x7f14256571a5 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x571a5) [0x7f14256571a5]
15 0x7f1425643433 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x43433) [0x7f1425643433]
16 0x560d65c4ee0e python(+0x15fe0e) [0x560d65c4ee0e]
17 0x560d65c455eb _PyObject_MakeTpCall + 603
18 0x560d65c5d7bb python(+0x16e7bb) [0x560d65c5d7bb]
19 0x560d65c3d8a2 _PyEval_EvalFrameDefault + 24914
20 0x560d65c4f70c _PyFunction_Vectorcall + 124
21 0x560d65c37f52 _PyEval_EvalFrameDefault + 2050
22 0x560d65c44784 _PyObject_FastCallDictTstate + 196
23 0x560d65c59744 python(+0x16a744) [0x560d65c59744]
24 0x560d65c4558c _PyObject_MakeTpCall + 508
25 0x560d65c3dc66 _PyEval_EvalFrameDefault + 25878
26 0x560d65c4f70c _PyFunction_Vectorcall + 124
27 0x560d65c4482d _PyObject_FastCallDictTstate + 365
28 0x560d65c59744 python(+0x16a744) [0x560d65c59744]
29 0x560d65c4558c _PyObject_MakeTpCall + 508
30 0x560d65c3e908 _PyEval_EvalFrameDefault + 29112
31 0x560d65c5d4e1 python(+0x16e4e1) [0x560d65c5d4e1]
32 0x560d65c5e192 PyObject_Call + 290
33 0x560d65c3a2c1 _PyEval_EvalFrameDefault + 11121
34 0x560d65c4f70c _PyFunction_Vectorcall + 124
35 0x560d65c37e0d _PyEval_EvalFrameDefault + 1725
36 0x560d65d28e56 python(+0x239e56) [0x560d65d28e56]
37 0x560d65d28cf6 PyEval_EvalCode + 134
38 0x560d65d537d8 python(+0x2647d8) [0x560d65d537d8]
39 0x560d65d4d0bb python(+0x25e0bb) [0x560d65d4d0bb]
40 0x560d65d53525 python(+0x264525) [0x560d65d53525]
41 0x560d65d52a08 _PyRun_SimpleFileObject + 424
42 0x560d65d52653 _PyRun_AnyFileObject + 67
43 0x560d65d4541e Py_RunMain + 702
44 0x560d65d1bcad Py_BytesMain + 45
45 0x7f1675747d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f1675747d90]
46 0x7f1675747e40 __libc_start_main + 128
47 0x560d65d1bba5 _start + 37
[instance-1902:17213:0:17213] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 17213) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000010d475f createInferRuntime_INTERNAL() ???:0
2 0x000000000107dd42 getInferLibVersion() ???:0
3 0x00000000010808ec getInferLibVersion() ???:0
4 0x0000000001081e32 getInferLibVersion() ???:0
5 0x000000000108220c getInferLibVersion() ???:0
6 0x00000000010b59b1 createInferRuntime_INTERNAL() ???:0
7 0x00000000010b6777 createInferRuntime_INTERNAL() ???:0
8 0x00000000000571a5 pybind11::cpp_function::initialize<tensorrt::lambdas::{lambda(nvinfer1::IRuntime&, pybind11::buffer&)#8} const&, nvinfer1::ICudaEngine, nvinfer1::IRuntime&, pybind11::buffer&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, char const, pybind11::call_guard, pybind11::keep_alive<0ul, 1ul> >(tensorrt::lambdas::{lambda(nvinfer1::IRuntime&, pybind11::buffer&)#8} const&, nvinfer1::ICudaEngine ()(nvinfer1::IRuntime&, pybind11::buffer&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, char const const&, pybind11::call_guard const&, pybind11::keep_alive<0ul, 1ul> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() pyCore.cpp:0
9 0x0000000000043433 pybind11::cpp_function::dispatcher() :0
10 0x000000000015fe0e PyObject_CallFunctionObjArgs() ???:0
11 0x00000000001565eb _PyObject_MakeTpCall() ???:0
12 0x000000000016e7bb PyMethod_New() ???:0
13 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
14 0x000000000016070c _PyFunction_Vectorcall() ???:0
15 0x0000000000148f52 _PyEval_EvalFrameDefault() ???:0
16 0x0000000000155784 _PyObject_FastCallDictTstate() ???:0
17 0x000000000016a744 _PyStack_AsDict() ???:0
18 0x000000000015658c _PyObject_MakeTpCall() ???:0
19 0x000000000014ec66 _PyEval_EvalFrameDefault() ???:0
20 0x000000000016070c _PyFunction_Vectorcall() ???:0
21 0x000000000015582d _PyObject_FastCallDictTstate() ???:0
22 0x000000000016a744 _PyStack_AsDict() ???:0
23 0x000000000015658c _PyObject_MakeTpCall() ???:0
24 0x000000000014f908 _PyEval_EvalFrameDefault() ???:0
25 0x000000000016e4e1 PyMethod_New() ???:0
26 0x000000000016f192 PyObject_Call() ???:0
27 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
28 0x000000000016070c _PyFunction_Vectorcall() ???:0
29 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
30 0x0000000000239e56 PyEval_EvalCode() ???:0
31 0x0000000000239cf6 PyEval_EvalCode() ???:0
32 0x00000000002647d8 PyUnicode_Tailmatch() ???:0
33 0x000000000025e0bb PyInit__collections() ???:0
34 0x0000000000264525 PyUnicode_Tailmatch() ???:0
35 0x0000000000263a08 _PyRun_SimpleFileObject() ???:0
36 0x0000000000263653 _PyRun_AnyFileObject()
???:0 37 0x000000000025641e Py_RunMain() ???:0 38 0x000000000022ccad Py_BytesMain() ???:0 39 0x0000000000029d90 libc_init_first() ???:0 40 0x0000000000029e40 libc_start_main() ???:0 41 0x000000000022cba5 _start() ???:0

Segmentation fault (core dumped)

hello-11 commented 1 week ago

@namang301 Do you still have the problem? If not, we will close it soon.