NVIDIA / GenerativeAIExamples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
https://nvidia.github.io/GenerativeAIExamples/latest/index.html
Apache License 2.0
1.77k stars, 294 forks

triton-inference-server cannot be started #129

Open tuninger opened 1 month ago

tuninger commented 1 month ago

NAME                                       READY   STATUS             RESTARTS      AGE
jupyter-notebook-server-5f785cd7c8-x8qd6   1/1     Running            0             45m
llm-playground-7d8c999487-fgmj5            1/1     Running            0             45m
milvu-etcd-7cf545456f-m8q9m                1/1     Running            0             45m
milvus-minio-7ff64c76f-4njkz               1/1     Running            0             45m
milvus-standalone-7479bf9ddd-n6s6f         1/1     Running            0             45m
query-router-65c6f864ff-fstkb              1/1     Running            0             45m
triton-inference-server-7cd84c8f4b-wzsk9   0/1     CrashLoopBackOff   8 (18s ago)   23m

[triton-inference-server-7cd84c8f4b-wzsk9:30 :0:30] Caught signal 7 (Bus error: nonexistent physical address)
backtrace (tid: 30)
 0 0x0000000000042520 sigaction() ???:0
 1 0x000000000001678b uct_iface_mp_chunk_alloc_inner() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/uct/base/uct_mem.c:469
 2 0x000000000001678b uct_iface_mp_chunk_alloc() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/uct/base/uct_mem.c:443
 3 0x000000000005407b ucs_mpool_grow() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/ucs/datastruct/mpool.c:266
 4 0x00000000000542c9 ucs_mpool_get_grow() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/ucs/datastruct/mpool.c:312
 5 0x000000000001b488 uct_mm_iface_t_init() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/uct/sm/mm/base/mm_iface.c:822
 6 0x000000000001b9f2 uct_mm_iface_t_init() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/uct/sm/mm/base/mm_iface.c:720
 7 0x0000000000014f02 uct_iface_open() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/uct/base/uct_md.c:284
 8 0x000000000004a017 ucp_worker_iface_open() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/ucp/core/ucp_worker.c:1357
 9 0x000000000004afe0 ucp_worker_add_resource_ifaces() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/ucp/core/ucp_worker.c:1101
10 0x000000000004d2db ucp_worker_create() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ucx-d799cfdd2293cc72206aa8188deb8e7d22c82c9f/src/ucp/core/ucp_worker.c:2441
11 0x000000000000702f mca_pml_ucx_init() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ompi-5980bac63337537bf34556f95ee7511778de44a5/ompi/mca/pml/ucx/pml_ucx.c:306
12 0x00000000000093a5 mca_pml_ucx_component_init() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ompi-5980bac63337537bf34556f95ee7511778de44a5/ompi/mca/pml/ucx/pml_ucx_component.c:136
13 0x00000000000c7022 mca_pml_base_select() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ompi-5980bac63337537bf34556f95ee7511778de44a5/ompi/mca/pml/base/pml_base_select.c:127
14 0x00000000000d01c9 ompi_mpi_init() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ompi-5980bac63337537bf34556f95ee7511778de44a5/ompi/runtime/ompi_mpi_init.c:647
15 0x0000000000075899 PMPI_Init_thread() /build-result/src/hpcx-v2.15-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.17-x86_64/ompi-5980bac63337537bf34556f95ee7511778de44a5/ompi/mpi/c/profile/pinit_thread.c:69
16 0x00000000000327a8 pyx_f_6mpi4py_3MPI_bootstrap() /tmp/pip-install-05lukizf/mpi4py_8cc4cad65d414a8995a9d1c890fac173/src/mpi4py.MPI.c:8115
17 0x00000000000327a8 __pyx_pymod_exec_MPI() /tmp/pip-install-05lukizf/mpi4py_8cc4cad65d414a8995a9d1c890fac173/src/mpi4py.MPI.c:176976
18 0x000000000023b2d3 PyModule_ExecDef() ???:0
19 0x000000000023bda0 PyInit__thread() ???:0
20 0x000000000015f854 PyObject_GenericGetAttr() ???:0
21 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
22 0x000000000016070c _PyFunction_Vectorcall() ???:0
23 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
24 0x000000000016070c _PyFunction_Vectorcall() ???:0
25 0x0000000000148f52 _PyEval_EvalFrameDefault() ???:0
26 0x000000000016070c _PyFunction_Vectorcall() ???:0
27 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
28 0x000000000016070c _PyFunction_Vectorcall() ???:0
29 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
30 0x000000000016070c _PyFunction_Vectorcall() ???:0
31 0x000000000015fb24 PyObject_CallFunctionObjArgs() ???:0
32 0x000000000023f4af _PyObject_CallMethodIdObjArgs() ???:0
33 0x00000000001740ca PyImport_ImportModuleLevelObject() ???:0
34 0x0000000000184458 PyImport_Import() ???:0
35 0x000000000015fe0e PyObject_CallFunctionObjArgs() ???:0
36 0x000000000016f12b PyObject_Call() ???:0
37 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
38 0x000000000016070c _PyFunction_Vectorcall() ???:0
39 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
40 0x000000000016070c _PyFunction_Vectorcall() ???:0
41 0x000000000015fb24 PyObject_CallFunctionObjArgs() ???:0
42 0x000000000023f4af _PyObject_CallMethodIdObjArgs() ???:0
43 0x0000000000174cda PyImport_ImportModuleLevelObject() ???:0
44 0x000000000014b9e5 _PyEval_EvalFrameDefault() ???:0
45 0x0000000000239e56 PyEval_EvalCode() ???:0
46 0x0000000000239cf6 PyEval_EvalCode() ???:0
47 0x000000000023fb0d PyFrozenSet_New() ???:0
48 0x0000000000160969 PyCell_New() ???:0
49 0x000000000014b2c1 _PyEval_EvalFrameDefault() ???:0
50 0x000000000016070c _PyFunction_Vectorcall() ???:0
51 0x000000000014e8a2 _PyEval_EvalFrameDefault() ???:0
52 0x000000000016070c _PyFunction_Vectorcall() ???:0
53 0x0000000000148f52 _PyEval_EvalFrameDefault() ???:0
54 0x000000000016070c _PyFunction_Vectorcall() ???:0
55 0x0000000000148e0d _PyEval_EvalFrameDefault() ???:0
56 0x000000000016070c _PyFunction_Vectorcall() ???:0

[triton-inference-server-7cd84c8f4b-wzsk9:00030] Process received signal
[triton-inference-server-7cd84c8f4b-wzsk9:00030] Signal: Bus error (7)
[triton-inference-server-7cd84c8f4b-wzsk9:00030] Signal code: (-6)
[triton-inference-server-7cd84c8f4b-wzsk9:00030] Failing at address: 0x1e
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f9d7caa7520]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 1] /opt/hpcx/ucx/lib/libuct.so.0(uct_iface_mp_chunk_alloc+0x7b)[0x7f9d3689178b]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 2] /opt/hpcx/ucx/lib/libucs.so.0(ucs_mpool_grow+0x7b)[0x7f9d3691607b]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(ucs_mpool_get_grow+0x19)[0x7f9d369162c9]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 4] /opt/hpcx/ucx/lib/libuct.so.0(+0x1b488)[0x7f9d36896488]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 5] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_iface_t_new+0xb2)[0x7f9d368969f2]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 6] /opt/hpcx/ucx/lib/libuct.so.0(uct_iface_open+0xe2)[0x7f9d3688ff02]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 7] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_iface_open+0x317)[0x7f9d36a93017]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 8] /opt/hpcx/ucx/lib/libucp.so.0(+0x4afe0)[0x7f9d36a93fe0]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [ 9] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_create+0x7cb)[0x7f9d36a962db]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [10] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x9f)[0x7f9d36b2f02f]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [11] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(+0x93a5)[0x7f9d36b313a5]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [12] /opt/hpcx/ompi/lib/libmpi.so.40(mca_pml_base_select+0x1e2)[0x7f9c1bc35022]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [13] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0x6c9)[0x7f9c1bc3e1c9]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [14] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Init_thread+0x79)[0x7f9c1bbe3899]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [15] /usr/local/lib/python3.10/dist-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x327a8)[0x7f9c1bcbf7a8]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [16] /usr/bin/python3(PyModule_ExecDef+0x73)[0x55f3c471e2d3]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [17] /usr/bin/python3(+0x23bda0)[0x55f3c471eda0]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [18] /usr/bin/python3(+0x15f854)[0x55f3c4642854]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [19] /usr/bin/python3(_PyEval_EvalFrameDefault+0x2b71)[0x55f3c462e2c1]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [20] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55f3c464370c]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [21] /usr/bin/python3(_PyEval_EvalFrameDefault+0x6152)[0x55f3c46318a2]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [22] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55f3c464370c]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x802)[0x55f3c462bf52]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [24] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55f3c464370c]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [25] /usr/bin/python3(_PyEval_EvalFrameDefault+0x6bd)[0x55f3c462be0d]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [26] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55f3c464370c]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [27] /usr/bin/python3(_PyEval_EvalFrameDefault+0x6bd)[0x55f3c462be0d]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [28] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55f3c464370c]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] [29] /usr/bin/python3(+0x15fb24)[0x55f3c4642b24]
[triton-inference-server-7cd84c8f4b-wzsk9:00030] End of error message
[23] May 29 04:16:21 [ ERROR] - main - TensorRT conversion returned a non-zero exit code.