bpickrel opened this issue 1 year ago
Use the following target models for testing: resnet50, BERT, distilgpt2
Baby-step the process for this and see what we need / can leverage from existing backends and execution providers.
Latest update: working from git@github.com:triton-inference-server/onnxruntime_backend.git, which uses a CMake build; no success yet, still working on it.
git clone git@github.com:triton-inference-server/server.git
cd ~/Triton-server/server/docs/examples
./fetch_models.sh
_Note: the model_repository directory != model-repository. We want the one with the underscore._
nano model_repository/densenet_onnx/config.pbtxt
Add the line backend: "onnxruntime"
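For reference, after that edit the config has roughly this shape (a minimal sketch; everything except the backend line is placeholder structure, not the real densenet_onnx fields):

# Sketch of the relevant part of model_repository/densenet_onnx/config.pbtxt.
name: "densenet_onnx"
backend: "onnxruntime"     # the line added above
max_batch_size: 0
# ... input/output definitions from the fetched example config ...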
In a different console, same directory, run a prebuilt Docker image of the server
docker run --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.07-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
The above doesn't go in the order of Ted's earlier note. I'm running a prebuilt Docker image of a server before having built my own server.
Here's what the onnxruntime shared libraries look like, as installed in that server Docker I used above:
root@home-tower:/opt/tritonserver/backends# ll onnxruntime/
total 507608
drwxrwxrwx 3 triton-server triton-server 4096 Apr 18 2023 ./
drwxrwxrwx 13 triton-server triton-server 4096 Apr 18 2023 ../
-rw-rw-rw- 1 triton-server triton-server 1073 Apr 18 2023 LICENSE
drwxrwxrwx 2 triton-server triton-server 4096 Apr 18 2023 LICENSE.openvino/
-rw-rw-rw- 1 triton-server triton-server 21015528 Apr 18 2023 libonnxruntime.so
-rw-rw-rw- 1 triton-server triton-server 420751256 Apr 18 2023 libonnxruntime_providers_cuda.so
-rw-rw-rw- 1 triton-server triton-server 559152 Apr 18 2023 libonnxruntime_providers_openvino.so
-rw-rw-rw- 1 triton-server triton-server 15960 Apr 18 2023 libonnxruntime_providers_shared.so
-rw-rw-rw- 1 triton-server triton-server 548472 Apr 18 2023 libonnxruntime_providers_tensorrt.so
-rw-rw-rw- 1 triton-server triton-server 12953944 Apr 18 2023 libopenvino.so
-rw-rw-rw- 1 triton-server triton-server 288352 Apr 18 2023 libopenvino_c.so
-rw-rw-rw- 1 triton-server triton-server 32005816 Apr 18 2023 libopenvino_intel_cpu_plugin.so
-rw-rw-rw- 1 triton-server triton-server 332096 Apr 18 2023 libopenvino_ir_frontend.so
-rw-rw-rw- 1 triton-server triton-server 3684352 Apr 18 2023 libopenvino_onnx_frontend.so
lrwxrwxrwx 1 triton-server triton-server 11 Apr 18 2023 libtbb.so -> libtbb.so.2
-rw-rw-rw- 1 triton-server triton-server 438832 Apr 18 2023 libtbb.so.2
-rw-rw-rw- 1 triton-server triton-server 689616 Apr 18 2023 libtriton_onnxruntime.so
-rwxrwxrwx 1 triton-server triton-server 22923312 Apr 18 2023 onnx_test_runner*
-rwxrwxrwx 1 triton-server triton-server 3529192 Apr 18 2023 onnxruntime_perf_test*
-rw-rw-rw- 1 triton-server triton-server 7 Apr 18 2023 ort_onnx_version.txt
-rw-rw-rw- 1 triton-server triton-server 1056 Apr 18 2023 plugins.xml
I'll need to take this over. It looks like what Brian's done works: it pulls in Onnxruntime as a backend and then adds the TensorRT/CUDA side. I'll poke around this week/next to get this running with our flavor of Onnxruntime with the ROCm/MIGraphX EPs.
Got changes generating and building a dockerfile, @bpickrel. Once I've got this building I'll pass it back to you to generate the docker image and then run it before passing this through another inference.
Changes are up on my fork: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py
I found that we have upstream triton images for ROCm we can leverage which I'm currently building.
https://hub.docker.com/r/rocm/oai-triton/tags
So to run the script you just need to run the following off a ROCm 5.7 dockerfile with all the Triton Inference Server pieces:
python3 tools/gen_ort_dockerfile.py --migraphx-home=/opt/rocm --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=5.7 --enable-rocm --migraphx-version=rocm-5.7.1 --ort-version=1.7.0 --output=migx_rocm_triton_inf.dockerfile --triton-container=rocm/oai-triton:preview_2023-11-29_182
then
docker build -t migx_rocm_tritron -f migx_rocm_triton_inf.dockerfile .
That should create the docker you would need to run an inference similar to the previous example
Hitting a failure when attempting to build onnxruntime this way. Currently looking into this.
Finally some good news after debugging. @bpickrel @causten
I'm able to build a container off the generated dockerfile and get the proper hooks/links to work for a MIGraphX + ROCm build.
Needed to rework some of the automation used to build this onnxruntime backend, which uses a different repo than what we use for our onnxruntime builds. Had to change the rel-XXXX branch selection to just building Onnxruntime main.
Needed to add additional pieces from the DLM/CI dockerfiles to get Onnxruntime to build.
It looks like the built container contains two binaries, one for perf and one for test.
All the library shared objects are there too after popping into the container to take a look
libonnxruntime.so libonnxruntime.so.main libonnxruntime_providers_migraphx.so libonnxruntime_providers_rocm.so libonnxruntime_providers_shared.so
root@aus-navi3x-02:/opt/onnxruntime/lib#
There's a bin folder that seems to contain binaries we'd use to do the perf. Output of these seems interesting; maybe I should also add hooks/pieces like TensorRT has here.
root@aus-navi3x-02:/opt/onnxruntime/bin# ./onnxruntime_perf_test
perf_test [options...] model_path [result_file]
Options:
-m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times.
-M: Disable memory pattern.
-A: Disable memory arena
-I: Generate tensor input binding (Free dimensions are treated as 1.)
-c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
-e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
-b [tf|ort]: backend to use. Default:ort
-r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
-t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
-p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
-s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
-S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
-v: Show verbose information.
-x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
-y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
-f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
-F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
-P: Use parallel executor instead of sequential executor.
-o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
-u [optimized_model_path]: Specify the optimized model path for saving.
-d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default).
-q [CUDA only] use separate stream for copy.
-z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
-i: Specify EP specific runtime options as key value pairs. Different runtime options available are:
[OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
[OpenVINO only] [device_id]: Selects a particular hardware device for inference.
[OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
[OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
[OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs(Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
[OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device(Reduces the CPU Utilization while using GPU)
[QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
[QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
[QNN only] [rpc_control_latency]: QNN rpc control latency. default to 10.
[QNN only] [vtcm_mb]: QNN VTCM size in MB. default to 0(not set).
[QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance',
'high_power_saver', 'low_balanced', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'.
[QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'.
[QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
[QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options:
'0', '1', '2', '3', default is '0'.
[Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'
[Example] [For OpenVINO EP] -e openvino -i "device_type|CPU_FP32 enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""
[Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so"
[TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
[TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
[TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
[TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
[TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
[TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
[TensorRT only] [trt_int8_use_native_calibration_table]: Use Native TensorRT calibration table.
[TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
[TensorRT only] [trt_dla_core]: DLA core number.
[TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
[TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
[TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
[TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
[TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
[TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
[Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'
[Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'
[NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP..
[NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
[NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
[NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
[Usage]: -e <provider_name> -i '<key1> <key2>'
[Example] [For NNAPI EP] -e nnapi -i " NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED "
[SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'.
[SNPE only] [priority]: execution priority, options: 'low', 'normal'.
[SNPE only] [buffer_type]: options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. default: ITENSOR'.
[SNPE only] [enable_init_cache]: enable SNPE init caching feature, set to 1 to enabled it. Disabled by default.
[Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'
[Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low"
-T [Set intra op thread affinities]: Specify intra op thread affinity string
[Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6
Use semicolon to separate configuration between threads.
E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
The number of affinities must be equal to intra_op_num_threads - 1
-D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra-op thread pool.
-Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
-h: help
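For reference, once the MIGraphX EP is wired in, a run of this tool would look something like the line below. The model path is illustrative only; the -e/-m/-r/-s options come straight from the help text above.

# Illustrative only: 100 timed iterations through the MIGraphX EP, with
# latency statistics (P75/P90/...) printed at the end.
./onnxruntime_perf_test -e migraphx -m times -r 100 -s /models/densenet_onnx/1/model.onnx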
I've built a container on aus-navi3x-02.amd.com, named migx_rocm_tritron.
You should be able to just docker into it with:
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined migx_rocm_tritron
If you want to pick another system, check out my branch and then run the following to generate a new dockerfile:
python3 tools/gen_ort_dockerfile.py --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=6.0 --enable-rocm --migraphx-version=develop --ort-version=main --output=migx_rocm_triton_inf.dockerfile --triton-container=compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c
This uses build 88 of ROCm 6.0, and builds MIGraphX from Develop and Onnxruntime off main using the upstream Microsoft repo.
Upstream changes are here: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py
Branch name is add_migraphx_rocm_onnxrt_eps
Does your Docker contain a Triton server? I thought this was a replacement for the example docker image that had the server included.
I thought it did, but it looks like I'm wrong here. I did see shared binaries for MIGraphX, Onnxrt and the MIGraphX & ROCm EPs along with some other scripts. Taking a look at the front-end part of Triton, my initial read of their repo is that we need a hook enabled/added there. It sounds like we were using the Nvidia front end in the example instead of just using Onnxruntime + building the back-end support.
"By default, build.py does not enable any of Triton's optional features but you can enable all features, backends, and repository agents with the --enable-all flag. The -v flag turns on verbose output."
D'oh!
Looks like wishful thinking on my part hoping we could just invoke the one container. I suspect the generated backend Docker build script is then leveraged by the frontend so that the missing components get built in. I'll have to dig more on the front end unless you see different hooks for onnxruntime in the main repo.
Looks like we need to hook the container build done in the backend into the front-end server by selecting the right options. The server builds the front end and then the other components from the initial repo. It all seems to be done through CMake, which, through a series of flags, is driven by the build script build.py from the server repo.
I've pushed up the changes to the onnxruntime_backend (PR 231 in that repo), and from the triton-inference-server repo I'm invoking the following after adding changes to their CMake script to incorporate ROCm/MIGraphX:
python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head
Previous attempts got me to where it looks like we were using the gen_ort_dockerfile.py script but failing on TensorRT-build-related items, and that's what got me looking. Brian and I have been going back and forth to confirm whether we're seeing the same failures, and pair debugging.
I've now forked the server repo and run a custom build based on the requirements in their documentation found here: https://github.com/TedThemistokleous/server/tree/add_migraphx_rocm_hooks/docs/customization_guide
I've also added the server repo side changes which I'm testing into: add_migraphx_rocm_hooks off my fork.
Don't we need --enable-gpu?
Here's a tidbit from the issues page:
"By default, if GPU support is enabled, the base image is set to the Triton NGC min container, otherwise the ubuntu:22.04 image is used for a CPU-only build. If you'd like to create a Triton container that is based on other image, you can set the flag --image=base,<image>."
Hitting errors with cuda libraries now. I think I need to start adding hipify links to the API if I'm not including --enable-gpu as a flag
erver/_deps/repo-backend-src/src -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-core-src/include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/src/../include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/include -O3 -DNDEBUG -fPIC -Wall -Wextra -Wno-unused-parameter -Werror -MD -MT _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -MF CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o.d -o CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -c /tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-backend-src/src/backend_output_responder.cc
/workspace/src/memory_alloc.cc:27:10: fatal error: cuda_runtime_api.h: No such file or directory
27 | #include <cuda_runtime_api.h>
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
Running things with
python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --enable-gpu --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head
Just found this article on hipify-clang and hipify-perl. There's a real difference: hipify-perl does string substitutions, while hipify-clang uses a semantic parser (i.e. smarter for complex programs)
https://www.admin-magazine.com/HPC/Articles/Porting-CUDA-to-HIP
I re-ran the tutorial examples as described in my Nov. 17 comment on AWS, both to check out requirements for an AWS instance and to make sure that nothing has happened to break the examples. The tutorial still works with the same instructions, with some extra steps required for provisioning the AWS instance.
BUT I'm still getting a message that the Cuda GPU wasn't found, even though I specified instances that have a GPU. Need to investigate why.
Here are notes to myself to help streamline provisioning:
service docker start
Still trying to install the NVIDIA drivers in the AWS instance so that the Triton server will actually use the GPU. There are instructions at grid-driver but I'm currently trying to set credentials so that the aws command will work. Need to find out what GRID is and if it's what I want.
Update: this page says that Tesla drivers, not Grid drivers, are for ML and other computational tasks.
Update #2: this page gives a list that shows the g3.8xlarge does come with Tesla drivers installed. Back to working out why Triton says it can't find it.
Update #3: driver solved. They lied--the instance did not have a driver, but it can be installed with sudo apt install nvidia-driver-535; nvidia-smi
But now it runs out of disk space when I run inference.
Struggled a bit with trying to port over what onnxruntime was doing with hipify as they perform a custom replace after the initial hipify step for pieces of onnxruntime to compile.
Applying things to the triton-inference-server leads into a non-obvious rabbit hole and takes away from the task at hand. I've asked Jeff Daily for help here since he's more familiar with how best to get things hipified/integrated into CMake.
In the meantime, I've gone over multiple files and run hipify-perl over them in the triton server repo, as well as manually renamed every item guarded by TRITON_ENABLE_GPU in the code for now. The intent here is to get a working compile before cleaning things up.
I've now hit a point, with a few more CMake changes, where I am compiling and just failing on the link step:
[ 53%] Linking CXX executable memory_alloc
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/memory_alloc.dir/link.txt --verbose=0
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseRelease(TRITONSERVER_ResponseAllocator*, void*, void*, unsigned long, TRITONSERVER_memorytype_enum, long)':
memory_alloc.cc:(.text+0xa09): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0xa18): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0xb14): undefined reference to `hipFree'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseAlloc(TRITONSERVER_ResponseAllocator*, char const*, unsigned long, TRITONSERVER_memorytype_enum, long, void*, void**, void**, TRITONSERVER_memorytype_enum*, long*)':
memory_alloc.cc:(.text+0x11df): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x11fe): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1479): undefined reference to `hipMalloc'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::gpu_data_deleter::{lambda(void*)#1}::operator()((anonymous namespace)::gpu_data_deleter) const [clone .constprop.0]':
memory_alloc.cc:(.text+0x1848): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x185b): undefined reference to `hipFree'
/usr/bin/ld: memory_alloc.cc:(.text+0x18ca): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1959): undefined reference to `hipGetErrorString'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `main':
memory_alloc.cc:(.text.startup+0x363c): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x3690): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x36f0): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4313): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x433e): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4371): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4392): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x43c8): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4439): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4a45): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4b41): undefined reference to `hipGetErrorString'
collect2: error: ld returned 1 exit status
Need to sort out if I'm missing some sort of dependency or DIR here as this server image builds a multi and simple version.
Another thing I've noticed, which wasn't obvious when building the Onnxruntime_backend portion, is that they explicitly add onnxruntime EP hooks/setup code for the target EP in onnxruntime_backend/src/onnxruntime.cc:
// Add execution providers if they are requested.
// Don't need to ensure uniqueness of the providers, ONNX Runtime
// will check it.
// GPU execution providers
#ifdef TRITON_ENABLE_GPU
if ((instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_GPU) ||
(instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_AUTO)) {
triton::common::TritonJson::Value optimization;
if (model_config_.Find("optimization", &optimization)) {
triton::common::TritonJson::Value eas;
if (optimization.Find("execution_accelerators", &eas)) {
triton::common::TritonJson::Value gpu_eas;
if (eas.Find("gpu_execution_accelerator", &gpu_eas)) {
for (size_t ea_idx = 0; ea_idx < gpu_eas.ArraySize(); ea_idx++) {
triton::common::TritonJson::Value ea;
RETURN_IF_ERROR(gpu_eas.IndexAsObject(ea_idx, &ea));
std::string name;
RETURN_IF_ERROR(ea.MemberAsString("name", &name));
#ifdef TRITON_ENABLE_ONNXRUNTIME_TENSORRT
if (name == kTensorRTExecutionAccelerator) {
This shouldn't be a difficult task: adding MIGraphX and ROCm EP calls here should simply map to the same options used in the standard Onnxruntime API.
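For illustration, the MIGraphX branch could end up looking roughly like the fragment below, parallel to the TensorRT branch above. This is only a sketch: kMIGraphXExecutionAccelerator and the TRITON_ENABLE_ONNXRUNTIME_MIGRAPHX define would have to be added by us (mirroring the TensorRT equivalents), variable names such as soptions and instance_group_device_id follow the surrounding LoadModel() code and are assumptions here, and it leans on the legacy C factory declared in ONNX Runtime's migraphx_provider_factory.h.

#ifdef TRITON_ENABLE_ONNXRUNTIME_MIGRAPHX
            // Sketch only: handle a "migraphx" gpu_execution_accelerator entry
            // the same way the "tensorrt" one is handled above.
            if (name == kMIGraphXExecutionAccelerator) {
              // Registers the MIGraphX EP for this instance's device before the
              // ORT session is created, so MIGraphX gets first pick of the graph.
              RETURN_IF_ORT_ERROR(
                  OrtSessionOptionsAppendExecutionProvider_MIGraphX(
                      soptions, instance_group_device_id));
            }
#endif  // TRITON_ENABLE_ONNXRUNTIME_MIGRAPHX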
This came up when I was doing a search for the TRITON_ENABLE_GPU compile-time define, as this was originally gating functionality on the server. Kind of a lucky find, I suppose, as I think we would eventually have gotten the server and the backend compiled, and probably would not have gotten any output or errors at inference time.
Got further along in the process with some suggestions from Paul. Removed the hiprtc hooks and am using just hip::host now for linkage. Getting farther in the compile.
Running up against some issues with tests being compiled, as well as warnings from unused returns (nodiscard) on a few hip function calls.
E.g. one in particular:
/workspace/src/shared_memory_manager.cc: In function ‘TRITONSERVER_Error* triton::server::{anonymous}::OpenCudaIPCRegion(const hipIpcMemHandle_t*, void**, int)’:
/workspace/src/shared_memory_manager.cc:205:8: error: unused variable ‘e’ [-Werror=unused-variable]
205 | auto e = hipSetDevice(device_id);
| ^
This seems to be the only warning, but I will probably try with the --warnings-not-errors flag.
I've commented out unit tests right now since there seems to be a similar failure with one of the test units (sequence.cc), but hoping the compile flag can resolve that too.
Seems to be an issue now with libevent.
[100%] Linking CXX executable tritonserver
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/main.dir/link.txt --verbose=0
/usr/bin/ld: cannot find -levent_extra?: No such file or directory
collect2: error: ld returned 1 exit status
Libevent seems to be installed but not sure why this isn't being linked correctly still. More digging required.
The following still needs to be streamlined, but it works. This is the same example as before, but running on an AWS instance instead of one of our host machines. The biggest difference is that the AWS instance has an Nvidia GPU and runs a Cuda driver--before, Triton defaulted to using a CPU.
Note that this process requires both server and client Docker containers to be run on the same AWS instance and network with each other using --network=host. I have yet to work out how to open up the AWS instance and the server to Internet requests.
Also, I haven't explained how to create and connect to an AWS instance. Create an instance with the following attributes
Start a console
Add an ssh key pair, not shown. I did it by cutting and pasting existing keys.
sudo apt-get install -y docker docker.io gcc make linux-headers-$(uname -r) awscli
echo that installed everything except nvidia-container-toolkit and cuda-drivers which have to be fetched from Nvidia distributions. A dialog appears to restart drivers \(accept defaults\)
echo Fetch the Triton repository, go into it and fetch the models and backend config
git clone git@github.com:triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
echo "backend: \"onnxruntime\"" | tee -a model_repository/densenet_onnx/config.pbtxt
echo Go back to home directory \(optional\) to install nvidia-container-toolkit and CUDA drivers. We will have to reboot afterwards.
cd ~
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
echo Install container toolkit. A dialog appears to restart drivers \(accept defaults\)
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo apt-get upgrade -y linux-aws
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
echo for grub add the following line GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"
# sudo nano /etc/default/grub
echo "GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"" | sudo tee -a /etc/default/grub
echo
echo Installation of the CUDA drivers for the server.
echo see https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html~ for the following
echo we have already installed sudo apt-get install linux-headers-$\(uname -r\)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
echo a warning tells us to do the following
sudo apt-key del 7fa2af80
sudo apt-get update
echo Install CUDA drivers. A dialog appears to restart drivers \(accept defaults\) but another message tells us we should reboot.
sudo apt-get -y install cuda-drivers
sudo shutdown now
In a new console, after rebooting the instance, start the server:
cd ~/server/docs/examples
sudo docker run --rm --net=host --gpus=1 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models
In a second console, run the client from any directory location
echo this is second console
sudo docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.07-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
Update: removed the flags --runtime=nvidia --gpus all from the last (sudo docker) command line here. The client should have no need of a GPU and be able to run from the default runtime.
Good news finally. Got the server built now using ROCm and the hip::host libs.
Finally getting to the backend build. Need to sort out additional script pieces tomorrow for the onnxruntime backend and how the server build scripts interface with it
Cloning into 'onnxruntime'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 40 (delta 11), reused 20 (delta 2), pack-reused 0
Receiving objects: 100% (40/40), 56.45 KiB | 1.53 MiB/s, done.
Resolving deltas: 100% (11/11), done.
remote: Enumerating objects: 536, done.
remote: Counting objects: 100% (536/536), done.
remote: Compressing objects: 100% (248/248), done.
remote: Total 516 (delta 332), reused 394 (delta 222), pack-reused 0
Receiving objects: 100% (516/516), 126.90 KiB | 1.57 MiB/s, done.
Resolving deltas: 100% (332/332), completed with 15 local objects.
From https://github.com/triton-inference-server/onnxruntime_backend
* [new ref] refs/pull/231/head -> tritonbuildref
Switched to branch 'tritonbuildref'
CMake Error at CMakeLists.txt:369:
Parse error. Expected a newline, got identifier with text
"TRITON_ENABLE_ROCM".
-- Configuring incomplete, errors occurred!
error: build failed
Hello, I am really curious to know if Triton Inference Server can work on AMD GPUS ?
I always thought it would only work on Nvidia GPUs.
@MatthieuToulemont We thought so too.
> Hello, I am really curious to know if Triton Inference Server can work on AMD GPUS ?
> I always thought it would only work on Nvidia GPUs.
The existence of this issue should give you your answer. Triton is designed to allow it, but it has not yet been proven in practice. Triton allows the user to specify a backend at server startup time, and a backend can be built with a specified execution provider. We're trying to build and demonstrate this configuration with MIGraphX as the execution provider. MIGraphX is the inference engine for AMD GPUs.
Just did a sanity check on this as I was still having issues with the backend piece.
Looks like the base container of the server builds okay with ROCm. Now I need to add in the Onnxruntime piece and figure out why we're still pulling the Nvidia container instead of using the container specified here as BASE_IMAGE:
Sending build context to Docker daemon 211.6MB
Step 1/10 : ARG TRITON_VERSION=2.39.0
Step 2/10 : ARG TRITON_CONTAINER_VERSION=23.10
Step 3/10 : ARG BASE_IMAGE=rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1
Step 4/10 : FROM ${BASE_IMAGE}
---> 08497136e834
Step 5/10 : ARG TRITON_VERSION
---> Using cache
---> 2adf67eb6205
Step 6/10 : ARG TRITON_CONTAINER_VERSION
---> Using cache
---> c01b8b62ced1
Step 7/10 : COPY build/ci /workspace
---> 8db6b80fa205
Step 8/10 : WORKDIR /workspace
---> Running in f69a8e5bc47d
Removing intermediate container f69a8e5bc47d
---> de375e972281
Step 9/10 : ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
---> Running in 057ba8e50258
Removing intermediate container 057ba8e50258
---> 2f43f5ad165e
Step 10/10 : ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
---> Running in 2a71ecb47c02
Removing intermediate container 2a71ecb47c02
---> 54f780d1e798
Successfully built 54f780d1e798
Successfully tagged tritonserver_cibase:latest
Seeing an odd error now with the Onnxruntime build. Resolved a few issues when starting the ORT build.
The initial one was that the dockerfile wasn't being generated due to a slew of backed-up changes. Once that was resolved, I'm now seeing this when building off 6.0.2 using a released container with torch:
include could not find requested file:
ROCMHeaderWrapper
Getting an ORT build now (step 27). The tail-end placement of the libs seems to have changed. Sorting this out before the backend build completes.
=> [26/41] WORKDIR /workspace/onnxruntime 0.0s
=> [27/41] RUN ./build.sh --config Release --skip_submodule_sync --parallel --build_shared_lib --build_dir /workspace/build --cmake_extra_defines CMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ --update --build --use_rocm --allow_running_as_root --rocm_version "6.0.2" --rocm_home "/opt/rocm/" --use_migraphx --m 3265.4s
=> [28/41] WORKDIR /opt/onnxruntime 0.0s
=> [29/41] RUN mkdir -p /opt/onnxruntime && cp /workspace/onnxruntime/LICENSE /opt/onnxruntime && cat /workspace/onnxruntime/cmake/external/onnx/VERSION_NUMBER > /opt/onnxruntime/ort_onnx_version.txt 0.4s
=> [30/41] RUN mkdir -p /opt/onnxruntime/include && cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h /opt/onnxruntime/include && cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h /opt/onnxruntime/include && 0.4s
=> [31/41] RUN mkdir -p /opt/onnxruntime/lib && cp /workspace/build/Release/libonnxruntime_providers_shared.so /opt/onnxruntime/lib && cp /workspace/build/Release/libonnxruntime.so /opt/onnxruntime/lib 0.4s
=> [32/41] RUN mkdir -p /opt/onnxruntime/bin && cp /workspace/build/Release/onnxruntime_perf_test /opt/onnxruntime/bin && cp /workspace/build/Release/onnx_test_runner /opt/onnxruntime/bin && (cd /opt/onnxruntime/bin && chmod a+x *) 0.4s
=> [33/41] RUN cp /workspace/build/Release/libonnxruntime_providers_rocm.so /opt/onnxruntime/lib 1.1s
=> ERROR [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h /opt/onnxruntime/include && cp /workspace/build/Release/libonnxruntime_providers_migraphx.so /opt/onnxruntime/lib 0.4s
------
> importing cache manifest from tritonserver_onnxruntime:
------
------
> importing cache manifest from tritonserver_onnxruntime_cache0:
------
------
> importing cache manifest from tritonserver_onnxruntime_cache1:
------
------
> [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h /opt/onnxruntime/include && cp /workspace/build/Release/libonnxruntime_providers_migraphx.so /opt/onnxruntime/lib:
0.362 cp: cannot stat '/workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h': No such file or directory
------
Looks like the dir is supposed to be /workspace/onnxruntime/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h.
The change must have been lost between a few of the other previous change sets, or when this came up in the original build I modified the docker image and not the generator script. I'll know in a few hours if this worked.
I got it to build! @TedThemistokleous, see commit dc6db2d4932 in branch add_migraphx_rocm_hooks_v2.39.0 of the server repo, along with commit 42973f3ef5c in branch add_migraphx_rocm_onnxrt_eps in the onnx_runtime repo (your fork).
Next step: run it. As discussed, I anticipate runtime library errors.
@bpickrel confirmed build. Try to get a CPU inference run.
I've added changes for the GPU APIs now for the onnxruntime_backend since we've got things building, and am retrying a server build.
Looks like we need to also modify the backend repo and not just onnxruntime_backend, as I couldn't find the suggested kGPUIOExecutionAccelerator in the onnxruntime_backend piece, and RocmStream() is failing even though that's a part of Onnxruntime.
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc: In member function ‘TRITONSERVER_Error* triton::backend::onnxruntime::ModelState::LoadModel(const string&, TRITONSERVER_InstanceGroupKind, int32_t, std::string*, OrtSession**, OrtAllocator**, triton::backend::cudaStream_t)’:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:541:25: error: ‘kMIGraphXExecutionAccelerator’ was not declared in this scope; did you mean ‘kGPUIOExecutionAccelerator’?
541 | if (name == kMIGraphXExecutionAccelerator) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| kGPUIOExecutionAccelerator
In file included from /tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:38:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc: In constructor ‘triton::backend::onnxruntime::ModelInstanceState::ModelInstanceState(triton::backend::onnxruntime::ModelState*, TRITONBACKEND_ModelInstance*)’:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:1193:28: error: ‘RocmStream’ was not declared in this scope
1193 | &default_allocator_, RocmStream()));
| ^~~~~~~~~~
@bpickrel created a fork here and added you as a collaborator: https://github.com/TedThemistokleous/backend
It appears that they've named their stream for the onnxruntime_backend the same as the one used in cuda_stream_handle.h in Onnxruntime (CudaStream()) in their backend repo, so I'll have to modify that to get it to build the other pieces.
Got it building again. I need to go over the backend at another time to determine what other pieces we require to add here.
@bpickrel the server should build using the previous commands and I've adjusted the backend to be used to target my fork automatically.
Removing intermediate container 8aed9e383064
---> 3455c596e3e4
Step 32/32 : COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf .
---> 31daa4434336
Successfully built 31daa4434336
Successfully tagged tritonserver:latest
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
Install the buildx component to build images with BuildKit:
https://docs.docker.com/go/buildx/
Sending build context to Docker daemon 935.5MB
Step 1/10 : ARG TRITON_VERSION=2.39.0
Step 2/10 : ARG TRITON_CONTAINER_VERSION=23.10
Step 3/10 : ARG BASE_IMAGE=rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2
Step 4/10 : FROM ${BASE_IMAGE}
---> ec9926cd9bbd
Step 5/10 : ARG TRITON_VERSION
---> Using cache
---> f1faf2329e3c
Step 6/10 : ARG TRITON_CONTAINER_VERSION
---> Using cache
---> 5eb5607cee20
Step 7/10 : COPY build/ci /workspace
---> Using cache
---> 374fbed5b529
Step 8/10 : WORKDIR /workspace
---> Using cache
---> 89bb1576b9c9
Step 9/10 : ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
---> Using cache
---> 31b04fbe0bd5
Step 10/10 : ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
---> Using cache
---> bb3f2acb8a4e
Successfully built bb3f2acb8a4e
Successfully tagged tritonserver_cibase:latest
There needs to be a hipify step added for the backend repo at compile time to ensure we're handling every CUDA call to ROCm HIP. The Onnxruntime backend uses the CudaStream() specified by this core backend repo.
I think if we also hipify things, we get the benefit of the memory analysis tools they seem to use, based on a quick glance.
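As a rough sketch of what that hipify step could look like (not the actual build integration; the path and file glob are illustrative, and it assumes the -inplace and -print-stats options of the hipify-perl that ships under /opt/rocm/bin):

# Rough sketch: run hipify-perl over the core backend sources so CUDA runtime
# calls (cudaMalloc, cudaStream_t, ...) are rewritten to their HIP equivalents
# before CMake configures the project.
cd backend/src                                   # illustrative path
for f in *.cc *.h; do
  /opt/rocm/bin/hipify-perl -inplace -print-stats "$f"
done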
I'm trying the command line docker run --rm --net=host -v ${PWD}/model_repository:/models tritonserver:latest tritonserver --model-repository=/models but getting an error finding the libs. Note that we should be running the Docker image tritonserver, not tritonserver_cibase (which has CI test stuff). This build.py script builds 3 Docker images.
Some changes to work around runtime library files not being found are in server branch add_migraphx_rocm_hooks_v2.39.0 and onnxruntime_backend branch add_migraphx_rocm_onnxrt_eps_brian. But the server is still not correctly loading the model and exits immediately!
There seems to be some conflict between the branch tag "main" below and the version number given for onnxruntime in TRITON_VERSION_MAP in the Python script.
I built this with build.py using the following launch.json values:
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python Debugger: Current File with Arguments",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"args": [ "--no-container-pull", "--enable-logging",
"--enable-stats", "--enable-tracing", "--enable-rocm", "--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2",
"--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]
}
]
}
Partial success. The following output from the server shows that model densenet_onnx was finally loaded successfully, without a *.so file error, but then the server inexplicably gave up and quit. It shouldn't matter that the other models on the list aren't there. @TedThemistokleous, does the tail end of this output look like it has anything to do with what you're working on?
`I0315 23:02:11.524294 1 server.cc:662] +----------------------+---------+------------------------------------------------------------------------------+ | Model | Version | Status | +----------------------+---------+------------------------------------------------------------------------------+ | densenet_onnx | 1 | READY | | inception_graphdef | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple_dyna_sequence | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple_identity | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple_int8 | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple_sequence | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | | simple_string | 1 | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python | +----------------------+---------+------------------------------------------------------------------------------+
I0315 23:02:11.524388 1 tritonserver.cc:2458] +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ | Option | Value | +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+ | server_id | triton | | server_version | 2.39.0 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_mem | | | ory cuda_shared_memory binary_tensor_data parameters statistics trace logging | | model_repository_path[0] | /models | | model_control_mode | MODE_NONE | | strict_model_config | 0 | | rate_limit | OFF | | pinned_memory_pool_byte_size | 268435456 | | min_supported_compute_capability | 6.0 | | strict_readiness | 1 | | exit_timeout | 30 | | cache_enabled | 0 | +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
I0315 23:02:11.524421 1 server.cc:293] Waiting for in-flight requests to complete. I0315 23:02:11.524427 1 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences I0315 23:02:11.524464 1 server.cc:324] All models are stopped, unloading models I0315 23:02:11.524470 1 server.cc:331] Timeout 30: Found 1 live models and 0 in-flight non-inference requests I0315 23:02:11.524532 1 onnxruntime.cc:2843] TRITONBACKEND_ModelInstanceFinalize: delete instance state I0315 23:02:11.534314 1 onnxruntime.cc:2843] TRITONBACKEND_ModelInstanceFinalize: delete instance state I0315 23:02:11.543810 1 onnxruntime.cc:2767] TRITONBACKEND_ModelFinalize: delete model state I0315 23:02:11.543857 1 model_lifecycle.cc:603] successfully unloaded 'densenet_onnx' version 1 I0315 23:02:12.524847 1 server.cc:331] Timeout 29: Found 0 live models and 0 in-flight non-inference requests error: creating server: Internal - failed to load all models`
Successful inference on CPU (using our build but without GPU). Use the following command line, substituting your own root path for mine: docker run --name brians_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime --rm --net=host -v /home/bpickrel/Triton-server/ted_repo/server/docs/examples/model_repository:/models tritonserver tritonserver --model-repository=/models --exit-on-error=false
The exit-on-error flag is apparently necessary because our server can't load those models listed in the model repository which contain a graphdef file instead of onnx. (It also works if you delete all the extra model directories.)
That's great news!
Branches for GPU-related stuff:
backend: add_migrahx_rocm_eps onnxruntime_backend: add_migraphx_rocm_onnxrt_eps
My latest changes in onnxruntime_backend have what you need in terms of the onnxruntime API for MIGraphX/ROCm EPs, and we should just hipify. The onnxruntime_backend won't compile correctly as it requires us to hipify the backend repo. The backend branch specified, off the fork I've added you to, has an additional MIGraphX flag that the onnxruntime_backend side uses.
It appears the onnxruntime_backend is using code from the backend repo to perform the inference and model loading, which hinges on CudaStream.
I'm having trouble replicating your situation, i.e. "The onnxruntime_backend won't compile correctly as it requires us to hipify the backend repo." The behavior I expected is that if I switch to branches onnxruntime_backend:add_migraphx_rocm_onnxrt_eps and backend:add_migrahx_rocm_eps and run build.py with TRITON_ENABLE_ROCM on, then I'd see hip errors when it tried to compile the backend. I'm not seeing the expected errors. I suspect it isn't really compiling the backend. I'd like to see how you did those settings.
_Update: this didn't work. We don't want to set TRITON_ENABLE_GPU after all._
Ready to begin with hipify. I think I've got the requisite build variables either set or worked around to set up the environment for hipify. The key addition in backend/CMakeLists.txt is
if(TRITON_ENABLE_ROCM OR TRITON_ENABLE_MIGRAPHX)
set(TRITON_ENABLE_GPU ON)
endif()
which will force it to use the GPU code, since the code is written without TRITON_ENABLE_ROCM defs.
The hipify build is now a success. (I used hipify-perl and not hipify-clang.) But the job's not done yet, as the resulting Docker image somehow still isn't connecting to the GPU driver and posts an error message at runtime.
To try the build, check out Ted's git repo/branch server/add_migraphx_rocm_hooks_v2.39.0 (commit 653ae817998687ae9e9f74064b4399fe477690ca or later) and run build.py with these args as seen in the VSC launch.json. --verbose is optional, of course.
"args": [ "--no-container-pull", "--enable-logging",
"--enable-stats", "--enable-tracing",
"--enable-rocm",
// "--enable-gpu",
"--enable-metrics",
// "--enable-cpu-metrics=false",
"--verbose",
"--no-core-build" ,
"--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2",
"--ort_organization=https://github.com/TedThemistokleous", "--ort_branch=add_migraphx_rocm_onnxrt_eps",
"--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]
Looks like hip drivers aren't getting installed in any of the 3 Triton Docker images. Documentation at Installing HIP says HIP is automatically installed when ROCm is installed--but apparently this is incomplete. Looks like the hip libraries are installed but the drivers aren't. To verify, look for directory /opt/rocm/hip/ in any container.
Update: on reflection I'm not sure if this is right. ISTR that Docker containers aren't supposed to have their own drivers or access the GPU directly, but interface with the "outside world" of the operating system for access. Also not sure if any of the contents of /opt/rocm/hip/ are actual drivers.
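One quick way to check whether a given container can see the GPU at all, independent of what's baked into the image (a sketch; it assumes rocm-smi is present in the image, which it should be for anything built from the rocm/pytorch base):

# Hypothetical sanity check: with the devices passed through, rocm-smi run
# inside the container should list the GPU; if it errors out, the container
# never had GPU access in the first place.
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add=video \
    tritonserver rocm-smi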
Triton server is running with hip-ified onnxruntime_backend code, but it insists on using the CPU instead of the GPU, according to these log messages I see at startup. Note that this occurred in the process of loading the densenet_onnx model:
I0412 18:13:21.078968 1 model_lifecycle.cc:461] loading: densenet_onnx:1
I0412 18:13:21.079752 1 onnxruntime.cc:2780] TRITONBACKEND_Initialize: onnxruntime
I0412 18:13:21.079761 1 onnxruntime.cc:2790] Triton TRITONBACKEND API version: 1.16
I0412 18:13:21.079768 1 onnxruntime.cc:2796] 'onnxruntime' TRITONBACKEND API version: 1.16
I0412 18:13:21.079774 1 onnxruntime.cc:2826] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0412 18:13:21.110257 1 onnxruntime.cc:2891] TRITONBACKEND_ModelInitialize: densenet_onnx (version 1)
I0412 18:13:21.110927 1 onnxruntime.cc:826] skipping model configuration auto-complete for 'densenet_onnx': inputs and outputs already specified
I0412 18:13:21.111875 1 onnxruntime.cc:2956] TRITONBACKEND_ModelInstanceInitialize: densenet_onnx_0 (**CPU device 0**)
Need to understand why the server decided CPU device 0 was selected instead of the GPU. Apparently it's considered an attribute of the model file, even though our example densenet model configuration doesn't mention device kind. This log message is created when the server reads a model config from the config.pbtxt file and instantiates the ModelInstance class. The code is located in the onnxruntime_backend repo. The member variable being referenced is BackendModelInstance::_kind and its type is described in backend/src/backend_model_instance.cc (that is, code in the backend repository and not in the onnxruntime_backend repository):
/// TRITONSERVER_InstanceGroupKind
///
/// Kinds of instance groups recognized by TRITONSERVER.
///
typedef enum TRITONSERVER_instancegroupkind_enum {
TRITONSERVER_INSTANCEGROUPKIND_AUTO,
TRITONSERVER_INSTANCEGROUPKIND_CPU,
TRITONSERVER_INSTANCEGROUPKIND_GPU,
TRITONSERVER_INSTANCEGROUPKIND_MODEL
} TRITONSERVER_InstanceGroupKind;
Learning about the use of the InstanceGroupKind, and potentially a new implementation of it for AMD GPUs, looks to be a significant investigation in itself.
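One thing that may be worth trying in the meantime: Triton's model config can request the instance kind explicitly, so the backend would be handed KIND_GPU rather than whatever the auto-complete logic falls back to. A hedged sketch of the addition to the example model's config.pbtxt (count/device values are illustrative):

# Hypothetical addition to model_repository/densenet_onnx/config.pbtxt:
# ask for one GPU-kind instance on device 0 instead of the default CPU instance.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]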
That doesn't make sense. It should be agnostic of what the model file says as you can specify a device when using the API.
In the final image before you run an inference is there linkage to MIGraphX.so at all for Onnxruntime?
Can you go in the image itself and check available execution providers via:
python3
import onnxruntime as ort
ort.get_available_providers()
We should see all our providers (CPU, ROCm, MIGraphX). We may be loading things into the CPU EP by default when the session is invoked by Triton, but they should all be going through the same C++ API in Onnxruntime.
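If the providers do show up, a follow-up sanity check (a sketch; the model path is illustrative) is to create a session with the providers forced, to confirm the MIGraphX/ROCm EPs actually load outside of Triton:

# Sketch: force the provider order and see which providers ORT actually applied.
import onnxruntime as ort

sess = ort.InferenceSession(
    "/models/densenet_onnx/1/model.onnx",   # illustrative path
    providers=["MIGraphXExecutionProvider", "ROCMExecutionProvider",
               "CPUExecutionProvider"])
print(sess.get_providers())   # falls back to CPU-only if the GPU EPs fail to load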
/opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_migraphx.so
Recap of how to run/debug our code. The following is the result of trial-and-error learning over the past few weeks:
Check out the server repo first:
git clone git@github.com:TedThemistokleous/server.git
cd server
git switch add_migraphx_rocm_hooks_v2.39.0
To build, run the Python script build.py. This creates a script cmake_build as well as a Docker image named tritonserver_buildbase, and runs the script in the Docker image. This in turn builds two other Docker images named tritonserver_cibase and tritonserver. The latter is the one to run inferences on.
The build script cmake_build checks out the two backend repositories, inside of Docker, during the build process. We've set parameters that select Ted's Github location and the branches add_migraphx_rocm_eps (same branch name for both backend repos). This means any changes to those backend repos have to be pushed there for cmake_build to get them within Docker.
For debugging, you can sidestep the automated process by running the tritonserver_buildbase Docker image again, either with the same script or interactively. Inside a bash shell, you can run cmake_build or cut/paste individual lines from it into the command line. Suggest you skip over these first few lines when debugging, or they'll cause confusing exits:
# Exit script immediately if any command fails
set -e
set -x
Sample command lines to start up the Docker container for build:
docker run -w /workspace/build --name brian_ter_x -it --rm -v /var/run/docker.sock:/var/run/docker.sock tritonserver_buildbase /bin/bash
or
docker run -w /workspace/build --name brian_ter_x -it --rm -v /var/run/docker.sock:/var/run/docker.sock tritonserver_buildbase ./cmake_build
Beware of giving your debugging Docker instances names containing "tritonserver" because it's bug-prone: the build script does a grep search for existing instances and may not build the wanted Docker image if there's any name clutter.
Sample command line to start the server (the server-side portion of the same example inference as in the comment dated Nov. 17):
docker run --name brians_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime --rm --net=host -v /home/bpickrel/Triton-server/ted_repo/server/docs/examples/model_repository:/models tritonserver tritonserver --model-repository=/models --exit-on-error=false
Note that both the Docker image and the command it runs are called tritonserver.
In this example, I've replaced the Nvidia-centric command switch for finding the GPU device driver, --gpus=all, with our own --device=/dev/kfd --device=/dev/dri. I've left off without figuring out why it fails to find the GPU regardless.
Finally, the arguments I passed to the Python script build.py. You can enter these on the command line or run it from Visual Studio Code by pasting the following into the file launch.json as I did. Here's the whole file, but the args: block is the only critical part:
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Python Debugger: Current File with Arguments",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"args": [ "--no-container-pull", "--enable-logging",
"--enable-stats", "--enable-tracing",
"--enable-rocm",
// "--enable-gpu",
"--enable-metrics",
// "--enable-cpu-metrics=false",
"--verbose",
"--no-core-build" ,
"--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2",
"--ort_organization=https://github.com/TedThemistokleous", "--ort_branch=add_migraphx_rocm_onnxrt_eps",
"--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]
}
]
}
Putting this project aside as it's proving to be too long of a job to get it working. The previous comment explains how to pick up again, if we decide to restart it.
Can this be done by leveraging the onnxruntime work we already have as a back end?
As a preliminary step, learn to add a Cuda back end, then change it to MIGraphX/ROCm
See https://github.com/triton-inference-server/onnxruntime_backend and https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization
Documentation for building the back end is in the server docs under "Development Build of Backend or Repository Agent".