ROCm / AMDMIGraphX

AMD's graph optimization engine.
https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/
MIT License

MIGraphX execution provider for Triton Inference Server #2411

Open bpickrel opened 11 months ago

bpickrel commented 11 months ago

Can this be done by leveraging the onnxruntime work we already have as a back end?

As a preliminary step, learn to add a Cuda back end, then change it to MIGraphX/ROCm

See https://github.com/triton-inference-server/onnxruntime_backend and https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization

Documentation for building the back end is in the server docs under "Development Build of Backend or Repository Agent".

TedThemistokleous commented 11 months ago

Use the following target models for testing:

resnet50, BERT, distilgpt2

Baby-step the process and see what we need / can leverage from the existing backends/execution providers.

bpickrel commented 11 months ago

Latest update:

bpickrel commented 10 months ago

Note to myself on how I ran an example. This doesn't introduce the execution provider, yet.

  1. Get the triton-inference-server repo: git clone git@github.com:triton-inference-server/server.git
  2. Go to the examples directory and fetch the example models
    cd ~/Triton-server/server/docs/examples
    ./fetch_models.sh

    Note: the model_repository directory != model-repository. We want the one with the underscore.

  3. Set the backend choice in the config file for our model: nano model_repository/densenet_onnx/config.pbtxt and add the line backend: "onnxruntime"
  4. In a different console, same directory, run a prebuilt Docker image of the server: docker run --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models

    • this takes a long time to pull the first time
    • with no --gpus=1 argument, it can run without having CUDA up; it does inference on the CPU
    • Question: what do I use to specify the Navi GPU?
    • Question: what do the port numbers do? I can use those numbers for the server, but still have to use --net=host for the client. How can I tell the client to connect to a port or a URL? (See the sketch after this list.)
  5. In the original console, run a different Docker image for the example client: docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.07-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

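On the port question above, a minimal sketch: Triton's stock endpoints are HTTP on 8000, gRPC on 8001, and metrics on 8002, so the server can be run with explicit port mappings instead of --net=host and probed over HTTP. (Whether image_client's -u flag is the right way to point the client at a URL should be confirmed against its --help.)

# map Triton's default ports (HTTP 8000, gRPC 8001, metrics 8002) instead of using --net=host
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models

# readiness probe against the KServe v2 HTTP endpoint
curl -v localhost:8000/v2/health/ready
# the example client should then be able to target the same endpoint, e.g. image_client -u localhost:8000 ...
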
Next steps: how do we call the MIGraphX execution provider? How do we use CUDA? When I get my AWS setup working right, I can run this on an EC2 instance with various GPU configs.

bpickrel commented 10 months ago

The above doesn't follow the order of Ted's earlier note: I'm running a prebuilt Docker image of the server before having built my own server.

bpickrel commented 10 months ago

Here's what the onnxruntime shared libraries look like, as installed in that server Docker I used above:

root@home-tower:/opt/tritonserver/backends# ll onnxruntime/
total 507608
drwxrwxrwx  3 triton-server triton-server      4096 Apr 18  2023 ./
drwxrwxrwx 13 triton-server triton-server      4096 Apr 18  2023 ../
-rw-rw-rw-  1 triton-server triton-server      1073 Apr 18  2023 LICENSE
drwxrwxrwx  2 triton-server triton-server      4096 Apr 18  2023 LICENSE.openvino/
-rw-rw-rw-  1 triton-server triton-server  21015528 Apr 18  2023 libonnxruntime.so
-rw-rw-rw-  1 triton-server triton-server 420751256 Apr 18  2023 libonnxruntime_providers_cuda.so
-rw-rw-rw-  1 triton-server triton-server    559152 Apr 18  2023 libonnxruntime_providers_openvino.so
-rw-rw-rw-  1 triton-server triton-server     15960 Apr 18  2023 libonnxruntime_providers_shared.so
-rw-rw-rw-  1 triton-server triton-server    548472 Apr 18  2023 libonnxruntime_providers_tensorrt.so
-rw-rw-rw-  1 triton-server triton-server  12953944 Apr 18  2023 libopenvino.so
-rw-rw-rw-  1 triton-server triton-server    288352 Apr 18  2023 libopenvino_c.so
-rw-rw-rw-  1 triton-server triton-server  32005816 Apr 18  2023 libopenvino_intel_cpu_plugin.so
-rw-rw-rw-  1 triton-server triton-server    332096 Apr 18  2023 libopenvino_ir_frontend.so
-rw-rw-rw-  1 triton-server triton-server   3684352 Apr 18  2023 libopenvino_onnx_frontend.so
lrwxrwxrwx  1 triton-server triton-server        11 Apr 18  2023 libtbb.so -> libtbb.so.2
-rw-rw-rw-  1 triton-server triton-server    438832 Apr 18  2023 libtbb.so.2
-rw-rw-rw-  1 triton-server triton-server    689616 Apr 18  2023 libtriton_onnxruntime.so
-rwxrwxrwx  1 triton-server triton-server  22923312 Apr 18  2023 onnx_test_runner*
-rwxrwxrwx  1 triton-server triton-server   3529192 Apr 18  2023 onnxruntime_perf_test*
-rw-rw-rw-  1 triton-server triton-server         7 Apr 18  2023 ort_onnx_version.txt
-rw-rw-rw-  1 triton-server triton-server      1056 Apr 18  2023 plugins.xml

TedThemistokleous commented 10 months ago

I'll need to take this over. It looks like what Brian's done works: it pulls in ONNX Runtime as a backend and then adds the TensorRT/CUDA side. I'll poke around this week/next to get this running with our flavor of ONNX Runtime with the ROCm/MIGraphX EPs.

TedThemistokleous commented 10 months ago

Yahtzee

https://github.com/triton-inference-server/onnxruntime_backend/blob/main/tools/gen_ort_dockerfile.py

TedThemistokleous commented 10 months ago

Got changes generating and building a dockerfile, @bpickrel. Once I've got this building I'll pass it back to you to generate the Docker image and run it, before passing this through another inference.

Changes are up on my fork: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py

I found that we have upstream triton images for ROCm we can leverage which I'm currently building.

https://hub.docker.com/r/rocm/oai-triton/tags

So to run the script you just need to run the following, off the ROCm 5.7 dockerfile with all the Triton Inference Server pieces:

python3 tools/gen_ort_dockerfile.py --migraphx-home=/opt/rocm --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=5.7 --enable-rocm --migraphx-version=rocm-5.7.1 --ort-version=1.7.0  --output=migx_rocm_triton_inf.dockerfile --triton-container=rocm/oai-triton:preview_2023-11-29_182

then

docker build -t migx_rocm_tritron -f migx_rocm_triton_inf.dockerfile .

That should create the Docker image you need to run an inference similar to the previous example.

TedThemistokleous commented 10 months ago

Hitting a failure when attempting to build onnxruntime this way. Currently looking into this.

TedThemistokleous commented 9 months ago

Finally some good news after debugging. @bpickrel @causten

I'm able to build a container off the generated dockerfile and get the proper hooks/link pieces working for an MIGraphX + ROCm build.

  1. Needed to rework some of the automation used to build this onnxruntime backend, which uses a different repo than the one we use to perform our ONNX Runtime builds. Had to change the rel-XXXX selection to just build ONNX Runtime main.

  2. Needed to add additional pieces from the DLM/CI dockerfiles to get ONNX Runtime to build.

It looks like the built container contains two binaries, one for perf and one for test.

All the library shared objects are there too, after popping into the container to take a look:

libonnxruntime.so  libonnxruntime.so.main  libonnxruntime_providers_migraphx.so  libonnxruntime_providers_rocm.so  libonnxruntime_providers_shared.so
root@aus-navi3x-02:/opt/onnxruntime/lib# 

There's a bin folder that seems to contain the binaries we'd use to do the perf runs. The output of these seems interesting; maybe I should also add hooks/pieces like TensorRT here. (A sample invocation follows the help text below.)

root@aus-navi3x-02:/opt/onnxruntime/bin# ./onnxruntime_perf_test 
perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
                Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times. 
        -M: Disable memory pattern.
        -A: Disable memory arena
        -I: Generate tensor input binding (Free dimensions are treated as 1.)
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
        -e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
        -b [tf|ort]: backend to use. Default:ort
        -r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
        -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
        -S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
        -P: Use parallel executor instead of sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
                Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
        -u [optimized_model_path]: Specify the optimized model path for saving.
        -d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default). 
        -q [CUDA only] use separate stream for copy. 
        -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
        -i: Specify EP specific runtime options as key value pairs. Different runtime options available are: 
            [OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
            [OpenVINO only] [device_id]: Selects a particular hardware device for inference.
            [OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
            [OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
            [OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs(Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
            [OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device(Reduces the CPU Utilization while using GPU) 
            [QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
            [QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
            [QNN only] [rpc_control_latency]: QNN rpc control latency. default to 10.
            [QNN only] [vtcm_mb]: QNN VTCM size in MB. default to 0(not set).
            [QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance', 
            'high_power_saver', 'low_balanced', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'. 
            [QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'. 
            [QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
            [QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options: 
            '0', '1', '2', '3', default is '0'.
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'

         [Example] [For OpenVINO EP] -e openvino -i "device_type|CPU_FP32 enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""
         [Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so" 

            [TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
            [TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
            [TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
            [TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
            [TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
            [TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
            [TensorRT only] [trt_int8_use_native_calibration_table]: Use Native TensorRT calibration table.
            [TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
            [TensorRT only] [trt_dla_core]: DLA core number.
            [TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
            [TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
            [TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
            [TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
            [TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
            [TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'

         [Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'
            [NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP..
            [NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
            [NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
            [NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
         [Usage]: -e <provider_name> -i '<key1> <key2>'

         [Example] [For NNAPI EP] -e nnapi -i " NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED "
            [SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'. 
            [SNPE only] [priority]: execution priority, options: 'low', 'normal'. 
            [SNPE only] [buffer_type]: options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. default: ITENSOR'. 
            [SNPE only] [enable_init_cache]: enable SNPE init caching feature, set to 1 to enabled it. Disabled by default. 
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>' 

         [Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low" 

        -T [Set intra op thread affinities]: Specify intra op thread affinity string
         [Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6 
                 Use semicolon to separate configuration between threads.
                 E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
                 The number of affinities must be equal to intra_op_num_threads - 1

        -D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra-op thread pool.
        -Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
        -h: help

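For reference, once the MIGraphX EP is hooked up, a perf run against it should look something like the following; the model path is a placeholder, and the flags come straight from the help text above.

# -e selects the execution provider; -m/-r request 100 repeated runs
./onnxruntime_perf_test -e migraphx -m times -r 100 /models/densenet_onnx/1/model.onnx
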
I've built a container on aus-navi3x-02.amd.com, named migx_rocm_tritron.

You should be able to just drop into it with:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined migx_rocm_tritron

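A quick sanity check after dropping in, just to confirm the provider libraries from the listing above made it into the image (the Python check only applies if an onnxruntime wheel was installed alongside the shared objects, which this build doesn't guarantee):

ls /opt/onnxruntime/lib | grep -E 'migraphx|rocm'
# only if a python wheel for this build is installed in the image
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
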
If you want to pick another system, check out my branch and then run the following to generate a new dockerfile:

python3 tools/gen_ort_dockerfile.py --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=6.0 --enable-rocm --migraphx-version=develop --ort-version=main  --output=migx_rocm_triton_inf.dockerfile --triton-container=compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c  

This uses build 88 of ROCm 6.0, and builds MIGraphX from develop and ONNX Runtime off main using the upstream Microsoft repo.

Upstream changes are here: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py

Branch name is add_migraphx_rocm_onnxrt_eps

bpickrel commented 9 months ago

Does your Docker contain a Triton server? I thought this was a replacement for the example docker image that had the server included.

TedThemistokleous commented 9 months ago

I thought it did, but it looks like I'm wrong here. I did see shared binaries of MIGraphX, ONNX Runtime and the MIGraphX & ROCm EPs, along with some other scripts. From an initial read of the Triton front end and their repo, it looks like we need a hook enabled/added there. It sounds like in the example we were using the Nvidia front end instead of just using ONNX Runtime and building the back-end support.

"By default, build.py does not enable any of Triton's optional features but you can enable all features, backends, and repository agents with the --enable-all flag. The -v flag turns on verbose output."

D'oh!

Looks like wishful thinking on my part, hoping we could just invoke the one container. I sense the generated backend Docker build script is then leveraged by the front end so that the missing components get built in. I'll have to dig more on the front end unless you see different hooks for onnxruntime in the main repo.

TedThemistokleous commented 9 months ago

Looks like we need to hook the container build done in the backend into the front-end server by selecting the right options. The server builds the front end and then the other components from the initial repo. It all seems to be done through CMake, which, through a series of flags, leverages the build script build.py from the server repo.

I've pushed up the changes to onnxruntime_backend (PR 231 in that repo), and from the Triton Inference Server repo I'm invoking the following after adding changes to their CMake script to incorporate ROCm/MIGraphX:

python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head

Previous attempts got me to the point where it looked like we were using the gen_ort_dockerfile.py script but failing on TensorRT-related build items, and that's what got me looking. Brian and I have been going back and forth to confirm whether we're seeing the same failures, and pair debugging.

I've now forked the server repo and run a custom build based on the requirements in their documentation, found here: https://github.com/TedThemistokleous/server/tree/add_migraphx_rocm_hooks/docs/customization_guide

I've also pushed the server-repo-side changes I'm testing into add_migraphx_rocm_hooks on my fork.

bpickrel commented 9 months ago

Don't we need --enable-gpu ?

bpickrel commented 9 months ago

Here's a tidbit from the issues page:

"By default, if GPU support is enabled, the base image is set to the Triton NGC min container; otherwise the ubuntu:22.04 image is used for a CPU-only build. If you'd like to create a Triton container that is based on another image, you can set the flag --image=base,<image> when running build.py."

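If that flag is the same --image=<kind>,<name> switch used later in this thread (e.g. --image=gpu-base,...), then pointing the base at a ROCm image would look roughly like this; treat the exact spelling of --image=base as an assumption to confirm against build.py --help:

python build.py --no-container-pull --enable-logging --enable-stats --enable-rocm \
  --image=base,rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1 \
  --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head
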
TedThemistokleous commented 8 months ago

Hitting errors with CUDA libraries now. I think I need to start adding hipify hooks for the API if I'm not including --enable-gpu as a flag.

erver/_deps/repo-backend-src/src -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-core-src/include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/src/../include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/include -O3 -DNDEBUG -fPIC -Wall -Wextra -Wno-unused-parameter -Werror -MD -MT _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -MF CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o.d -o CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -c /tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-backend-src/src/backend_output_responder.cc
/workspace/src/memory_alloc.cc:27:10: fatal error: cuda_runtime_api.h: No such file or directory
   27 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

Running things with
python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --enable-gpu --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head

bpickrel commented 8 months ago

Just found this article on hipify-clang and hipify-perl. There's a real difference: hipify-perl does string substitutions, while hipify-clang uses a semantic parser (i.e. smarter for complex programs)

https://www.admin-magazine.com/HPC/Articles/Porting-CUDA-to-HIP

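As a concrete illustration of the difference: hipify-perl is essentially a large batch of text substitutions (cudaMalloc -> hipMalloc, cudaStream_t -> hipStream_t, and so on), so a typical use is a per-file rewrite, while hipify-clang actually parses the source and needs a working clang setup. A sketch of the perl flavour, with illustrative paths:

# hipify-perl prints the converted source to stdout
hipify-perl src/memory_alloc.cc > src/memory_alloc_hip.cc
# in-place conversion with a substitution report (check the option names against hipify-perl --help)
# hipify-perl -inplace -print-stats src/*.cc
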
bpickrel commented 8 months ago

I re-ran the tutorial examples as described in my Nov. 17 comment on AWS, both to check out requirements for an AWS instance and to make sure that nothing has happened to break the examples. The tutorial still works with the same instructions, with some extra steps required for provisioning the AWS instance.

BUT I'm still getting a message that the Cuda GPU wasn't found, even though I specified instances that have a GPU. Need to investigate why.

Here are notes to myself to help streamline provisioning:

  1. I used instance type g3.8xlarge to run the server container. A smaller instance of type g4ad.4xlarge ran out of disk space for Docker. I haven't tried running server and client in different containers.
  2. Couldn't get a new SSH key to work with git, so I copied the working key files from my other computer. Must set permissions for the private key file to 600.
  3. May need to install docker and/or git. What's already installed seems to vary with the AWS instance type, and whether it runs on Amazon Linux or Ubuntu.
  4. May need to use yum instead of apt to install programs. This seems to vary with the AWS instance type too.
  5. May need to start the Docker daemon with service docker start
  6. May need to run git and docker with sudo. Need to investigate the permissions requirement, but sudo works.

bpickrel commented 8 months ago

Still trying to install the NVIDIA drivers in the AWS instance so that the Triton server will actually use the GPU. There are instructions at grid-driver but I'm currently trying to set credentials so that the aws command will work. Need to find out what GRID is and if it's what I want. Update: this page says that Tesla drivers, not Grid drivers, are for ML and other computational tasks. Update #2: this page gives a list that shows the g3.8xlarge does come with Tesla drivers installed. Back to working out why Triton says it can't find it.

Update #3: driver solved. They lied--the instance did not have a driver, but it can be installed with sudo apt install nvidia-driver-535; nvidia-smi But now it runs out of disk space when I run inference.

TedThemistokleous commented 8 months ago

Struggled a bit with trying to port over what onnxruntime was doing with hipify as they perform a custom replace after the initial hipify step for pieces of onnxruntime to compile.

Applying this to the Triton Inference Server leads into a non-obvious rabbit hole and takes away from the task at hand. I've asked Jeff Daily for help here since he's more familiar with how best to get things hipified/integrated into CMake.

In the meantime, I've gone over multiple files in the Triton server repo and run hipify-perl over them, as well as manually renamed every item tied to TRITON_ENABLE_GPU in the code for now. The intent here is to get a working compile before cleaning things up.

With a few more CMake changes I've now hit a point where I am compiling and just failing on the link step:

[ 53%] Linking CXX executable memory_alloc
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/memory_alloc.dir/link.txt --verbose=0
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseRelease(TRITONSERVER_ResponseAllocator*, void*, void*, unsigned long, TRITONSERVER_memorytype_enum, long)':
memory_alloc.cc:(.text+0xa09): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0xa18): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0xb14): undefined reference to `hipFree'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseAlloc(TRITONSERVER_ResponseAllocator*, char const*, unsigned long, TRITONSERVER_memorytype_enum, long, void*, void**, void**, TRITONSERVER_memorytype_enum*, long*)':
memory_alloc.cc:(.text+0x11df): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x11fe): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1479): undefined reference to `hipMalloc'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::gpu_data_deleter::{lambda(void*)#1}::operator()((anonymous namespace)::gpu_data_deleter) const [clone .constprop.0]':
memory_alloc.cc:(.text+0x1848): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x185b): undefined reference to `hipFree'
/usr/bin/ld: memory_alloc.cc:(.text+0x18ca): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1959): undefined reference to `hipGetErrorString'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `main':
memory_alloc.cc:(.text.startup+0x363c): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x3690): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x36f0): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4313): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x433e): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4371): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4392): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x43c8): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4439): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4a45): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4b41): undefined reference to `hipGetErrorString'
collect2: error: ld returned 1 exit status

Need to sort out whether I'm missing some sort of dependency or directory here, as this server image builds both a multi and a simple version.

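Those undefined references (hipSetDevice, hipMalloc, hipMemcpy, ...) all live in the HIP runtime library, so this looks like the memory_alloc target simply isn't being linked against it, which is what CMake's hip::host target normally takes care of. A quick way to confirm where the symbols come from, assuming ROCm is installed under /opt/rocm:

# the HIP runtime library that hip::host ultimately links against
nm -D /opt/rocm/lib/libamdhip64.so | grep -E 'hipSetDevice|hipMalloc|hipMemcpy'
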
Another thing I've noticed, which wasn't obvious when building the onnxruntime_backend portion, is that they explicitly add ONNX Runtime EP hooks/setup code for the target EP in onnxruntime_backend/src/onnxruntime.cc:

 // Add execution providers if they are requested.
  // Don't need to ensure uniqueness of the providers, ONNX Runtime
  // will check it.

  // GPU execution providers
#ifdef TRITON_ENABLE_GPU
  if ((instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_GPU) ||
      (instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_AUTO)) {
    triton::common::TritonJson::Value optimization;
    if (model_config_.Find("optimization", &optimization)) {
      triton::common::TritonJson::Value eas;
      if (optimization.Find("execution_accelerators", &eas)) {
        triton::common::TritonJson::Value gpu_eas;
        if (eas.Find("gpu_execution_accelerator", &gpu_eas)) {
          for (size_t ea_idx = 0; ea_idx < gpu_eas.ArraySize(); ea_idx++) {
            triton::common::TritonJson::Value ea;
            RETURN_IF_ERROR(gpu_eas.IndexAsObject(ea_idx, &ea));
            std::string name;
            RETURN_IF_ERROR(ea.MemberAsString("name", &name));
#ifdef TRITON_ENABLE_ONNXRUNTIME_TENSORRT
            if (name == kTensorRTExecutionAccelerator) {

Adding the MIGraphX and ROCm EP calls shouldn't be a difficult task, as they should map to the same options used in the standard ONNX Runtime API.

This came up when I was searching for the TRITON_ENABLE_GPU compile-time define, as it was originally gating functionality on the server. Kind of a lucky find, I suppose: I think we would eventually have gotten the server and the backend compiled, and probably would not have gotten any output or errors at inference time.

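For context, the way a model asks for one of these accelerators on the Triton side is the optimization block in its config.pbtxt; that is how the TensorRT branch above gets triggered. A hedged sketch of what the MIGraphX equivalent could look like once the hook exists; the accelerator name "migraphx" is an assumption and has to match whatever string the new branch checks for:

# append a hypothetical MIGraphX accelerator request to the example model's config
cat >> model_repository/densenet_onnx/config.pbtxt << 'EOF'
optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "migraphx" } ]
}}
EOF
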
TedThemistokleous commented 8 months ago

Got further along in the process with some suggestions from Paul. Removed the hiprtc hooks and am using just hip::host now for linkage. Getting farther in the compile.

Running up against some issues with the tests being compiled, as well as warnings from unused returns (nodiscard) on a few HIP function calls.

E.g. one in particular:

/workspace/src/shared_memory_manager.cc: In function ‘TRITONSERVER_Error* triton::server::{anonymous}::OpenCudaIPCRegion(const hipIpcMemHandle_t*, void**, int)’:
/workspace/src/shared_memory_manager.cc:205:8: error: unused variable ‘e’ [-Werror=unused-variable]
  205 |   auto e = hipSetDevice(device_id);
      |        ^

This seems to be the only warning, but I'll probably try the --warnings-not-errors flag.

I've commented out the unit tests for now since there seems to be a similar failure in one of the test units (sequence.cc), but I'm hoping the compile flag can resolve that too.

TedThemistokleous commented 8 months ago

Seems to be an issue now with libevent.

[100%] Linking CXX executable tritonserver
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/main.dir/link.txt --verbose=0
/usr/bin/ld: cannot find -levent_extra?: No such file or directory
collect2: error: ld returned 1 exit status

Libevent seems to be installed, but I'm not sure why it still isn't being linked correctly. More digging required.

bpickrel commented 8 months ago

Running the example on Amazon Web Services

The following still needs to be streamlined, but it works. This is the same example as before, but running on an AWS instance instead of one of our host machines. The biggest difference is that the AWS instance has an Nvidia GPU and runs a CUDA driver; before, Triton defaulted to using a CPU.

Note that this process requires both server and client Docker containers to be run on the same AWS instance and network with each other using --network=host. I have yet to work out how to open up the AWS instance and the server to Internet requests.

Also, I haven't explained how to create and connect to an AWS instance. Create an instance with the following attributes

Start a console

Add an ssh key pair, not shown. I did it by cutting and pasting existing keys.

     sudo apt-get install -y docker  docker.io  gcc make linux-headers-$(uname -r)  awscli
     echo that installed everything except nvidia-container-toolkit and  cuda-drivers which have to be fetched from Nvidia distributions.  A dialog appears to restart drivers \(accept defaults\)
     echo Fetch the Triton repository, go into it and fetch the models and backend config
     git clone git@github.com:triton-inference-server/server.git
     cd server/docs/examples
     ./fetch_models.sh
     echo "backend: \"onnxruntime\"" | tee -a model_repository/densenet_onnx/config.pbtxt

   echo   Go back to home directory \(optional\) to install nvidia-container-toolkit and CUDA drivers.  We will have to reboot afterwards.

     cd ~
     curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
     sudo apt update
     echo Install container toolkit. A dialog appears to restart drivers \(accept defaults\)
     sudo apt  install -y nvidia-container-toolkit
     sudo nvidia-ctk runtime configure --runtime=docker
     sudo apt-get upgrade -y linux-aws
     cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
     blacklist vga16fb
     blacklist nouveau
     blacklist rivafb
     blacklist nvidiafb
     blacklist rivatv
     EOF
     echo for grub add the following line    GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"
     # sudo nano /etc/default/grub
     echo "GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"" | sudo tee -a /etc/default/grub
     echo
     echo   Installation of the CUDA drivers for the server.
     echo see https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html~ for the following
     echo we have already installed sudo apt-get install linux-headers-$\(uname -r\)
     distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
     wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
     sudo dpkg -i cuda-keyring_1.0-1_all.deb
     echo a warning tells us to do the following
     sudo apt-key del 7fa2af80
     sudo apt-get update
     echo Install CUDA drivers.  A dialog appears to restart drivers \(accept defaults\) but another message tells us we should reboot.
     sudo apt-get -y install cuda-drivers
     sudo shutdown now

In a new console, after rebooting the instance. Start the server

     cd ~/server/docs/examples
     sudo docker run --rm --net=host --gpus=1 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models

In a second console, run the client from any directory location

     echo this is second console
     sudo docker run -it --rm  --net=host  nvcr.io/nvidia/tritonserver:22.07-py3-sdk  /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

Update: removed the flags --runtime=nvidia --gpus all from the last (sudo docker) command line here. The client should have no need of a GPU and be able to run from the default runtime.

TedThemistokleous commented 7 months ago

Good news finally. Got the server built now using ROCm and the hip::host libs.

Finally getting to the backend build. Need to sort out additional script pieces tomorrow for the onnxruntime backend and how the server build scripts interface with it

Cloning into 'onnxruntime'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 40 (delta 11), reused 20 (delta 2), pack-reused 0
Receiving objects: 100% (40/40), 56.45 KiB | 1.53 MiB/s, done.
Resolving deltas: 100% (11/11), done.
remote: Enumerating objects: 536, done.
remote: Counting objects: 100% (536/536), done.
remote: Compressing objects: 100% (248/248), done.
remote: Total 516 (delta 332), reused 394 (delta 222), pack-reused 0
Receiving objects: 100% (516/516), 126.90 KiB | 1.57 MiB/s, done.
Resolving deltas: 100% (332/332), completed with 15 local objects.
From https://github.com/triton-inference-server/onnxruntime_backend
 * [new ref]         refs/pull/231/head -> tritonbuildref
Switched to branch 'tritonbuildref'
CMake Error at CMakeLists.txt:369:
  Parse error.  Expected a newline, got identifier with text
  "TRITON_ENABLE_ROCM".

-- Configuring incomplete, errors occurred!
error: build failed

MatthieuToulemont commented 7 months ago

Hello, I am really curious to know if Triton Inference Server can work on AMD GPUs?

I always thought it would only work on Nvidia GPUs.

TedThemistokleous commented 7 months ago

@MatthieuToulemont We thought so too.

bpickrel commented 7 months ago

Hello, I am really curious to know if Triton Inference Server can work on AMD GPUs?

I always thought it would only work on Nvidia GPUs.

The existence of this issue should give you your answer. Triton is designed to allow it, but it has not yet been proven in practice. Triton allows the user to specify a backend at server startup time, and a backend can be built with a specified execution provider. We're trying to build and demonstrate this configuration with MIGraphX as the execution provider. MIGraphX is the inference engine for AMD GPUs.

TedThemistokleous commented 7 months ago

Just did a sanity check on this as I was still having issues with the backend piece.

Looks like the base container of the server builds okay with ROCm. Now I need to add in the ONNX Runtime piece and figure out why we're still pulling the Nvidia container instead of using the container specified here as BASE_IMAGE:

Sending build context to Docker daemon  211.6MB
Step 1/10 : ARG TRITON_VERSION=2.39.0
Step 2/10 : ARG TRITON_CONTAINER_VERSION=23.10
Step 3/10 : ARG BASE_IMAGE=rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1
Step 4/10 : FROM ${BASE_IMAGE}
 ---> 08497136e834
Step 5/10 : ARG TRITON_VERSION
 ---> Using cache
 ---> 2adf67eb6205
Step 6/10 : ARG TRITON_CONTAINER_VERSION
 ---> Using cache
 ---> c01b8b62ced1
Step 7/10 : COPY build/ci /workspace
 ---> 8db6b80fa205
Step 8/10 : WORKDIR /workspace
 ---> Running in f69a8e5bc47d
Removing intermediate container f69a8e5bc47d
 ---> de375e972281
Step 9/10 : ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
 ---> Running in 057ba8e50258
Removing intermediate container 057ba8e50258
 ---> 2f43f5ad165e
Step 10/10 : ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
 ---> Running in 2a71ecb47c02
Removing intermediate container 2a71ecb47c02
 ---> 54f780d1e798
Successfully built 54f780d1e798
Successfully tagged tritonserver_cibase:latest

TedThemistokleous commented 7 months ago

Seeing an odd error now with the Onnxruntime build. Resolved a few issues when starting the ORT build

The initial one was that the dockerfile wasn't being generated due to a slew of backed-up changes. Once that was resolved, I'm now seeing this when building off 6.0.2 using a released container with torch:

  include could not find requested file:

    ROCMHeaderWrapper

TedThemistokleous commented 7 months ago

Getting an ORT build now (step 27). The tail-end placement of the libs seems to have changed. Sorting this out before the backend build completes:

 => [26/41] WORKDIR /workspace/onnxruntime                                                                                                                                                                                                                                                                                           0.0s
 => [27/41] RUN ./build.sh --config Release --skip_submodule_sync --parallel --build_shared_lib         --build_dir /workspace/build --cmake_extra_defines CMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++  --update --build --use_rocm --allow_running_as_root --rocm_version "6.0.2" --rocm_home "/opt/rocm/" --use_migraphx --m  3265.4s
 => [28/41] WORKDIR /opt/onnxruntime                                                                                                                                                                                                                                                                                                 0.0s
 => [29/41] RUN mkdir -p /opt/onnxruntime &&         cp /workspace/onnxruntime/LICENSE /opt/onnxruntime &&         cat /workspace/onnxruntime/cmake/external/onnx/VERSION_NUMBER > /opt/onnxruntime/ort_onnx_version.txt                                                                                                             0.4s
 => [30/41] RUN mkdir -p /opt/onnxruntime/include &&         cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h         /opt/onnxruntime/include &&         cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h         /opt/onnxruntime/include &&     0.4s
 => [31/41] RUN mkdir -p /opt/onnxruntime/lib &&         cp /workspace/build/Release/libonnxruntime_providers_shared.so         /opt/onnxruntime/lib &&         cp /workspace/build/Release/libonnxruntime.so         /opt/onnxruntime/lib                                                                                           0.4s
 => [32/41] RUN mkdir -p /opt/onnxruntime/bin &&         cp /workspace/build/Release/onnxruntime_perf_test         /opt/onnxruntime/bin &&         cp /workspace/build/Release/onnx_test_runner         /opt/onnxruntime/bin &&         (cd /opt/onnxruntime/bin && chmod a+x *)                                                     0.4s
 => [33/41] RUN cp /workspace/build/Release/libonnxruntime_providers_rocm.so         /opt/onnxruntime/lib                                                                                                                                                                                                                            1.1s
 => ERROR [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h         /opt/onnxruntime/include &&         cp /workspace/build/Release/libonnxruntime_providers_migraphx.so         /opt/onnxruntime/lib                                                                    0.4s
------
 > importing cache manifest from tritonserver_onnxruntime:
------
------
 > importing cache manifest from tritonserver_onnxruntime_cache0:
------
------
 > importing cache manifest from tritonserver_onnxruntime_cache1:
------
------
 > [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h         /opt/onnxruntime/include &&         cp /workspace/build/Release/libonnxruntime_providers_migraphx.so         /opt/onnxruntime/lib:
0.362 cp: cannot stat '/workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h': No such file or directory
------

TedThemistokleous commented 7 months ago

Looks like the directory is supposed to be /workspace/onnxruntime/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h

The change must have gotten lost between a few of the previous change sets, or when this came up in the original build I modified the dockerfile and not the generator script. I'll know in a few hours if this worked.

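If that is the right location, the failing step 34 above reduces to the same copy with the corrected source path (a sketch; the real fix belongs in gen_ort_dockerfile.py rather than in the generated dockerfile):

cp /workspace/onnxruntime/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h /opt/onnxruntime/include && \
  cp /workspace/build/Release/libonnxruntime_providers_migraphx.so /opt/onnxruntime/lib
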
bpickrel commented 6 months ago

I got it to build! @TedThemistokleous , see commit dc6db2d4932 in branch add_migraphx_rocm_hooks_v2.39.0 of the server repo along with commit 42973f3ef5c in branch add_migraphx_rocm_onnxrt_eps in the onnx_runtime repo (your fork).

Next step: run it. As discussed, I anticipate runtime library errors.

TedThemistokleous commented 6 months ago

@bpickrel confirmed build. Try to get a CPU inference run.

I've added changes for the GPU APIs now for the onnxruntime_backend since we've got things building and retrying a server build.

TedThemistokleous commented 6 months ago

Looks like we also need to modify the backend repo and not just onnxruntime_backend, as I couldn't find the suggested kGPUIOExecutionAccelerator in the onnxruntime_backend piece, and RocmStream() is failing even though that's part of ONNX Runtime.

/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc: In member function ‘TRITONSERVER_Error* triton::backend::onnxruntime::ModelState::LoadModel(const string&, TRITONSERVER_InstanceGroupKind, int32_t, std::string*, OrtSession**, OrtAllocator**, triton::backend::cudaStream_t)’:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:541:25: error: ‘kMIGraphXExecutionAccelerator’ was not declared in this scope; did you mean ‘kGPUIOExecutionAccelerator’?
  541 |             if (name == kMIGraphXExecutionAccelerator) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                         kGPUIOExecutionAccelerator
In file included from /tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:38:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc: In constructor ‘triton::backend::onnxruntime::ModelInstanceState::ModelInstanceState(triton::backend::onnxruntime::ModelState*, TRITONBACKEND_ModelInstance*)’:
/tmp/tritonbuild/onnxruntime_backend/src/onnxruntime.cc:1193:28: error: ‘RocmStream’ was not declared in this scope
 1193 |       &default_allocator_, RocmStream()));
      |                            ^~~~~~~~~~

@bpickrel created a fork here and added you as a collaborator: https://github.com/TedThemistokleous/backend

It appears that in their backend repo they've named their stream for the onnxruntime_backend the same as the one used in cuda_stream_handle.h in ONNX Runtime (CudaStream()), so I'll have to modify that to get the other pieces to build.

TedThemistokleous commented 6 months ago

Got it building again. I need to go over the backend at another time to determine what other pieces we require to add here.

@bpickrel the server should build using the previous commands and I've adjusted the backend to be used to target my fork automatically.

Removing intermediate container 8aed9e383064
 ---> 3455c596e3e4
Step 32/32 : COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf .
 ---> 31daa4434336
Successfully built 31daa4434336
Successfully tagged tritonserver:latest
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  935.5MB
Step 1/10 : ARG TRITON_VERSION=2.39.0
Step 2/10 : ARG TRITON_CONTAINER_VERSION=23.10
Step 3/10 : ARG BASE_IMAGE=rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2
Step 4/10 : FROM ${BASE_IMAGE}
 ---> ec9926cd9bbd
Step 5/10 : ARG TRITON_VERSION
 ---> Using cache
 ---> f1faf2329e3c
Step 6/10 : ARG TRITON_CONTAINER_VERSION
 ---> Using cache
 ---> 5eb5607cee20
Step 7/10 : COPY build/ci /workspace
 ---> Using cache
 ---> 374fbed5b529
Step 8/10 : WORKDIR /workspace
 ---> Using cache
 ---> 89bb1576b9c9
Step 9/10 : ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
 ---> Using cache
 ---> 31b04fbe0bd5
Step 10/10 : ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
 ---> Using cache
 ---> bb3f2acb8a4e
Successfully built bb3f2acb8a4e
Successfully tagged tritonserver_cibase:latest

There needs to be a hipify step added for the backend repo at compile time to ensure we're translating every CUDA call to ROCm HIP. The onnxruntime backend uses the CudaStream() specified by this core backend repo.

I think if we also hipify things, we get the benefit of the memory-analysis tools they seem to use, from a quick glance.

bpickrel commented 6 months ago

I'm trying the command line docker run --rm --net=host -v ${PWD}/model_repository:/models tritonserver:latest tritonserver --model-repository=/models but getting an error finding the libs. Note that we should be running the Docker image tritonserver, not tritonserver_cibase (which has CI test stuff). This build.py script builds 3 Docker images.

bpickrel commented 6 months ago

Some changes to work around runtime library files not being found are in server/branch add_migraphx_rocm_hooks_v2.39.0 and onnxruntime_backend/branch add_migraphx_rocm_onnxrt_eps_brian. But the server is still not correctly loading the model and exits immediately!

There seems to be some conflict between the branch tag "main" below and the version number given for onnxruntime in TRITON_VERSION_MAP in the Python script.

I built this with build.py using the following launch.json values:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Current File with Arguments",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "args": [ "--no-container-pull", "--enable-logging",
             "--enable-stats", "--enable-tracing",  "--enable-rocm",  "--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2", 
            "--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]

        }
    ]
}

bpickrel commented 6 months ago

Partial success. The following output from the server shows that model densenet_onnx was finally loaded successfully, without a *.so file error, but then the server inexplicably gave up and quit. It shouldn't matter that the other models on the list aren't there. @TedThemistokleous, does the tail end of this output look like it has anything to do with what you're working on?

I0315 23:02:11.524294 1 server.cc:662]
+----------------------+---------+------------------------------------------------------------------------------+
| Model                | Version | Status                                                                       |
+----------------------+---------+------------------------------------------------------------------------------+
| densenet_onnx        | 1       | READY                                                                        |
| inception_graphdef   | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple               | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple_dyna_sequence | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple_identity      | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple_int8          | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple_sequence      | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
| simple_string        | 1       | UNAVAILABLE: Internal: failed to stat file /opt/tritonserver/backends/python |
+----------------------+---------+------------------------------------------------------------------------------+

I0315 23:02:11.524388 1 tritonserver.cc:2458]
+----------------------------------+-------------------------------------------------------------------------------+
| Option                           | Value                                                                         |
+----------------------------------+-------------------------------------------------------------------------------+
| server_id                        | triton                                                                        |
| server_version                   | 2.39.0                                                                        |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) |
|                                  | schedule_policy model_configuration system_shared_memory cuda_shared_memory  |
|                                  | binary_tensor_data parameters statistics trace logging                       |
| model_repository_path[0]         | /models                                                                       |
| model_control_mode               | MODE_NONE                                                                     |
| strict_model_config              | 0                                                                             |
| rate_limit                       | OFF                                                                           |
| pinned_memory_pool_byte_size     | 268435456                                                                     |
| min_supported_compute_capability | 6.0                                                                           |
| strict_readiness                 | 1                                                                             |
| exit_timeout                     | 30                                                                            |
| cache_enabled                    | 0                                                                             |
+----------------------------------+-------------------------------------------------------------------------------+

I0315 23:02:11.524421 1 server.cc:293] Waiting for in-flight requests to complete.
I0315 23:02:11.524427 1 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I0315 23:02:11.524464 1 server.cc:324] All models are stopped, unloading models
I0315 23:02:11.524470 1 server.cc:331] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0315 23:02:11.524532 1 onnxruntime.cc:2843] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0315 23:02:11.534314 1 onnxruntime.cc:2843] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0315 23:02:11.543810 1 onnxruntime.cc:2767] TRITONBACKEND_ModelFinalize: delete model state
I0315 23:02:11.543857 1 model_lifecycle.cc:603] successfully unloaded 'densenet_onnx' version 1
I0315 23:02:12.524847 1 server.cc:331] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

bpickrel commented 6 months ago

Successful inference on CPU (using our build but without GPU). Use the following command line, substituting your own root path for mine:

docker run --name brians_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime --rm --net=host -v /home/bpickrel/Triton-server/ted_repo/server/docs/examples/model_repository:/models tritonserver tritonserver --model-repository=/models --exit-on-error=false

The --exit-on-error=false is apparently necessary because our server can't load the models in the model repository that contain a graphdef file instead of onnx. (It also works if you delete all the extra model directories.)

TedThemistokleous commented 6 months ago

That's great news!

TedThemistokleous commented 6 months ago

Branches for gpu related stuff

backend: add_migrahx_rocm_eps
onnxruntime_backend: add_migraphx_rocm_onnxrt_eps

My latest changes in onnxruntime_backend have what you need in terms of the ONNX Runtime API for the MIGraphX/ROCm EPs, and we should just hipify. The onnxruntime_backend won't compile correctly as it requires us to hipify the backend repo. The backend branch specified, off the fork I've added you to, has an additional MIGraphX flag that the onnxruntime_backend side uses.

It appears the onnxruntime_backend is using code from the backend repo to perform the inference and model loading which hinges on CudaStream.

bpickrel commented 6 months ago

I'm having trouble replicating your situation i.e. "The onnxruntime_backend won't compile correctly as it requires us to hipify the backend repo." The behavior I expected is that if I switch to branches onnxruntime_backend:add_migraphx_rocm_onnxrt_eps and backend: add_migrahx_rocm_eps and run build.py with TRITON_ENABLE_ROCM on, then I'd see hip errors when it tried to compile the backend. I'm not seeing the expected errors. I suspect it isn't really compiling the backend. I'd like to see how you did those settings.

bpickrel commented 6 months ago

Update: this didn't work. We don't want to set TRITON_ENABLE_GPU after all. Ready to begin with hipify. I think I've got the requisite build variables either set or worked around to set up the environment for hipify. The key addition in backend/CMakeLists.txt is

if(TRITON_ENABLE_ROCM OR TRITON_ENABLE_MIGRAPHX)
  set(TRITON_ENABLE_GPU ON)
endif()

which will force it to use the GPU code, since the code is written without TRITON_ENABLE_ROCM defs.

bpickrel commented 6 months ago

The hipify build is now a success. (I used hipify-perl and not hipify-clang.) But the job's not done yet, as the resulting Docker image somehow still isn't connecting to the GPU driver and posts an error message at runtime.

To try the build, check out Ted's git repo/branch server/add_migraphx_rocm_hooks_v2.39.0 (commit 653ae817998687ae9e9f74064b4399fe477690ca or later) and run build.py with these args as seen in VSC launch.json. --verbose is optional, of course.

            "args": [ "--no-container-pull", "--enable-logging",
             "--enable-stats", "--enable-tracing",  
             "--enable-rocm",
            // "--enable-gpu",
             "--enable-metrics",
             // "--enable-cpu-metrics=false",
             "--verbose",
             "--no-core-build" ,
             "--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2", 
             "--ort_organization=https://github.com/TedThemistokleous", "--ort_branch=add_migraphx_rocm_onnxrt_eps",
            "--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]
bpickrel commented 5 months ago

Looks like the HIP drivers aren't getting installed in any of the 3 Triton Docker images. Documentation at Installing HIP says HIP is automatically installed when ROCm is installed, but apparently this is incomplete. Looks like the HIP libraries are installed but the drivers aren't. To verify, look for the directory /opt/rocm/hip/ in any container.

Update: on reflection, I'm not sure this is right. ISTR that Docker containers aren't supposed to have their own drivers or access the GPU directly, but instead interface with the "outside world" of the operating system for access. I'm also not sure whether any of the contents of /opt/rocm/hip/ are actual drivers.
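
A few quick, illustrative checks from a shell inside the running container can help narrow this down (paths assume a standard ROCm install under /opt/rocm):

ls /opt/rocm/hip                       # are the HIP runtime files present at all?
ls -l /dev/kfd /dev/dri                # did --device pass the GPU device nodes through?
/opt/rocm/bin/rocminfo | grep -i gfx   # can the ROCm runtime enumerate a GPU agent?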

bpickrel commented 5 months ago

Triton server is running with hip-ified onnxruntime_backend code, but it insists on using the CPU instead of the GPU, according to these log messages I see at startup. Note that this occurred in the process of loading the densenet_onnx model:

I0412 18:13:21.078968 1 model_lifecycle.cc:461] loading: densenet_onnx:1
I0412 18:13:21.079752 1 onnxruntime.cc:2780] TRITONBACKEND_Initialize: onnxruntime
I0412 18:13:21.079761 1 onnxruntime.cc:2790] Triton TRITONBACKEND API version: 1.16
I0412 18:13:21.079768 1 onnxruntime.cc:2796] 'onnxruntime' TRITONBACKEND API version: 1.16
I0412 18:13:21.079774 1 onnxruntime.cc:2826] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0412 18:13:21.110257 1 onnxruntime.cc:2891] TRITONBACKEND_ModelInitialize: densenet_onnx (version 1)
I0412 18:13:21.110927 1 onnxruntime.cc:826] skipping model configuration auto-complete for 'densenet_onnx': inputs and outputs already specified
I0412 18:13:21.111875 1 onnxruntime.cc:2956] TRITONBACKEND_ModelInstanceInitialize: densenet_onnx_0 (**CPU device 0**)

Need to understand why the server decided that CPU device 0 should be used instead of the GPU. Apparently it's considered an attribute of the model file, even though our example densenet model configuration doesn't mention a device kind. This log message is created when the server reads a model config from the config.pbtxt file and instantiates the ModelInstance class. The code is located in the onnxruntime_backend repo. The member variable being referenced is BackendModelInstance::_kind, and its type is described in backend/src/backend_model_instance.cc (that is, code in the backend repository, not in the onnxruntime_backend repository):

/// TRITONSERVER_InstanceGroupKind
///
/// Kinds of instance groups recognized by TRITONSERVER.
///
typedef enum TRITONSERVER_instancegroupkind_enum {
  TRITONSERVER_INSTANCEGROUPKIND_AUTO,
  TRITONSERVER_INSTANCEGROUPKIND_CPU,
  TRITONSERVER_INSTANCEGROUPKIND_GPU,
  TRITONSERVER_INSTANCEGROUPKIND_MODEL
} TRITONSERVER_InstanceGroupKind;

Learning about the use of InstanceGroupKind, and potentially a new implementation of it for AMD GPUs, looks to be a significant investigation in itself.
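
As a possible shortcut while that investigation continues, Triton's model configuration allows pinning a model to the GPU explicitly with an instance_group entry. A sketch for the example model, assuming the stock config.pbtxt layout:

# Append an explicit GPU instance group to the densenet_onnx model config
# (illustrative only; adjust count/gpus as needed).
cat >> model_repository/densenet_onnx/config.pbtxt <<'EOF'
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
EOF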

TedThemistokleous commented 5 months ago

That doesn't make sense. It should be agnostic of what the model file says as you can specify a device when using the API.

In the final image, before you run an inference, is there linkage to MIGraphX.so at all for Onnxruntime?

Can you go into the image itself and check the available execution providers via

python3
import onnxruntime as ort

ort.get_available_providers()

We should see all of our providers (CPU, ROCm, MIGraphX). We may be loading something into the CPU EP by default when the Session is invoked by Triton, but they should all be using the same C++ API through Onnxruntime.
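
One way to run that check non-interactively against the built image (a sketch, assuming the onnxruntime Python wheel is actually installed in the image):

docker run --rm --device=/dev/kfd --device=/dev/dri tritonserver \
  python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"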

bpickrel commented 5 months ago

/opt/tritonserver/backends/onnxruntime/libonnxruntime_providers_migraphx.so

bpickrel commented 5 months ago

Recap of how to run/debug our code. The following is the result of trial-and-error learning over the past few weeks:

To build, run the Python script build.py. This creates a script cmake_build as well as a Docker image named tritonserver_buildbase, and runs the script in the Docker image. This in turn builds two other Docker images named tritonserver_cibase and tritonserver. The latter is the one to run inferences on.

The build script cmake_build checks out the two backend repositories, inside of Docker, during the build process. We've set parameters that select Ted's Github location and the branches add_migraphx_rocm_eps (same branch name for both backend repos). This means:

  1. If you're just building, you don't have to bother checking out onnxruntime-backend and backend yourself.
  2. If you want to edit code in either onnxruntime-backend or backend, you must not only check out a copy into your bash environment for editing, but also push your changes to Github for cmake_build to pick up inside Docker (see the sketch after this list).
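
A sketch of that edit-and-push loop for the backend repo (the fork URL is an assumption; substitute whichever remote and branch cmake_build is configured to pull from):

# Clone the branch cmake_build is pinned to, edit locally, then push so the
# in-Docker checkout picks up the change on the next build.
git clone -b add_migrahx_rocm_eps https://github.com/TedThemistokleous/backend.git
cd backend
# ...edit sources...
git commit -am "hipify backend model instance handling"
git push origin add_migrahx_rocm_eps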

For debugging, you can sidestep the automated process by running the tritonserver_buildbase Docker image again, either with the same script or interactively. Inside a bash shell, you can run cmake_build or cut-and-paste individual lines from it into the command line. I suggest you skip over these first few lines when debugging, or they'll cause confusing exits:

# Exit script immediately if any command fails
set -e
set -x

Sample command lines to start up the Docker container for build:

docker run -w /workspace/build --name brian_ter_x -it --rm  -v /var/run/docker.sock:/var/run/docker.sock tritonserver_buildbase /bin/bash

or

docker run -w /workspace/build --name brian_ter_x -it --rm  -v /var/run/docker.sock:/var/run/docker.sock tritonserver_buildbase ./cmake_build

Beware of giving your debugging Docker instances names containing "tritonserver" because it's bug-prone: the build script does a grep search for existing instances and may not build the wanted Docker image if there's any name clutter.

Sample command line to start the server (the server-side portion of the same example inference as in the comment dated Nov. 17):

docker run --name brians_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime --rm --net=host -v /home/bpickrel/Triton-server/ted_repo/server/docs/examples/model_repository:/models tritonserver  tritonserver --model-repository=/models --exit-on-error=false

Note that both the Docker image and the command it runs are called tritonserver. In this example, I've replaced the Nvidia-centric command switch for finding the GPU device driver, --gpus=all, with our own --device=/dev/kfd --device=/dev/dri. I've left off without figuring out why it still fails to find the GPU.
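
One more variation worth trying, based on common ROCm-in-Docker guidance rather than anything verified here: the container user typically also needs membership in the video and render groups to open /dev/kfd and /dev/dri.

# Same server invocation, with the supplementary groups added.
docker run --name brians_container --device=/dev/kfd --device=/dev/dri \
  --group-add video --group-add render \
  -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime \
  --rm --net=host \
  -v /home/bpickrel/Triton-server/ted_repo/server/docs/examples/model_repository:/models \
  tritonserver tritonserver --model-repository=/models --exit-on-error=false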

Finally, here are the arguments I passed to the Python script build.py. You can enter these on the command line or run build.py from Visual Studio Code by pasting the following into the file launch.json, as I did. Here's the whole file, but the "args": block is the only critical part:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Current File with Arguments",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "args": [ "--no-container-pull", "--enable-logging",
             "--enable-stats", "--enable-tracing",  
             "--enable-rocm",
            //  "--enable-gpu",
             "--enable-metrics",
             // "--enable-cpu-metrics=false",
             "--verbose",
             "--no-core-build" ,
             "--endpoint=grpc", "--image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2", 
             "--ort_organization=https://github.com/TedThemistokleous", "--ort_branch=add_migraphx_rocm_onnxrt_eps",
            "--endpoint=http", "--backend=onnxruntime:main", "--library-paths=../onnxruntime_backend/"]

        }
    ]
}
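
For reference, the same invocation assembled as a single command line (derived directly from the args block above; flag spellings are unchanged):

python3 build.py --no-container-pull --enable-logging --enable-stats --enable-tracing \
    --enable-rocm --enable-metrics --verbose --no-core-build \
    --endpoint=grpc --endpoint=http \
    --image=gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2 \
    --ort_organization=https://github.com/TedThemistokleous \
    --ort_branch=add_migraphx_rocm_onnxrt_eps \
    --backend=onnxruntime:main --library-paths=../onnxruntime_backend/
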
bpickrel commented 5 months ago

Putting this project aside as it's proving to be too long of a job to get it working. The previous comment explains how to pick up again, if we decide to restart it.