bpickrel opened this issue 1 year ago
@TedThemistokleous @bpickrel I went through your guide, but when I try to run tritonserver I get the following error:
"tritonserver": executable file not found in $PATH: unknown.
It seems like the binary is not in the onnxruntime backend directory.
Also, if I start the same container without the tritonserver command, I get the following message:
bash: /opt/conda/envs/py_3.10/lib/libtinfo.so.6: no version information available (required by bash)
Did either of you encounter these issues?
The conda "no version information available" message you saw has to do with an environment variable I set in the Triton command line:
-e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime
which was itself a workaround, but I don't exactly recall the issue. If you change the ...py_3.10... portion of that to ...py_3.06..., the message goes away. I also don't know if it is an actual problem or just an informational message.
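If you want to confirm what is being picked up, a quick check inside the container is to see which libtinfo bash actually resolves (the conda path below matches this setup; adjust if yours differs):

# check which libtinfo.so.6 bash is loading inside the container
ldd /bin/bash | grep libtinfo
# if it resolves to /opt/conda/envs/py_3.10/lib/libtinfo.so.6, the conda copy is
# shadowing the system one via LD_LIBRARY_PATH, which is what triggers the warning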
@bpickrel what you mentioned fixes the message, but it looks like it's not the main issue. The binary is still missing from the docker image.
https://github.com/TedThemistokleous/server/blob/add_migraphx_rocm_hooks_v2.39.0/src/CMakeLists.txt#L53
This needs to be changed to GIT_REPOSITORY https://github.com/TedThemistokleous/backend.git
My guess is that this org separation was not present when it was first built, and now it points to a non-existent branch during the build.
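For anyone hitting the same thing before a fixed branch lands, a one-off local patch before building might look like this (the line number and the sed pattern are assumptions; check the file first):

# inspect the backend repo reference in the server sources
grep -n "GIT_REPOSITORY" src/CMakeLists.txt
# point it at the fork that actually has the branch (adjust the line number to what grep reports)
sed -i '53s|GIT_REPOSITORY .*|GIT_REPOSITORY https://github.com/TedThemistokleous/backend.git|' src/CMakeLists.txt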
I saw some issues regarding missing dependencies (e.g. libssh2) that need to be resolved when running the server, but at least the build issue for the "core" binary seems to be resolved.
The libssh2 issue is solved when using ...py_3.10..., so we can keep that. The "no version information available" message doesn't interfere with the tritonserver start.
After forcing the GPU kind in onnxruntime_backend, disabling the auto-config skip (not sure if this is needed; from the code it looked like this skip blocks the loading of the MIGraphX provider), and extending the config.pbtxt file for densenet with this:
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "migraphx"
      parameters { key: "precision_mode" value: "FP32" }
    }]
  }
}
I've got the following error:
E0522 13:36:49.042809 1879 model_lifecycle.cc:621] failed to load 'densenet_onnx' version 1: Internal: onnx runtime error 6: /workspace/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1209 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_migraphx.so with error: libmigraphx_c.so.3: cannot open shared object file: No such file or directory
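That error is the dynamic loader failing to find MIGraphX's own library rather than the ORT provider itself; a quick check inside the server container (the ROCm install path is an assumption):

# is libmigraphx_c.so.3 visible to the dynamic loader at all?
ldconfig -p | grep migraphx
# if not, locate it under the ROCm install and add that directory to LD_LIBRARY_PATH
find /opt/rocm* -name 'libmigraphx_c.so*' 2>/dev/null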
Don't know; I haven't tried that. Ted is out for a while.
We managed to get past the missing lib issues. I had to update the onnxruntime_backend to copy the MIGraphX .so files over to the tritonserver container: https://github.com/gyulaz-htec/onnxruntime_backend/commit/c108443dfb5c73d0b2f49b36aec2d21f191911d5#diff-ad5e480d1a7be6ef5d700f428d3f7da2559e36ee630e6a848a959dcdc3753832R424
I tried to force the MIGraphX provider with GPU in the code but it still falls back to CPU.
Unfortunately tritonserver automatically extends the config file and sets the instance_group kind to KIND_CPU:
"instance_group": [
{
"name": "densenet_onnx",
"kind": "KIND_CPU",
"count": 2,
"gpus": [],
...
}
],
We should see KIND_GPU there, but the core API is not allowing it. We suspect this part of the core API must be updated to fix this issue:
https://github.com/triton-inference-server/core/blob/bbcd7816997046821f9d1a22e418acb84ca5364b/src/model_config_utils.cc#L1626-L1630
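A quick way to see the config Triton actually ended up with after auto-complete is the model config endpoint (default HTTP port assumed):

# dump the auto-completed config and look at the instance_group kind
curl -s localhost:8000/v2/models/densenet_onnx/config | python3 -m json.tool | grep -A8 '"instance_group"'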
You would probably have to hipify the CUDA core codebase, or at least such routines as GetSupportedGPUs(), and set the appropriate build repo tag to use the changed version when building. I tried running hipify-perl on the entire code base but did not follow up to investigate why it didn't instantly make the GPU work. I can report that running hipify-perl on the entire code base is EXTREMELY slow (~12 hours), so you should probably either run it overnight or take the time to write a script to hipify multiple files in separate threads or processes (see the sketch below).
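For reference, a rough sketch of parallelizing that step (assumes hipify-perl is on PATH and that the usual C++/CUDA extensions cover the files you care about):

# run hipify-perl in place across the tree, one process per core
find . -type f \( -name '*.cc' -o -name '*.h' -o -name '*.cu' -o -name '*.cuh' \) -print0 \
  | xargs -0 -P "$(nproc)" -n 1 hipify-perl -inplace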
We've managed to run densenet_onnx with the MIGraphX provider on the GPU.
The code is available on my fork: https://github.com/gyulaz-htec/server/tree/add_migraphx_rocm_hooks_v2.39.0
Steps to run tritonserver:
git clone git@github.com:gyulaz-htec/server.git
cd server
git switch add_migraphx_rocm_hooks_v2.39.0
# fetch densenet_onnx model
./docs/examples/fetch_models.sh
# building the tritonserver docker image
python3 build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --enable-metrics --verbose --endpoint=grpc --image='gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2' --ort_organization=https://github.com/gyulaz-htec --ort_branch=add_migraphx_rocm_onnxrt_eps --endpoint=http --backend=onnxruntime:main --library-paths=../onnxruntime_backend/
# starting tritonserver inside docker image
docker run --name gyulas_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime:/opt/rocm-6.0.2/lib --rm --net=host -v /home/htec/gyulaz/triton/server/docs/examples/model_repository/:/models tritonserver tritonserver --model-repository=/models/ --exit-on-error=false
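Once the server is up, a couple of optional sanity checks (HTTP on the default port 8000 is assumed, which works here because of --net=host):

# 200 means the server is ready
curl -s -o /dev/null -w '%{http_code}\n' localhost:8000/v2/health/ready
# model metadata is only returned if densenet_onnx loaded successfully
curl -s localhost:8000/v2/models/densenet_onnx | python3 -m json.tool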
For testing I've used the ImageNet2012 500-image dataset from the nvcr.io/nvidia/pytorch:24.02-py3 image:
# start docker
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.04-py3-sdk /bin/bash
# download imagenet dataset
mkdir images
cd images
wget https://www.dropbox.com/s/57s11df6pts3z69/ILSVRC2012_img_val_500.tar
tar -xvf ./ILSVRC2012_img_val_500.tar
# remove the .tar because image_client can't parse it
rm ILSVRC2012_img_val_500.tar
# run imagenet client
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/
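If you only need throughput/latency numbers rather than classifications, the same SDK container also ships perf_analyzer (the flags below are the standard ones; adjust concurrency to taste):

# synthetic load against densenet_onnx over grpc
perf_analyzer -m densenet_onnx -u localhost:8001 -i grpc --concurrency-range 1:4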
We also already test with the ILSVRC2012 dataset for resnet50 as part of our examples in the ONNX Runtime inference examples repo found here:
You can reuse the dataset to get numbers for this so we can compare to existing runs using this larger dataset.
For Bert I've got a fork with an example that can be used for fp16/int8 runs. Avoid mixed precision for this.
This will let us compare Bert quickly, and I believe it can also evaluate performance output if you don't use the no_eval flag.
The code is available on my fork: https://github.com/gyulaz-htec/server/tree/add_migraphx_rocm_hooks_v2.39.0
git clone git@github.com:gyulaz-htec/server.git
cd server
git switch add_migraphx_rocm_hooks_v2.39.0
# fetch models and setup folder structure
./docs/examples/fetch_models.sh
# The ResNet50 ONNX model is only available from the ONNX model zoo. You have to download it manually with git-lfs
# from https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx
# and copy it under docs/examples/model_repository/resnet50_onnx/1
git clone https://github.com/onnx/models
cd models
git lfs pull --include="/validated/vision/classification/resnet/model/resnet50-v2-7.onnx" --exclude=""
cp validated/vision/classification/resnet/model/resnet50-v2-7.onnx /path/to/triton/server/docs/examples/model_repository/resnet50_onnx/1
# building the tritonserver docker image from the triton-server folder
python3 build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --enable-metrics --verbose --endpoint=grpc --image='gpu-base,rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2' --ort_organization=https://github.com/gyulaz-htec --ort_branch=add_migraphx_rocm_onnxrt_eps --endpoint=http --backend=onnxruntime:main --library-paths=../onnxruntime_backend/
# starting tritonserver inside docker image
docker run --name gyulas_container --device=/dev/kfd --device=/dev/dri -it -e LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/py_3.10/lib:/opt/tritonserver/backends/onnxruntime:/opt/rocm-6.0.2/lib --rm --net=host -v /home/htec/gyulaz/triton/server/docs/examples/model_repository/:/models tritonserver tritonserver --model-repository=/models/ --exit-on-error=false
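Since --exit-on-error=false keeps the server running even when a model fails to load, it's worth confirming the resnet50 model is actually available before running the client:

# returns 200 only when this specific model is loaded and ready
curl -s -o /dev/null -w '%{http_code}\n' localhost:8000/v2/models/resnet50_onnx/ready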
Client code is available on my fork: https://github.com/gyulaz-htec/client/blob/migraphx_resnet50/src/python/examples/resnet50_image_client.py
# Download and extract ILSVRC2012 validation dataset
mkdir ILSVRC2012 && cd ILSVRC2012
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir -p cal && tar -xvf ILSVRC2012_img_val.tar -C ./cal
# Download 'synset_words.txt'
wget https://raw.githubusercontent.com/HoldenCaulfieldRye/caffe/master/data/ilsvrc12/synset_words.txt
# Get development kit files 'ILSVRC2012_validation_ground_truth.txt' and 'meta.mat'.
mkdir devkit && cd devkit
wget https://raw.githubusercontent.com/miraclewkf/MobileNetV2-PyTorch/master/ImageNet/ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt
wget https://github.com/miraclewkf/MobileNetV2-PyTorch/raw/master/ImageNet/ILSVRC2012_devkit_t12/data/meta.mat
# start docker image
docker run -it --rm --net=host -v /path/to/ILSVRC2012/:/workspace/ILSVRC2012 nvcr.io/nvidia/tritonserver:23.09-py3-sdk /bin/bash
# get resnet50 client code and move it to the proper path
wget https://raw.githubusercontent.com/gyulaz-htec/client/migraphx_resnet50/src/python/examples/resnet50_image_client.py
mv resnet50_image_client.py client/src/python/examples
# start the resnet50 client in grpc async mode
python3 client/src/python/examples/resnet50_image_client.py -m resnet50_onnx -c 1 ./ILSVRC2012 -b 20 --async -c 5 -u localhost:8001 -i grpc
Current triton-server results with ResNet50 compared to the end-to-end ORT (MIGraphX) example.
The comparison was done using the ImageNet2012 50k image dataset. Precision: FP32.
Note that the server and client are running on the same machine, so response delay will be larger in a real-life application.
| mode | ORT (MGX provider) sync | Triton http(sync) | Triton http(async) | Triton grpc(sync) | Triton grpc(stream) | Triton grpc(async) |
|---|---|---|---|---|---|---|
| Inference duration, 50k images (s) | 42.52 | 156.38 | 65.89 | 159.14 | 48.33 | 44.00 |
| Average delay (ms) | 17.008 | 62.552 | 26.354 | 63.65 | 19.332 | 17.60 |
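As a sanity check on the units: with the batch size of 20 used in the client command above, 50k images is 2,500 requests, and 2,500 × 17.008 ms ≈ 42.52 s, which matches the first column, so the duration row is in seconds.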
Where is the parameter execution_accelerators used?
Can this be done by leveraging the onnxruntime work we already have as a backend? As a preliminary step, learn to add a CUDA backend, then change it to MIGraphX/ROCm.
See https://github.com/triton-inference-server/onnxruntime_backend and https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization
Documentation for building the backend is in the server docs under "Development Build of Backend or Repository Agent".