fastmachinelearning / SonicCMS

Services for Optimized Network Inference on Coprocessors (for CMS)

Use Singularity for SonicTriton examples #11

Closed kpedro88 closed 4 years ago

kpedro88 commented 4 years ago

Currently, the SonicTriton examples in `HeterogeneousCore/SonicTriton/test` require Docker for standalone use (setting up a local server). Because Docker requires superuser permission on Linux, it's preferable to use a Singularity container. An example of building a Singularity container for Triton can be found at lgray/triton-torchgeo-gat-example.

Assigned to: @kpedro88, @lgray

mialiu149 commented 4 years ago

tested on ucsd machine at least for tritonserver-20.06-v1-py3-geometric, I was still getting errors. Running with:

```
TMPDIR=/scratch/data/mliu/tmp singularity instance start \
  -B ./artifacts/models/:/models \
  --hostname gattestserver --writable \
  tritonserver-20.06-v1-py3-geometric/ gat_test_server
TMPDIR=/scratch/data/mliu/tmp singularity run --nv instance://gat_test_server \
  tritonserver --model-repository=/models >& gat_test_server.log &
sleep 2
TMPDIR=/scratch/data/mliu/tmp singularity run -B `pwd`/client:/inputs \
  --disable-cache docker://nvcr.io/nvidia/tritonserver:20.06-py3-clientsdk \
  python /inputs/client.py -m gat_test -u localhost:8001
```

```
[error] tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0909 19:10:50.794980 21 metrics.cc:164] found 6 GPUs supporting NVML metrics
I0909 19:10:50.801559 21 metrics.cc:173] GPU 0: GeForce GTX 1080 Ti
I0909 19:10:50.808530 21 metrics.cc:173] GPU 1: GeForce GTX 1080 Ti
I0909 19:10:50.815280 21 metrics.cc:173] GPU 2: GeForce GTX 1080 Ti
I0909 19:10:50.822078 21 metrics.cc:173] GPU 3: GeForce GTX 1080 Ti
I0909 19:10:50.829149 21 metrics.cc:173] GPU 4: GeForce GTX 1080 Ti
I0909 19:10:50.836247 21 metrics.cc:173] GPU 5: GeForce GTX 1080 Ti
I0909 19:10:50.836589 21 server.cc:127] Initializing Triton Inference Server
E0909 19:10:52.002003 21 server.cc:168] failed to enable peer access for some device pairs
I0909 19:10:52.018406 21 server_status.cc:55] New status tracking for model 'gat_test'
I0909 19:10:52.018501 21 model_repository_manager.cc:723] loading: gat_test:1
I0909 19:10:52.026666 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu0 on GPU 0 (6.1) using model.pt
I0909 19:11:23.125336 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu1 on GPU 1 (6.1) using model.pt
I0909 19:11:53.259708 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu2 on GPU 2 (6.1) using model.pt
I0909 19:12:21.654204 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu3 on GPU 3 (6.1) using model.pt
I0909 19:12:50.234227 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu4 on GPU 4 (6.1) using model.pt
I0909 19:13:18.267221 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu5 on GPU 5 (6.1) using model.pt
I0909 19:13:46.825164 21 model_repository_manager.cc:888] successfully loaded 'gat_test' version 1
Starting endpoints, 'inference:0' listening on
I0909 19:13:46.828337 21 grpc_server.cc:1942] Started GRPCService at 0.0.0.0:8001
I0909 19:13:46.828415 21 http_server.cc:1428] Starting HTTPService at 0.0.0.0:8000
I0909 19:13:46.870846 21 http_server.cc:1443] Starting Metrics Service at 0.0.0.0:8002
```

And the server isn't running properly when I check with curl.

kpedro88 commented 4 years ago

@mialiu149 sleep 2 might not be long enough? Otherwise, can you clarify the specific error you observe? Most of this just looks like the standard log messages printed by the server.
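Since the log above shows the model instances taking several minutes to load, a fixed `sleep 2` is fragile. A generic poll-until-ready helper (a sketch, not part of the repo; the endpoint URL in the comment is the v2 health route and is assumed for this setup) avoids guessing the startup time:

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(url, timeout=600.0, interval=5.0):
    """Poll a readiness URL until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval)
    return False

# Hypothetical usage against a local Triton server:
# wait_until_ready("http://localhost:8000/v2/health/ready")
```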

lgray commented 4 years ago

@mialiu149 are you testing it from a remote machine or on that same machine?

It looks like it's bound to 0.0.0.0 rather than an external-facing ip.
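For reference, a `0.0.0.0` bind is a wildcard: the socket listens on all interfaces, so it remains reachable through the loopback address as well. A minimal self-contained sketch (toy echo server, OS-assigned port) illustrating this:

```python
import socket
import threading

def echo_once(server_sock):
    """Accept a single connection and echo its bytes back."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(64))

# Bind to 0.0.0.0 (all interfaces), like the Triton server log shows.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 0))  # port 0 = let the OS pick one
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=echo_once, args=(srv,), daemon=True).start()

# A wildcard bind is still reachable via the loopback address.
with socket.create_connection(("127.0.0.1", port), timeout=5) as cli:
    cli.sendall(b"ping")
    reply = cli.recv(64)
print(reply)  # b'ping'
```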

mialiu149 commented 4 years ago

so the log says that server is running, and also checked with singularity:

```
singularity instance list
INSTANCE NAME    PID    IP    IMAGE
gat_test_server  23824        /scratch/data/mliu/triton-torchgeo-gat-example_singularity/tritonserver-20.06-v1-py3-geometric
```

if I check with curl:

```
curl -v localhost:8000/v2/health/ready
```

running a local client now also throws errors:

```
INFO:    Creating SIF file...
Traceback (most recent call last):
  File "/inputs/client.py", line 65, in <module>
    mconf = triton_client.get_model_config(model_name, as_json=True)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 391, in get_model_config
    raise_error_grpc(rpc_error)
  File "/usr/local/lib/python3.6/dist-packages/tritongrpcclient/__init__.py", line 49, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclientutils.InferenceServerException: [StatusCode.UNIMPLEMENTED]
```

lgray commented 4 years ago

Ah!

tritonserver-20.06-v1-py3-geometric uses the Triton version 1 API, while the tests use the version 2 API. The v1 image is the one you want for interacting with CMSSW.

For testing with the python scripts you want to use tritonserver-20.06-py3-geometric, which has the triton api V2 server.
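The API mismatch also affects health checks: the v1 and v2 servers expose different HTTP readiness paths (the v2 path appears in the curl command above; the v1 path here is from memory of the old TRTIS HTTP API and should be treated as an assumption). A tiny helper sketching the distinction:

```python
def readiness_path(api_version):
    """Return the HTTP readiness-probe path for a Triton API version.

    Note: the v1 path is recalled from the old TRTIS HTTP API and should be
    verified against the server docs for the image you are running.
    """
    paths = {1: "/api/health/ready", 2: "/v2/health/ready"}
    try:
        return paths[api_version]
    except KeyError:
        raise ValueError("unknown Triton API version: %r" % api_version)

print(readiness_path(2))  # /v2/health/ready
```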

mialiu149 commented 4 years ago

But the v1 would work with CMSSW?


mialiu149 commented 4 years ago

Ah. Yes. I will try from another machine with cmssw


mialiu149 commented 4 years ago

Sorry, but then why did the server check come back unhealthy with curl?


lgray commented 4 years ago

The first try it's attempting to establish a connection with ipv6, that fails since it's not bound, and then it tried ipv4 and succeeds.

lgray commented 4 years ago

And yeah, do the python tests with py3-geometric and the CMSSW tests with v1-py3-geometric. The torch-related libraries are the same between the two, so the models will work the same with either one; it's only the tritonserver API that's different.

mialiu149 commented 4 years ago

Worked with CMSSW from a remote client. Not sure why the server check was unhealthy with curl, though.

lgray commented 4 years ago

@mialiu149 it's succeeding, but on the second try: if you look at the logs you posted, for some reason it's trying an IPv6 address first (which isn't bound), which fails, and then it tries 127.0.0.1:8000 and succeeds.
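The IPv6-first behavior comes from name resolution: `localhost` can resolve to both `::1` and `127.0.0.1`, and curl tries the returned addresses in order (hence forcing IPv4 with `-4` below). A quick way to inspect what a client sees, as a sketch:

```python
import socket

# Resolve "localhost" the way a TCP client would before connecting.
infos = socket.getaddrinfo("localhost", 8000, proto=socket.IPPROTO_TCP)
for family, _, _, _, sockaddr in infos:
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])  # e.g. "IPv6 ::1" and/or "IPv4 127.0.0.1"

families = {family for family, *_ in infos}
print(socket.AF_INET in families)  # expect True wherever an IPv4 loopback exists
```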

lgray commented 4 years ago

Try it again with `curl -4 -v localhost:8000/v2/health/ready`

kpedro88 commented 4 years ago

Progress on this issue:

  1. Followed https://github.com/lgray/triton-torchgeo-gat-example to build Docker container w/ PyTorch libraries
  2. Container now hosted on a FastML DockerHub account: https://hub.docker.com/repository/docker/fastml/triton-torchgeo
  3. Submitted PR to have Docker containers from that repo automatically converted to Singularity and hosted on cvmfs: https://gitlab.cern.ch/unpacked/sync/-/merge_requests/58 (the automatic singularity build command is very similar to @lgray's repo: https://github.com/cvmfs/cvmfs/blob/ff7728530936f3ef93bd5578cd9933bdc480be81/ducc/lib/image.go#L358)
  4. Combined all commands and options into a single script: https://github.com/cms-sw/cmssw/compare/master...kpedro88:SonicTriton4

@lgray @mialiu149 let me know if you have any feedback before I submit the PR. The cmslpcgpu nodes are a good place to test both CPU and GPU modes (CPU mode can't be tested on normal cmslpc, because the AMD Opterons don't support AVX).
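The AVX constraint can be checked on a given node before launching the CPU server; a small Linux-only sketch reading `/proc/cpuinfo` (the path argument is there only so it can be pointed at a saved copy from another node):

```python
def cpu_has_avx(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the first 'flags' line in cpuinfo advertises AVX."""
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                # Tokenize so "avx" does not accidentally match "avx2" etc.
                return "avx" in line.split(":", 1)[1].split()
    return False

# Usage on a Linux worker node: cpu_has_avx()
```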

kpedro88 commented 4 years ago

See: https://github.com/cms-sw/cmssw/pull/31616

kpedro88 commented 4 years ago

Now merged.