Closed by kpedro88 4 years ago
Tested on the UCSD machine, at least for `tritonserver-20.06-v1-py3-geometric`; I was still getting errors. Running with:
```
TMPDIR=/scratch/data/mliu/tmp singularity instance start \
    -B ./artifacts/models/:/models \
    --hostname gattestserver --writable \
    tritonserver-20.06-v1-py3-geometric/ gat_test_server
TMPDIR=/scratch/data/mliu/tmp singularity run --nv instance://gat_test_server \
    tritonserver --model-repository=/models >& gat_test_server.log &
sleep 2
TMPDIR=/scratch/data/mliu/tmp singularity run -B $(pwd)/client:/inputs \
    --disable-cache docker://nvcr.io/nvidia/tritonserver:20.06-py3-clientsdk \
    python /inputs/client.py -m gat_test -u localhost:8001
```
```
[error] tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0909 19:10:50.794980 21 metrics.cc:164] found 6 GPUs supporting NVML metrics
I0909 19:10:50.801559 21 metrics.cc:173] GPU 0: GeForce GTX 1080 Ti
I0909 19:10:50.808530 21 metrics.cc:173] GPU 1: GeForce GTX 1080 Ti
I0909 19:10:50.815280 21 metrics.cc:173] GPU 2: GeForce GTX 1080 Ti
I0909 19:10:50.822078 21 metrics.cc:173] GPU 3: GeForce GTX 1080 Ti
I0909 19:10:50.829149 21 metrics.cc:173] GPU 4: GeForce GTX 1080 Ti
I0909 19:10:50.836247 21 metrics.cc:173] GPU 5: GeForce GTX 1080 Ti
I0909 19:10:50.836589 21 server.cc:127] Initializing Triton Inference Server
E0909 19:10:52.002003 21 server.cc:168] failed to enable peer access for some device pairs
I0909 19:10:52.018406 21 server_status.cc:55] New status tracking for model 'gat_test'
I0909 19:10:52.018501 21 model_repository_manager.cc:723] loading: gat_test:1
I0909 19:10:52.026666 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu0 on GPU 0 (6.1) using model.pt
I0909 19:11:23.125336 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu1 on GPU 1 (6.1) using model.pt
I0909 19:11:53.259708 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu2 on GPU 2 (6.1) using model.pt
I0909 19:12:21.654204 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu3 on GPU 3 (6.1) using model.pt
I0909 19:12:50.234227 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu4 on GPU 4 (6.1) using model.pt
I0909 19:13:18.267221 21 libtorch_backend.cc:232] Creating instance gat_test_0_gpu5 on GPU 5 (6.1) using model.pt
I0909 19:13:46.825164 21 model_repository_manager.cc:888] successfully loaded 'gat_test' version 1
Starting endpoints, 'inference:0' listening on
I0909 19:13:46.828337 21 grpc_server.cc:1942] Started GRPCService at 0.0.0.0:8001
I0909 19:13:46.828415 21 http_server.cc:1428] Starting HTTPService at 0.0.0.0:8000
I0909 19:13:46.870846 21 http_server.cc:1443] Starting Metrics Service at 0.0.0.0:8002
```
And the server isn't running properly when I check with curl.
@mialiu149 `sleep 2` might not be long enough? Otherwise, can you clarify the specific error you observe? Most of this just looks like the standard log messages printed by the server.
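(As an aside: rather than a fixed `sleep`, a launch script could poll until the server answers its health endpoint. A minimal generic sketch in Python follows; the helper name and defaults are hypothetical, and the actual check is left to the caller, e.g. a curl subprocess or HTTP request against `localhost:8000/v2/health/ready`.)

```python
import time

def wait_until_ready(check, timeout=120.0, interval=2.0):
    """Poll check() until it returns True or until timeout (seconds) elapses.

    check: zero-argument callable that returns True once the server is up,
           e.g. a function that curls the Triton health endpoint.
    Returns True if the server became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

For example, `wait_until_ready(lambda: subprocess.run(["curl", "-sf", "localhost:8000/v2/health/ready"]).returncode == 0)` would replace the fixed `sleep 2` with a bounded wait.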
@mialiu149 are you testing it from a remote machine or on that same machine?
It looks like it's bound to 0.0.0.0 rather than an external-facing IP.
So the log says that the server is running, and I also checked with singularity:

```
$ singularity instance list
INSTANCE NAME    PID    IP    IMAGE
gat_test_server  23824        /scratch/data/mliu/triton-torchgeo-gat-example_singularity/tritonserver-20.06-v1-py3-geometric
```
If I check with curl:

```
$ curl -v localhost:8000/v2/health/ready
> GET /v2/health/ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8000
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Content-Type: text/plain
<
```
Running a local client now also throws errors:

```
INFO:    Creating SIF file...
Traceback (most recent call last):
  File "/inputs/client.py", line 65, in <module>
```
Ah! `tritonserver-20.06-v1-py3-geometric` is the Triton version 1 API; the tests are written against the version 2 API. That v1 image is the tritonserver-geometric image you want to use for interacting with CMSSW. For testing with the Python scripts you want to use `tritonserver-20.06-py3-geometric`, which has the Triton API v2 server.
Buttt, the v1 would work with cmssw?
Ah, yes. I will try from another machine with CMSSW.
Sorry, but then why was the server check unhealthy with curl?
On the first try it attempts to establish a connection over IPv6, which fails since the server isn't bound there, and then it tries IPv4 and succeeds.
And yeah, do Python tests with `py3-geometric` and CMSSW tests with `v1-py3-geometric`.
The torch-related libraries are the same between the two, so the models will work the same with either one. It's only the tritonserver API that's different.
Worked with CMSSW from a remote client. Not sure why the server check was unhealthy with curl.
@mialiu149 it's succeeding, but on the second try: if you look at the logs you posted, for some reason it's trying an IPv6 address first (which isn't bound), which fails, and then it tries 127.0.0.1:8000 and succeeds.
Try it again with `curl -4 -v localhost:8000/v2/health/ready`
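(The IPv6-first behavior comes from how `localhost` resolves on dual-stack machines. A short, purely illustrative Python check shows the candidate-address list a client like curl walks through, and what restricting to IPv4 does; none of this is part of the actual client code.)

```python
import socket

# getaddrinfo returns candidate addresses in order; on many dual-stack
# systems the IPv6 loopback (::1) is listed before the IPv4 one
# (127.0.0.1). A client tries each candidate in turn, so a server bound
# only on IPv4 causes one failed connection attempt first.
candidates = socket.getaddrinfo("localhost", 8000, proto=socket.IPPROTO_TCP)

for family, _, _, _, sockaddr in candidates:
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])

# Restricting resolution to IPv4 (what `curl -4` does) skips the
# failing IPv6 attempt entirely:
ipv4_only = socket.getaddrinfo("localhost", 8000, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
```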
Progress on this issue:
@lgray @mialiu149 let me know if you have any feedback before I submit the PR. The cmslpcgpu nodes are a good place to test both CPU and GPU modes (CPU mode can't be tested on normal cmslpc, because the AMD Opterons don't support AVX).
Now merged.
Currently, the SonicTriton examples in HeterogeneousCore/SonicTriton/test require Docker for standalone use (setting up a local server). Because Docker requires superuser permission on Linux, it's preferable to use a Singularity container. An example of building a Singularity container for Triton can be found at lgray/triton-torchgeo-gat-example.
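(A test script could pick whichever runtime is available rather than hard-requiring Docker. The helper below is a hypothetical sketch: the function name is made up, and the flags and image/path arguments only mirror the commands shown earlier in this thread; they would need adjusting for a given site.)

```python
import shutil

def server_command(image, model_repo):
    """Build a standalone-Triton launch command, preferring Singularity
    (which needs no superuser permission) over Docker. Illustrative only."""
    if shutil.which("singularity"):
        return ["singularity", "run", "--nv", image,
                "tritonserver", "--model-repository=" + model_repo]
    if shutil.which("docker"):
        # Docker needs root (or docker-group membership) on Linux.
        return ["docker", "run", "--gpus", "all",
                "-v", model_repo + ":/models", image,
                "tritonserver", "--model-repository=/models"]
    raise RuntimeError("no container runtime found on PATH")
```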
Assigned to: @kpedro88, @lgray