NVIDIA / ai-assisted-annotation-client

Client side integration example source code and libraries for AI-Assisted Annotation SDK

AIAA server stops responding after a few days #90

Closed (lassoan closed 3 years ago)

lassoan commented 3 years ago

Every few days, the AIAA server stops responding to model requests (http://perklabseg.asuscomm.com:5000/v1/models times out), while the server API tester (http://perklabseg.asuscomm.com:5000/) and the logs (http://perklabseg.asuscomm.com:5000/logs/) still work. It recovers by itself after about 5-10 minutes. See the logs here: https://pastebin.com/9kpDn9WW (search for >>>>> to see where the error happened and when the server recovered).
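
The symptom above (root page answers, /v1/models times out) can be checked from a script. The following is only a minimal sketch using the Python standard library; the helper names `responds` and `diagnose`, the 10-second timeout, and the status strings are illustrative assumptions, not part of AIAA or its client.

```python
import urllib.error
import urllib.request

# Server address taken from the report above; replace with your own host.
BASE_URL = "http://perklabseg.asuscomm.com:5000"


def responds(url: str, timeout: float = 10.0) -> bool:
    """Return True if `url` answers with an HTTP 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures and timeouts.
        return False


def diagnose(root_ok: bool, models_ok: bool) -> str:
    """Classify the server state from two probe results (labels are hypothetical)."""
    if root_ok and models_ok:
        return "healthy"
    if root_ok:
        # Matches the reported outage: API tester page works, model listing hangs.
        return "model endpoint hung"
    return "server unreachable"
```

Calling `diagnose(responds(BASE_URL + "/"), responds(BASE_URL + "/v1/models"))` periodically would distinguish this outage from the whole server being down.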

lassoan commented 3 years ago

The server stopped responding again. This time it has not recovered by itself. See logs here: https://pastebin.com/zsuqATe7

Stopping and starting the server fixed the issue.

These outages, and the need to restart the server manually from time to time, are quite inconvenient. Do you have any recommendation on how to fix this?

SachidanandAlle commented 3 years ago

@YuanTingHsieh can you take a look at this issue?

YuanTingHsieh commented 3 years ago

@SachidanandAlle Thanks for tagging me.

@lassoan Thanks for the report. The AIAA server currently has a bug in how AIAA uses the "grpc" communication protocol with the Triton server. The bug is triggered only when the Triton server is restarting or not responding at that moment.

A quick workaround on your side: if you are using docker-compose to start the server, change the Triton protocol variable from TRITON_PROTO=grpc to TRITON_PROTO=http in your docker-compose.env file.
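
For anyone who prefers to script the change, here is a minimal sketch. It assumes docker-compose.env holds one KEY=value pair per line; `set_env_var` is a hypothetical helper written for this example, not something shipped with AIAA.

```python
def set_env_var(text: str, key: str, value: str) -> str:
    """Return `text` with every `key=...` line set to `key=value`.

    If no such line exists, the assignment is appended at the end.
    Commented-out lines (e.g. `#TRITON_PROTO=grpc`) are left untouched.
    """
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Reading docker-compose.env, passing its contents through `set_env_var(text, "TRITON_PROTO", "http")`, and writing the result back applies the workaround; the server then has to be stopped and started again for the new value to take effect.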

We will include a few bug fixes in the next release (4.1), and I will make sure this one is among them.

lassoan commented 3 years ago

Thanks a lot for your help, I've changed the protocol to http. I'll close the issue now (and reopen in case it occurs again).