triton model breaks serving instance

stephanbertl commented 1 year ago

We have set up clearml-serving on Kubernetes, including Triton support. Our Triton instance has no GPU, so deploying a model leads to the following error in the Triton instance:

E0718 07:41:21.083440 30 model_lifecycle.cc:596] failed to load 'distilbert-test2' version 1: Invalid argument: unable to load model 'distilbert-test2', TensorRT backend supports only GPU device

Trying to remove the model again is not possible: `clearml-serving --id 5097f44fe9cb45f7be2a917c6fe8cad9 model remove --endpoint distilbert-test2`

yields the following:

`clearml-serving - CLI for launching ClearML serving engine`

`2023-07-18 09:47:59,260 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9`

`2023-07-18 09:47:59,290 - clearml.Task - ERROR - Failed reloading task 5097f44fe9cb45f7be2a917c6fe8cad9`

`Error: Task ID "5097f44fe9cb45f7be2a917c6fe8cad9" could not be found`

In general, our observation is that the serving is not resilient against this kind of problem. A broken model should not break the instance.

jkhenning commented 1 year ago

Hi @stephanbertl, thanks for this report. We will look into it 🙂

stephanbertl commented 10 months ago

Any update? The serving module seems totally unstable: a model that does not work breaks the whole serving server. How is that supposed to work in production?

jkhenning commented 10 months ago

Hi @stephanbertl, I have not managed to reproduce this. Can you perhaps provide some more information? Specifically, I assume you're using the serving Helm chart, is that correct? Can you share how you configured it?

stephanbertl commented 5 months ago

@jkhenning sorry for not getting back to you earlier.

I would say the culprit is the tritonserver default value of `--exit-on-error=true`.

I quickly checked the code and could not find a way to set this in clearml-serving.
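To illustrate the idea, here is a minimal sketch of how the flag could be passed when the tritonserver command line is assembled. `build_triton_command` is a hypothetical helper and is not part of clearml-serving, and the `/models` repository path is an assumption. Only `--model-repository` and `--exit-on-error` are actual tritonserver options.

```python
import shlex


def build_triton_command(model_repository, exit_on_error=False, extra_args=()):
    """Assemble a tritonserver argv list.

    Hypothetical helper for illustration only; clearml-serving does not
    expose this function.
    """
    cmd = [
        "tritonserver",
        f"--model-repository={model_repository}",
        # tritonserver defaults this flag to true; setting it to false is
        # meant to keep the server running when a model fails to load.
        f"--exit-on-error={'true' if exit_on_error else 'false'}",
    ]
    # Any additional flags are appended unchanged.
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # "/models" is an assumed repository path.
    print(shlex.join(build_triton_command("/models")))
```

With `exit_on_error=False`, a model that cannot be loaded (such as a TensorRT model on a CPU-only node) should be reported as failed while the server and its other models stay up. I have not verified this against the clearml-serving setup.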

allegroai / clearml-serving

triton model breaks serving instance #60