edfincham opened 1 year ago
I am seeing what I think is a related issue. This is all in kind, deployed with the latest Helm charts this morning.
Setup:
Action:
Result:
Workaround:
Expectation: Deleting the model should actually delete the model, and with all models removed, the server delete should go through.
Please let me know if my expectation is out of line or my issue is not related to the original post. Thanks!
Hey @kevinnowland, accidentally deleting a server with associated models before deleting the models themselves is definitely a scenario I have come across (and your workaround is what I've done to remedy it). However, in this case I am talking about deleting the models first 🙂
@edfincham True, that's a big difference. I just brought it up because I'd still expect the model to delete in this scenario and removing the finalizer was again a solution. But you're right, it's not caused by a failed model, there's an issue with the server first.
As a follow-up to this point, while you can force-delete the model resource by removing the finaliser, the model metadata persists within Seldon:

```
model       state             reason
-----       -----             ------
test-model  ModelTerminating
```

This makes it hard to reapply the model manifest since Seldon believes it is still in a terminating state. My current workaround here is to remove and reinstall `seldon-core-v2-runtime`, which isn't really an option for a production system.
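For reference, the force-delete described above boils down to a JSON merge patch that nulls out the finalizer list. A minimal sketch (the `model` resource kind and `test-model` name are taken from this thread; adjust for your cluster):

```python
import json

def finalizer_clear_patch() -> str:
    """Build a JSON merge patch that removes all finalizers from a resource.

    Apply it with something like:
      kubectl patch model test-model --type merge -p '<patch>'
    which lets a stuck deletion complete, at the cost of skipping whatever
    cleanup the finalizer was supposed to perform.
    """
    return json.dumps({"metadata": {"finalizers": None}})

print(finalizer_clear_patch())
```

Note this only removes the Kubernetes object; as described above, Seldon's own model metadata can still linger afterwards.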
@cliveseldon Would you happen to know when you'll be able to have a look at this issue, please? We are currently blocked because of this and cannot deploy further, as it's a pretty big issue for us 😞
Yeah, this is blocking us from deploying to various environments. Please can we get an idea of a fix and a timeline?
@cliveseldon awesome stuff getting the finalisers added for Servers and unblocking most of our environments. I know you wanted to add Models too but didn't have a chance. We would still be very interested in getting this as well, mostly for our development environments. Any chance of this?
I've noticed that if anything at all goes wrong with the Server onto which a Model needs to be deployed (specified in the Model object via the `spec.server` value), the finalizers on the Model object are just stuck.

I also happen to need to restart `seldon-scheduler` in order to correctly load the model onto a Server again on any further try. There's some bad state internally in the scheduler for a Model which was not successfully loaded onto a Server.
I.e. this example we discovered last time:

- Set `MLSERVER_PARALLEL_WORKERS` (https://mlserver.readthedocs.io/en/latest/reference/settings.html#mlserver.settings.Settings.parallel_workers) to some high number (we've had 4)
- … (`.pkl` file)
- … (`.pkl` file) load via the https://github.com/SeldonIO/MLServer/blob/master/mlserver/model.py#L58 method
- … the `agent` and `rclone` container in SCv2
- … (`.pkl` file at the same time - they don't have memory available)

So after the sixth step, I need to:

- … `agent` and `rclone` into the `mlserver` container.

This is all very tiring and not production ready at all.
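Since the failure above starts with a high `MLSERVER_PARALLEL_WORKERS` value, one mitigation is pinning the worker count in MLServer's `settings.json` instead of the environment variable. A minimal config sketch (the value `1` is just an example; `parallel_workers` is the field documented in the MLServer settings reference linked above):

```json
{
  "parallel_workers": 1
}
```

With a single worker there is only one `.pkl` load at a time, which sidesteps the out-of-memory failure mode, though obviously at the cost of inference parallelism.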
Hello everyone. I just wanted to comment on this active issue to let you know that we are working on solving it and on improving model state management as well. We will have more details on the fixes in the new year.
**Describe the bug**

- … `"ModelTerminating"` state when listing models via the CLI
- … (`"ModelTerminating"` state)

**Expected behaviour**

- `kubectl delete model/name` to remove model resources without issue

**Environment**

**Resources**