SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Finalisers blocking resource deletion #5043

Open edfincham opened 1 year ago

edfincham commented 1 year ago

Describe the bug

Expected behaviour

Environment

Resources

kevinnowland commented 1 year ago

I am seeing what I think is a related issue. This is all in kind (Kubernetes in Docker), deployed with the latest Helm charts this morning.

Setup:

  1. Deploy a server (mlserver in this case)
  2. Deploy a model that gets put onto that server
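
For reference, the setup above can be sketched with minimal SCv2 manifests. All names, the namespace, and the `storageUri` are placeholders, not taken from this thread:

```yaml
# Server the model will be scheduled onto
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
  namespace: seldon-mesh
spec:
  serverConfig: mlserver
  replicas: 1
---
# Model pinned to that server via spec.server
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: test-model
  namespace: seldon-mesh
spec:
  storageUri: gs://example-bucket/test-model  # placeholder URI
  requirements:
  - sklearn
  server: mlserver
```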

Action:

  1. Mistakenly attempt to delete the server before deleting the model. The server does not delete, which is expected, since it still has models deployed to it.
  2. Attempt to delete the model to resolve the failed server delete

Result:

  1. Model does not delete, server does not delete

Workaround:

  1. Delete the model finalizer; the model then deletes, but the server still does not
  2. Delete the server finalizer; the server then deletes.
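
For anyone hitting the same thing, the workaround can be applied with `kubectl patch`. The resource names and namespace below are examples; adjust to your setup:

```shell
# 1. Strip the finalizer from the stuck Model so its deletion completes
kubectl patch model test-model -n seldon-mesh \
  --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'

# 2. Then strip the finalizer from the Server so it deletes too
kubectl patch server mlserver -n seldon-mesh \
  --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
```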

Expectation: Deleting the model should actually delete it, and with all models removed, the server delete should then go through.

Please let me know if my expectation is out of line or my issue is not related to the original post. Thanks!

edfincham commented 1 year ago

Hey @kevinnowland, accidentally deleting a server with associated models before deleting the models themselves is definitely a scenario I have come across (and your workaround is what I've done to remedy it). However, in this case I am talking about deleting the models first :slightly_smiling_face:

kevinnowland commented 1 year ago

@edfincham True, that's a big difference. I just brought it up because I'd still expect the model to delete in this scenario, and removing the finalizer was again the solution. But you're right, it's not caused by a failed model; there's an issue with the server first.

edfincham commented 1 year ago

As a follow-up to this point: while you can force-delete the model resource by removing the finaliser, the model metadata persists within Seldon:

model           state           reason
-----           -----           ------
test-model      ModelTerminating

This makes it hard to reapply the model manifest since Seldon believes it is still in a terminating state. My current workaround here is to remove and reinstall seldon-core-v2-runtime, which isn't really an option for a production system.
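
For completeness, the remove-and-reinstall workaround looks roughly like this with Helm. The chart repo alias and namespace are assumptions based on a standard install; substitute your own:

```shell
# Tear down the runtime that holds the stale model state
helm uninstall seldon-core-v2-runtime -n seldon-mesh

# Reinstall it so the scheduler starts from a clean slate
helm install seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
  -n seldon-mesh
```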

Kolajik commented 1 year ago

@cliveseldon Would you happen to know when you will be able to have a look at this issue, please? We are currently blocked because of this and cannot deploy further; it's a pretty big issue for us 😞

dtpryce commented 1 year ago

Yeah, this is blocking us from deploying to various environments. Can we get an idea of a fix and a timeline, please?

dtpryce commented 1 year ago

@cliveseldon awesome stuff getting the finalisers added for Servers and unblocking most of our environments. I know you wanted to add them for Models too but didn't have a chance. We would still be very interested in getting this as well, mostly for our development environments. Any chance of it?

Kolajik commented 1 year ago

I've noticed that if anything at all goes wrong with the Server onto which a Model needs to be deployed (specified on the Model object via the spec.server value), the finalizers on the Model object just get stuck.

I also need to restart seldon-scheduler in order to load the model onto a Server correctly on any further attempt. There is some bad internal state in the scheduler for a Model which was not successfully loaded onto a Server.

For example, here is a scenario we discovered recently:

  1. Try to set MLSERVER_PARALLEL_WORKERS (https://mlserver.readthedocs.io/en/latest/reference/settings.html#mlserver.settings.Settings.parallel_workers) to some high number (we used 4)
  2. Have a large custom model (1+ GiB .pkl file)
  3. Set MLServer's memory to 1 GiB (both request and limit)
  4. Let the model (.pkl file) load via https://github.com/SeldonIO/MLServer/blob/master/mlserver/model.py#L58 method
    • this is happening automatically via agent and rclone container in SCv2
  5. See if the model loads (it will not, because the workers cannot all load such a huge .pkl file at the same time; they don't have enough memory available)
  6. Try to delete Model object (you cannot, because of the finalizers)
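
A rough sketch of the failing configuration from steps 1–3. Field placement and values are illustrative only; SCv2's Server CRD supports podSpec overrides, but check the schema for your version:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-oom
  namespace: seldon-mesh
spec:
  serverConfig: mlserver
  podSpec:
    containers:
    - name: mlserver
      env:
      - name: MLSERVER_PARALLEL_WORKERS
        value: "4"        # each worker tries to load its own copy of the model
      resources:
        requests:
          memory: 1Gi     # 4 workers x a 1+ GiB .pkl cannot fit in 1 GiB
        limits:
          memory: 1Gi
```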

So after the sixth step, I need to:

  1. manually remove finalizers from Model object,
  2. delete Server object,
  3. restart Seldon-Scheduler,
  4. redeploy Server and Model objects, in order to successfully trigger the loading of the model via agent and rclone into mlserver container.
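
In kubectl terms, that recovery sequence is roughly the following. Names and the namespace are examples, and whether the scheduler runs as a StatefulSet or a Deployment may depend on your install:

```shell
# 1. Strip finalizers so the stuck Model can be deleted
kubectl patch model test-model -n seldon-mesh \
  --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
kubectl delete model test-model -n seldon-mesh --ignore-not-found

# 2. Delete the Server object
kubectl delete server mlserver -n seldon-mesh

# 3. Restart the scheduler to clear its bad internal state
kubectl rollout restart statefulset/seldon-scheduler -n seldon-mesh

# 4. Redeploy Server and Model to retrigger the agent/rclone load
kubectl apply -f server.yaml -f model.yaml
```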

This is all very tiring and not production ready at all.

ramonpzg commented 11 months ago

Hello everyone. I just wanted to comment on this active issue to let you know that we are working on solving it and on improving model state management as well. We will have more details on the fixes in the new year.