(helm-charts): Scheduler replicas field is not used by the Helm Chart

SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models

https://www.seldon.io/tech/products/core/

(helm-charts): Scheduler replicas field is not used by the Helm Chart #5138

Open SDJustus opened 1 year ago

SDJustus commented 1 year ago

Describe the bug

Inside the seldon-runtime Helm chart, the scheduler replicas value is not propagated to the chart.

Expected behaviour

The scheduler replica count set in the SeldonRuntime Helm chart should be used.

agrski commented 1 year ago

I agree this is a bug (without having yet reproduced it), but actually in the sense that the scheduler should only ever have zero or one replicas: 1 for normal operation, 0 for testing purposes. The scheduler is stateful and not designed for distributed ownership or co-ordination of resources.

Likewise, Hodometer should only ever have 1 replica if enabled, and otherwise be disabled.
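
The constraint described here can be sketched as a small validation check. This is illustrative Python only, not code from seldon-core; the function name `validate_singleton_replicas` and the component labels passed to it are hypothetical.

```python
def validate_singleton_replicas(component: str, replicas: int) -> int:
    """Return replicas unchanged if it is 0 or 1, otherwise raise ValueError.

    Hypothetical helper: it models the rule that a stateful component
    without distributed coordination may run at most one replica.
    """
    if replicas not in (0, 1):
        raise ValueError(
            f"{component}: replicas must be 0 or 1, got {replicas}"
        )
    return replicas
```

Under this sketch, a request for 2 scheduler replicas would be rejected up front rather than silently ignored.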

SDJustus commented 1 year ago

Alright, thanks for the information. So am I right in assuming that zero downtime during something like an EKS or GKE upgrade (e.g. AMI image updates) is not possible out of the box, since the scheduler can only run 1 replica at a time? Or is the scheduler not needed for executing inferences originating from the seldon-mesh service?

agrski commented 1 year ago

The latter -- the scheduler is a control-plane-only component and is not involved in inferencing.

The rest of the system should continue to operate if the scheduler is temporarily unavailable, for example during a rollout or due to a node going down, but you'd be unable to schedule or unschedule any models or pipelines until it was back.

SDJustus commented 1 year ago

OK, thanks for the quick response. Should replicas be configurable at all, then, when only 0 or 1 replicas are allowed and 0 is viable only for testing purposes?

agrski commented 1 year ago

Personally, I think it makes sense to remove replicas and leave disable for controlling whether that component is present or not.
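
One way to picture this proposal is that the replica count becomes a derived value rather than a user-facing setting. Below is a minimal sketch in Python, assuming a boolean `disable` flag per component; the `ComponentConfig` class and its `replicas` property are hypothetical and do not reflect the actual chart or operator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentConfig:
    """Hypothetical per-component settings with no user-facing replicas field."""

    name: str
    disable: bool = False

    @property
    def replicas(self) -> int:
        # Derived, not configurable: a disabled component runs zero
        # replicas, an enabled one runs exactly one.
        return 0 if self.disable else 1


scheduler = ComponentConfig(name="scheduler")
hodometer = ComponentConfig(name="hodometer", disable=True)
print(scheduler.replicas, hodometer.replicas)  # prints: 1 0
```

With this shape there is no way to request more than one replica, so the invalid configuration cannot be expressed at all.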