Open SDJustus opened 1 year ago
I agree this is a bug (without having yet reproduced it), but actually in the sense that the scheduler should only ever have zero or one replicas: 1 for normal operation, 0 for testing purposes. The scheduler is stateful and not designed for distributed ownership or co-ordination of resources.
Likewise, Hodometer should only ever have 1 replica if enabled, and otherwise be disabled.
Alright, thanks for the information. So am I right in assuming that zero downtime during something like an EKS or GKE upgrade (e.g. AMI image updates) is not possible out of the box, since the scheduler can only run 1 replica at a time? Or is the scheduler not needed for executing inferences originating from the seldon-mesh service?
The latter: the scheduler is a control-plane-only component and is not involved in inferencing.
The rest of the system should continue to operate if the scheduler is temporarily unavailable, for example during a rollout or due to a node going down, but you'd be unable to schedule or unschedule any models or pipelines until it was back.
Ok, thanks for the quick response... Should the replicas therefore be configurable, when only 0 or 1 replicas are allowed with 0 being only viable for testing purposes?
Personally I think it makes sense to remove replicas and leave disable for controlling whether that component is present or not.
Describe the bug
Inside the seldon-runtime Helm chart, the scheduler replicas value is not propagated to the chart.
Expected behaviour
The replica count set for the scheduler in the SeldonRuntime Helm chart should be used.
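For illustration, the values under discussion might look roughly like the fragment below. This is only a sketch: the key names replicas and disable come from the comments in this thread, but the exact nesting under scheduler and hodometer is an assumption and should be checked against the chart's own values.yaml.

```yaml
# Hypothetical values override for the seldon-runtime chart.
# Key layout is assumed, not copied from the chart.
scheduler:
  disable: false   # keep the scheduler present
  replicas: 1      # per the thread: only 1 (normal) or 0 (testing) is meaningful
hodometer:
  disable: false   # per the thread: either disabled, or exactly 1 replica
  replicas: 1
```

If replicas were removed as suggested above, disable alone would decide whether each component is deployed.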