Scanflow Support Schedulers

bsc-scanflow / scanflow

Scanflow-Kubernetes is a platform to simplify MLOps.

MIT License


Scanflow Support Schedulers #7

Open peiniliu opened 3 years ago

peiniliu commented 3 years ago

ML workflows described with Scanflow need to be deployed on K8s clusters. To improve the performance of each workflow job or service, this issue requests that Scanflow be able to use either the default scheduler or the enhanced Volcano batch scheduler to deploy workflows on the cluster intelligently.

Describe the solution you'd like

Use the Volcano batch scheduler to deploy ML workflows (Argo workflows - jobs)
Use the default K8s scheduler to deploy ML services (Seldon workflows - services)

[Question: how do we support multiple schedulers? As far as we currently know, we need to tell each scheduler which nodes it can operate on.]

Policies

Support affinity-aware scheduling
Support proportion of resources for node
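On the multi-scheduler question: Kubernetes selects a scheduler per pod through the `spec.schedulerName` field, so one possible approach is for Scanflow to set that field according to the workload type. The sketch below is only an illustration, not Scanflow code. The `SCHEDULER_BY_KIND` mapping and the `pod_manifest` helper are hypothetical names, and it assumes Volcano is installed under its default scheduler name `volcano`.

```python
import json

# Hypothetical mapping from workload kind to scheduler name (an assumption
# for illustration). "default-scheduler" is the stock Kubernetes scheduler;
# "volcano" assumes a default Volcano installation.
SCHEDULER_BY_KIND = {
    "job": "volcano",                # batch jobs, e.g. Argo workflow steps
    "service": "default-scheduler",  # long-running services, e.g. Seldon
}


def pod_manifest(name, image, kind):
    """Build a minimal Pod manifest whose scheduler depends on `kind`."""
    if kind not in SCHEDULER_BY_KIND:
        raise ValueError(f"unknown workload kind: {kind!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pods without this field go to the default scheduler.
            "schedulerName": SCHEDULER_BY_KIND[kind],
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Image names are placeholders.
    print(json.dumps(pod_manifest("train-step", "example/train:latest", "job"), indent=2))
```

This only routes pods to a scheduler. It does not by itself restrict which nodes each scheduler operates on, which is the open part of the question.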

peiniliu commented 3 years ago

https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md

peiniliu commented 3 years ago

https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md

j-guitart commented 3 years ago

https://github.com/kubernetes-sigs/scheduler-plugins

j-guitart commented 3 years ago

Scheduling features that we might apply/consider:

Affinity and anti-affinity: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
Pod Priority and Preemption: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
Resource Limits: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
Gang scheduling/Coscheduling: see Volcano and https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md
Topology-aware scheduling: see Volcano and https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md
Scheduling considering actual resource utilization instead of resource allocation: https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/trimaran/README.md
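As a rough sketch of how the first three features in this list map onto a pod spec, the helper below combines pod anti-affinity, a priority class, and CPU requests/limits. The function name, its parameters, and the example values are hypothetical; the field names follow the Kubernetes Pod API. Gang scheduling, topology-aware scheduling, and Trimaran are configured on the scheduler side and are not covered here.

```python
def pod_spec_with_hints(name, image, app_label, priority_class,
                        cpu_request, cpu_limit):
    """Return a Pod spec dict carrying affinity, priority, and CPU settings.

    All argument values are supplied by the caller; nothing here is a
    Scanflow default.
    """
    return {
        # Must name an existing PriorityClass object in the cluster.
        "priorityClassName": priority_class,
        "affinity": {
            "podAntiAffinity": {
                # Soft rule: prefer not to co-locate pods sharing the same
                # "app" label on one node.
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "weight": 100,
                        "podAffinityTerm": {
                            "labelSelector": {"matchLabels": {"app": app_label}},
                            "topologyKey": "kubernetes.io/hostname",
                        },
                    }
                ]
            }
        },
        "containers": [
            {
                "name": name,
                "image": image,
                # Requests drive scheduling decisions; limits cap usage at runtime.
                "resources": {
                    "requests": {"cpu": cpu_request},
                    "limits": {"cpu": cpu_limit},
                },
            }
        ],
    }


if __name__ == "__main__":
    import json
    spec = pod_spec_with_hints("predictor", "example/predictor:latest",
                               "predictor", "high-priority", "500m", "1")
    print(json.dumps(spec, indent=2))
```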

j-guitart commented 3 years ago

For ML services, we might additionally consider:

Horizontal Pod Autoscaling: https://keda.sh/
Health checks: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
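For a service, these two items could look roughly like the sketch below: a container with HTTP liveness and readiness probes, and a KEDA `ScaledObject` that scales a Deployment on CPU utilization. The helper names, the `/health` path, the port, and the thresholds are all assumptions for illustration; the actual probe endpoints depend on the serving image.

```python
def service_container(name, image, port, health_path="/health"):
    """Container spec with liveness and readiness probes (hypothetical endpoint)."""
    probe = {"httpGet": {"path": health_path, "port": port}}
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Liveness: restart the container if it stops responding.
        "livenessProbe": {**probe, "initialDelaySeconds": 10, "periodSeconds": 10},
        # Readiness: only route traffic once the endpoint answers.
        "readinessProbe": {**probe, "initialDelaySeconds": 5, "periodSeconds": 5},
    }


def keda_scaled_object(deployment, min_replicas, max_replicas, cpu_percent):
    """KEDA ScaledObject targeting a Deployment, using the CPU scaler."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "cpu",
                    "metricType": "Utilization",
                    # KEDA expects trigger metadata values as strings.
                    "metadata": {"value": str(cpu_percent)},
                }
            ],
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(service_container("predictor", "example/predictor:latest", 8080), indent=2))
    print(json.dumps(keda_scaled_object("predictor", 1, 5, 60), indent=2))
```

The CPU scaler relies on CPU requests being set on the target pods, which ties in with the resource limits item above.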