
SymbioticLab / FedScale

FedScale is a scalable and extensible open-source federated learning (FL) platform.

https://fedscale.ai

Apache License 2.0


Support GPU jobs in FedScale K8S deployment #187

Closed IKACE closed 2 years ago

IKACE commented 2 years ago

Why are these changes needed?

This PR adds GPU job support to the FedScale K8S deployment, along with some quality-of-life enhancements:

- Support GPU training for FedScale k8s jobs.
- Support GPU time-sharing, so that multiple FedScale k8s jobs can share the same GPU simultaneously (this needs no changes in the FedScale repo; the changes are made solely on the k8s infra).
- Support checking FedScale k8s job progress interactively.
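
For readers unfamiliar with how a k8s job asks for a GPU, here is a minimal sketch. It is not the code from this PR. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs as the extended resource `nvidia.com/gpu`. The helper name `build_gpu_job_manifest`, the job name, the image, and the command are all hypothetical placeholders.

```python
import json


def build_gpu_job_manifest(name, image, command, num_gpus=1):
    """Return a Kubernetes batch/v1 Job manifest (as a dict) that requests GPUs.

    Hypothetical helper for illustration; not part of FedScale.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")

    resources = {}
    if num_gpus > 0:
        # Extended resources such as GPUs are requested under `limits`.
        # `nvidia.com/gpu` assumes the NVIDIA device plugin is installed.
        resources["limits"] = {"nvidia.com/gpu": num_gpus}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training pod
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": list(command),
                            "resources": resources,
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder job name, image, and command; substitute real values.
    manifest = build_gpu_job_manifest(
        name="fedscale-executor",
        image="example/fedscale:latest",
        command=["python", "executor.py"],
        num_gpus=1,
    )
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Passing `num_gpus=0` leaves the resource block empty, which corresponds to the CPU-only regression case. Time-sharing is configured on the cluster side, so a job manifest like this one would not need to change for it.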

Related issue number

Checks

- [x] I've made sure the following tests are passing.
  1. K8S GPU jobs
     - [x] Dry Run (45 training rounds & 3 evaluation rounds)
     - [x] CIFAR-10 (45 training rounds & 3 evaluation rounds)
     - [x] FEMNIST (45 training rounds & 3 evaluation rounds)
  2. Regression 1: K8S CPU jobs
     - [x] Dry Run (45 training rounds & 3 evaluation rounds)
     - [x] CIFAR-10 (45 training rounds & 3 evaluation rounds)
     - [x] FEMNIST (45 training rounds & 3 evaluation rounds)
  3. Regression 2: Original CPU jobs
     - [x] Dry Run (20 training rounds & 1 evaluation round)
     - [x] CIFAR-10 (20 training rounds & 1 evaluation round)
     - [x] FEMNIST (20 training rounds & 1 evaluation round)

fanlai0990 commented 2 years ago

Thanks! Looks good to me.