AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

XPK cleanup: integ tests and code cleanup #121

Open Obliviour opened 7 months ago

Obliviour commented 7 months ago

Improvements:

  1. Use an enum for the accelerator type
  2. Stop checking --tpu-type / --device-type throughout the code to determine the device type; resolve it in one place
  3. Remove the H100-specific workflow in favor of GPU-specific workflows
  4. Add a nightly test for xpk help commands
  5. Add nightly tests for autoscaling with xpk
  6. Remove outdated future steps
  7. Add system: SystemCharacteristics to argument types and return values
  8. add_env_config now returns an int instead of raising exceptions, keeping its control flow consistent with other functions in xpk
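
For readers who have not opened the diff, items 1, 2, 7 and 8 can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the enum members, the fields of SystemCharacteristics, the example device names, the lookup table, and both function signatures are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AcceleratorType(Enum):
    """Item 1: an enum instead of ad hoc strings (members are assumed)."""
    TPU = 1
    GPU = 2
    CPU = 3


@dataclass(frozen=True)
class SystemCharacteristics:
    """Item 7: one object passed around instead of raw flag values.

    The fields here are hypothetical; the real class may carry more.
    """
    device_type: str
    accelerator_type: AcceleratorType


# Hypothetical lookup table with illustrative entries only.
_SYSTEMS = {
    "v4-8": SystemCharacteristics("v4-8", AcceleratorType.TPU),
    "h100-80gb-8": SystemCharacteristics("h100-80gb-8", AcceleratorType.GPU),
}


def get_system_characteristics(tpu_type, device_type):
    """Item 2: resolve --tpu-type / --device-type in exactly one place.

    Returns (system, return_code); return_code is 0 on success.
    """
    name = tpu_type if tpu_type is not None else device_type
    system = _SYSTEMS.get(name)
    if system is None:
        return None, 1
    return system, 0


def add_env_config(env, system):
    """Item 8: report failure through an int return code, not an exception.

    The (env, system) signature is an assumption for this sketch.
    """
    if system.accelerator_type is AcceleratorType.GPU:
        env["ACCELERATOR"] = "gpu"  # GPU-generic branch, no H100 special case
    elif system.accelerator_type is AcceleratorType.TPU:
        env["ACCELERATOR"] = "tpu"
    else:
        return 1
    return 0
```

A caller would resolve the system once, check the return code, and pass the resulting object to every later step, so no downstream function needs to look at --tpu-type or --device-type again.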

Testing:

Integration tests