awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Upgrade MPI operator and k8s config templates #1036

Closed: tejaschumbalkar closed this pull request 4 years ago

tejaschumbalkar commented 4 years ago

Issue #, if available: Issue #1022

Description of changes:

  1. Installing the MPI and MXNet operators from the Kubeflow repository (#1034).
  2. Migrating the MPIJob template to the v1alpha2 API version supported by the MPI operator.
  3. Reverting #974, as Horovod jobs will be supported.
  4. Fix for #1037.
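
For reviewers unfamiliar with the v1alpha2 layout, here is a minimal sketch of the shape the migrated MPIJob template takes, written as a Python dict rather than the actual template from this PR. The helper name `build_mpijob`, the container names, and the example image and command are hypothetical and are not taken from the Anubis code base. The `apiVersion`, `slotsPerWorker`, `cleanPodPolicy`, and `mpiReplicaSpecs` (`Launcher` / `Worker`) fields follow the MPI operator's v1alpha2 schema.

```python
import json


def build_mpijob(name, image, command, workers, slots_per_worker=1):
    """Return an MPIJob manifest (kubeflow.org/v1alpha2) as a plain dict.

    Hypothetical helper for illustration only; the real template in this
    repository is generated from the TOML descriptor.
    """

    def pod_template(container_name, with_command):
        # Only the launcher runs the MPI command; workers wait to be driven.
        container = {"name": container_name, "image": image}
        if with_command:
            container["command"] = list(command)
        return {"spec": {"containers": [container]}}

    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": slots_per_worker,
            "cleanPodPolicy": "Running",
            # v1alpha2 groups the pods into named replica specs.
            "mpiReplicaSpecs": {
                "Launcher": {
                    "replicas": 1,
                    "template": pod_template("launcher", True),
                },
                "Worker": {
                    "replicas": workers,
                    "template": pod_template("worker", False),
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values, chosen only to show the resulting structure.
    manifest = build_mpijob(
        name="example-horovod-job",
        image="example.invalid/horovod:latest",
        command=["mpirun", "python", "train.py"],
        workers=2,
    )
    print(json.dumps(manifest, indent=2))
```

The main structural point is that the launcher and the workers each get their own replica spec under `mpiReplicaSpecs`, with per-worker slot counts set through `slotsPerWorker`.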

Testing:

Tested the new MPIJob configuration on local EKS cluster as well as on the Anubis pipeline.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

stsukrov commented 4 years ago

This PR should be split into 3-4 parts to make reviewing easier.

stsukrov commented 4 years ago

> Tested the new MPIJob configuration on local EKS cluster as well as on the Anubis pipeline.

Were unit/integration tests added? Is your change covered by existing tests?

Please consider manual testing as a last resort.

tejaschumbalkar commented 4 years ago

> > Tested the new MPIJob configuration on local EKS cluster as well as on the Anubis pipeline.
>
> Unit/Integration tests added? Is your change covered by existing tests?
>
> Please consider manual testing as the last resort.

The new code is covered by the existing test cases. I also verified that the new MPIJob k8s template generated from the TOML runs on the Anubis EKS cluster.

tejaschumbalkar commented 4 years ago

Separating the changes out into multiple PRs.