awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Validating Infrastructure from ECS failed #470

Closed: Chancebair closed this issue 5 years ago

Chancebair commented 5 years ago
==> Validating infrastructure
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found
MPI Job is present...FAILED
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mxjobs.kubeflow.org" not found
MXNET Job is present...FAILED
ZooKeeper is ready...PASSED
Kubectl is present...PASSED
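
Both failed checks point at missing CRDs (mpijobs.kubeflow.org and mxjobs.kubeflow.org). A minimal sketch for pulling the failed check names out of output like the above, assuming every check is reported as `<name>...PASSED` or `<name>...FAILED` as seen here. The function name is hypothetical and not part of baictl.

```python
import re
from typing import List

# Matches lines such as "MPI Job is present...FAILED".
# Assumption: all checks follow the "<name>...<STATUS>" shape seen in this log.
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.\.\.(?P<status>PASSED|FAILED)\s*$")


def failed_checks(output: str) -> List[str]:
    """Return the names of all checks reported as FAILED (hypothetical helper)."""
    failed = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("status") == "FAILED":
            failed.append(match.group("name"))
    return failed


sample = """==> Validating infrastructure
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found
MPI Job is present...FAILED
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mxjobs.kubeflow.org" not found
MXNET Job is present...FAILED
ZooKeeper is ready...PASSED
Kubectl is present...PASSED
"""

print(failed_checks(sample))
```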

Chancebair commented 5 years ago

Apparently we are hitting GitHub's request limits for unauthenticated users when fetching from the kubeflow GitHub repository.

One thing to try is authenticating with a personal access token first: https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line
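
To confirm whether rate limiting is really the cause, GitHub's `/rate_limit` endpoint reports the remaining quota (60 requests per hour unauthenticated, 5,000 with a token). Below is a rough sketch, not anything baictl does today; the helper names are made up for illustration, and reading the token from a `GITHUB_TOKEN` environment variable is an assumption to check against the ks version in use.

```python
import json
import urllib.request
from typing import Optional

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token: Optional[str] = None) -> urllib.request.Request:
    """Build a request for GitHub's rate-limit endpoint (hypothetical helper).

    Without a token the request counts against the unauthenticated quota.
    """
    request = urllib.request.Request(RATE_LIMIT_URL)
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # Personal access tokens are sent in the Authorization header.
        request.add_header("Authorization", f"token {token}")
    return request


def remaining_core_requests(token: Optional[str] = None) -> int:
    """Return how many core API requests are left in the current window.

    Performs a network call, so it is not invoked at import time.
    """
    with urllib.request.urlopen(build_rate_limit_request(token), timeout=10) as response:
        payload = json.load(response)
    return payload["resources"]["core"]["remaining"]
```

Calling `remaining_core_requests(os.environ.get("GITHUB_TOKEN"))` before the install step (with `import os`) would show whether the quota is exhausted; a result of 0 without a token and a large number with one would support the rate-limit theory.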

Chancebair commented 5 years ago

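Looks like a missed failure in _install_kubeflow_operators from baidriver: per the log below, ks apply for the mpi-operator component gets killed (exit code 137), which would explain why the CRDs are missing by the time validation runs.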

==> Installing kubeflow operators
12:52:11 level=info msg="Using context \"eks_benchmark-cluster\" from kubeconfig file \"/root/.bai/kubeconfig\""
12:52:11 level=info msg="Creating environment \"default\" with namespace \"default\", pointing to \"version:v1.12.6\" cluster at address \"https://A66C4AD834C534191070EDAF6D16FAF7.yl4.us-west-2.eks.amazonaws.com\""
12:52:19 level=info msg="Generating ksonnet-lib data at path '/root/.bai/kubeflow-ks-app/ks_app/lib/ksonnet-lib/v1.12.6'"
12:52:20 -> Creating kubeflow namespace
12:52:20 namespace/kubeflow created
12:52:20 -> Applying kubeflow stuff
12:52:22 level=info msg="Using context \"eks_benchmark-cluster\" from kubeconfig file \"/root/.bai/kubeconfig\""
12:52:22 level=info msg="Creating environment \"kubeflow\" with namespace \"kubeflow\", pointing to \"version:v1.12.6\" cluster at address \"https://A66C4AD834C534191070EDAF6D16FAF7.yl4.us-west-2.eks.amazonaws.com\""
12:52:32 level=info msg="Retrieved 6 files"
12:52:34 level=info msg="Retrieved 5 files"
12:52:34 level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mpi-operator.jsonnet'"
12:52:35 level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mxnet-operator.jsonnet'"
12:52:35 -> Kubeflow should be already installed, re-applying configuration
12:52:45 [ERROR] Failed with exit code: 137
12:52:45 /work/baictl/drivers/aws/baidriver: line 222: 770 Killed ks apply kubeflow --component mpi-operator
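
For reference, shells report a process that died from a signal as 128 plus the signal number, so 137 means signal 9 (SIGKILL), which matches the "Killed" message. SIGKILL often comes from the out-of-memory killer, though nothing in this log confirms that. A small sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a shell-style exit code into a readable description.

    Hypothetical helper: shells encode "terminated by signal N" as 128 + N.
    """
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"failed with exit code {code}"


print(describe_exit_code(137))
```

Logging something like this next to the raw exit code would make a killed install step harder to overlook.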

Chancebair commented 5 years ago

Closing as a duplicate of https://github.com/MXNetEdge/benchmark-ai/issues/424