awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Kubeflow can't locate server #256

Open marcoabreu opened 5 years ago

marcoabreu commented 5 years ago

When calling baictl create infra from scratch, the following error gets printed but doesn't interrupt the process:

==> Installing kubeflow operators
-> Kubeflow should be already installed, re-applying configuration
ERROR Attempting to deploy to environment 'kubeflow' at 'https://468209DA548272EBB3A9424948EF73D0.sk1.us-west-2.eks.amazonaws.com', but cannot locate a server at that address
[ERROR] Failed with exit code: 1

In turn, the following error is printed later on:

==> Validating infrastructure
MPI Job is presentError from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mpijobs.kubeflow.org" not found
...FAILED
MXNET Job is presentError from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "mxjobs.kubeflow.org" not found
...FAILED

marcoabreu commented 5 years ago

It seems that baictl destroy infra does not delete ~/.bai/kubeflow-ks-app, so _install_kubeflow_operators assumes that Kubeflow is already installed and only re-applies the configuration.

marcoabreu commented 5 years ago

After deleting that directory, everything passes.
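
For anyone hitting the same thing, the workaround can be sketched as a small shell helper. This is only a sketch: the function name remove_stale_kubeflow_app and the optional base-directory argument are made up for illustration and are not part of baictl; the only detail taken from this thread is the ~/.bai/kubeflow-ks-app path.

```shell
# Hypothetical helper (not part of baictl): removes the leftover
# ksonnet app directory that "baictl destroy infra" appears to leave behind.
remove_stale_kubeflow_app() {
    # Optional first argument overrides the base directory.
    # This parameter is an assumption for testing; the default is ~/.bai.
    bai_home="${1:-$HOME/.bai}"
    app_dir="$bai_home/kubeflow-ks-app"

    if [ -d "$app_dir" ]; then
        rm -rf -- "$app_dir"
        echo "removed $app_dir"
    else
        echo "nothing to remove at $app_dir"
    fi
}
```

Calling remove_stale_kubeflow_app with no arguments after baictl destroy infra, and before the next baictl create infra, should have the same effect as deleting the directory by hand.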