awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

[Baictl] Kubeflow Install Failure #541

Closed Chancebair closed 5 years ago

Chancebair commented 5 years ago

Running ./baictl-infrastructure create on any account in us-west-2 fails while installing the Kubeflow operators:

==> Installing kubeflow operators 1562248567929
+ _check_kubeflow_app_config 1562248567929
+ local kubeflow_config_path=/root/.bai/kubeflow-ks-app/ks_app/app.yaml 1562248567929
+ [[ ! -f /root/.bai/kubeflow-ks-app/ks_app/app.yaml ]] 1562248567929
+ echo 'Kubeflow is not initialized yet' 1562248567929
+ rm -rf /root/.bai/kubeflow-ks-app 1562248567929
Kubeflow is not initialized yet 1562248567929
+ return 0 1562248567930
+ mkdir /root/.bai/kubeflow-ks-app 1562248567930
+ cd /root/.bai/kubeflow-ks-app 1562248567931
+ export KUBECONFIG=/root/.bai/kubeconfig 1562248567931
+ KUBECONFIG=/root/.bai/kubeconfig 1562248567931
+ ks init ks_app 1562248567931
level=info msg="Using context \"eks_benchmark-cluster\" from kubeconfig file \"/root/.bai/kubeconfig\"" 1562248568217
level=info msg="Creating environment \"default\" with namespace \"default\", pointing to \"version:v1.13.7\" cluster at address \"https://84F650F9D0584E4A8C6F85E24FF42FCD.sk1.us-west-2.eks.amazonaws.com\"" 1562248568593
level=info msg="Generating ksonnet-lib data at path '/root/.bai/kubeflow-ks-app/ks_app/lib/ksonnet-lib/v1.13.7'"    1562248576417
-> Creating kubeflow namespace 1562248576808
+ cd ks_app 1562248576808
+ echo '-> Creating kubeflow namespace' 1562248576808
+ __create_kubeflow_namespace 1562248576808
+ cat 1562248576808
+ kubectl --kubeconfig=/root/.bai/kubeconfig apply -f - 1562248576808
namespace/kubeflow created 1562248577715
-> Creating kubeflow environment 1562248577718
+ echo '-> Creating kubeflow environment' 1562248577718
+ ks registry describe kubeflow 1562248577718
level=error msg="registry \"kubeflow\" doesn't exist" 1562248578010
+ ks registry add kubeflow https://github.com/kubeflow/kubeflow/tree/v0.4.1/kubeflow 1562248578020
+ ks env describe kubeflow 1562248579447
level=error msg="environment \"kubeflow\" was not found" 1562248579709
+ ks env add kubeflow --namespace kubeflow 1562248579710
level=info msg="Using context \"eks_benchmark-cluster\" from kubeconfig file \"/root/.bai/kubeconfig\"" 1562248580010
level=info msg="Creating environment \"kubeflow\" with namespace \"kubeflow\", pointing to \"version:v1.13.7\" cluster at address \"https://84F650F9D0584E4A8C6F85E24FF42FCD.sk1.us-west-2.eks.amazonaws.com\"" 1562248580302
-> Applying kubeflow stuff 1562248588012
+ echo '-> Applying kubeflow stuff' 1562248588012
+ ks pkg list --installed -o table 1562248588014
+ grep mpi-job 1562248588014
+ xargs 1562248588014
+ awk '{print $2}' 1562248588014
+ ks pkg install kubeflow/mpi-job 1562248588312
level=info msg="Retrieved 6 files" 1562248590720
+ ks pkg list --installed -o table 1562248590731
+ grep mxnet-job 1562248590731
+ xargs 1562248590731
+ awk '{print $2}' 1562248590731
+ ks pkg install kubeflow/mxnet-job 1562248591013
level=info msg="Retrieved 5 files" 1562248593265
+ awk '{print $1}' 1562248593269
+ ks component list 1562248593269
+ xargs 1562248593269
+ grep mpi-operator 1562248593269
+ ks generate mpi-operator mpi-operator 1562248593516
level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mpi-operator.jsonnet'" 1562248593815
+ xargs 1562248593817
+ awk '{print $1}' 1562248593817
+ ks component list 1562248593818
+ grep mxnet-operator 1562248593818
+ ks generate mxnet-operator mxnet-operator 1562248594205
level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mxnet-operator.jsonnet'" 1562248594508
+ ks apply kubeflow --component mpi-operator 1562248594509
/work/baictl/drivers/aws/baidriver: line 234: 834 Killed ks apply kubeflow --component mpi-operator 1562248608722
+ local exit_code=137 1562248608722
+ [[ 137 != 0 ]] 1562248608722
+ echo '[ERROR] Failed with exit code: 137' 1562248608722
+ return 1 1562248608722
+ return 1 1562248608722
[ERROR] Failed with exit code: 137 1562248608722
Usage: baictl [verb] [object] [options] 1562248608722
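
For context on the exit code above: by shell convention, a status above 128 means the process was terminated by a signal whose number is the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" message printed by bash. In a container, a SIGKILL like this is often, though not always, the kernel's out-of-memory killer; that is an assumption here, not something the log confirms. A minimal sketch of the decoding, where describe_exit_code is a hypothetical helper and not part of baictl:

```shell
#!/bin/sh
# Hypothetical helper (not part of baictl): explain a shell exit status.
# Statuses 129..159 are treated as "terminated by signal (status - 128)".
describe_exit_code() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ] && [ "$code" -le 159 ]; then
        sig=$((code - 128))
        # Small lookup table for a few common signals; others print as numbers only.
        case "$sig" in
            2)  name="SIGINT" ;;
            9)  name="SIGKILL" ;;
            11) name="SIGSEGV" ;;
            15) name="SIGTERM" ;;
            *)  name="unnamed here" ;;
        esac
        echo "killed by signal $sig ($name)"
    else
        echo "exited with status $code"
    fi
}

describe_exit_code 137
```

If the out-of-memory assumption holds, the fix is on the container side (more memory for the task) rather than in the ks invocation itself.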

marcoabreu commented 5 years ago

https://github.com/MXNetEdge/benchmark-ai/blob/fb1a5629869fee04b979cad49427a490bbda9808/baictl/drivers/aws/baidriver#L257

This step seems to be the one that is failing.

Chancebair commented 5 years ago

From baidriver, with --verbose --dry-run added to the first ks apply, followed by the resulting output:

ks apply kubeflow --component mpi-operator --verbose --dry-run && \
        ks apply kubeflow --component mxnet-operator
    )
==> Installing kubeflow operators
 + _check_kubeflow_app_config
 Kubeflow is not initialized yet
 + local kubeflow_config_path=/root/.bai/kubeflow-ks-app/ks_app/app.yaml
 + [[ ! -f /root/.bai/kubeflow-ks-app/ks_app/app.yaml ]]
 + echo 'Kubeflow is not initialized yet'
 + rm -rf /root/.bai/kubeflow-ks-app
 + return 0
 + mkdir /root/.bai/kubeflow-ks-app
 + cd /root/.bai/kubeflow-ks-app
 + export KUBECONFIG=/root/.bai/kubeconfig
 + KUBECONFIG=/root/.bai/kubeconfig
 + ks init ks_app
 level=info msg="Using context "eks_benchmark-cluster" from kubeconfig file "/root/.bai/kubeconfig""
 level=info msg="Creating environment "default" with namespace "default", pointing to "version:v1.13.7" cluster at address "https://84F650F9D0584E4A8C6F85E24FF42FCD.sk1.us-west-2.eks.amazonaws.com""
 level=info msg="Generating ksonnet-lib data at path '/root/.bai/kubeflow-ks-app/ks_app/lib/ksonnet-lib/v1.13.7'"
 + cd ks_app
 + echo '-> Creating kubeflow namespace'
 -> Creating kubeflow namespace
 + kubectl --kubeconfig=/root/.bai/kubeconfig apply -f -
 + __create_kubeflow_namespace
 + cat
 namespace/kubeflow unchanged
 + echo '-> Creating kubeflow environment'
 -> Creating kubeflow environment
 + ks registry describe kubeflow
 level=error msg="registry "kubeflow" doesn't exist"
 + ks registry add kubeflow https://github.com/kubeflow/kubeflow/tree/v0.4.1/kubeflow
 + ks env describe kubeflow
 level=error msg="environment "kubeflow" was not found"
 + ks env add kubeflow --namespace kubeflow
 level=info msg="Using context "eks_benchmark-cluster" from kubeconfig file "/root/.bai/kubeconfig""
 level=info msg="Creating environment "kubeflow" with namespace "kubeflow", pointing to "version:v1.13.7" cluster at address "https://84F650F9D0584E4A8C6F85E24FF42FCD.sk1.us-west-2.eks.amazonaws.com""
 + echo '-> Applying kubeflow stuff'
 -> Applying kubeflow stuff
 + ks pkg list --installed -o table
 + grep mpi-job
 + xargs
 + awk '{print $2}'
 + ks pkg install kubeflow/mpi-job
 level=info msg="Retrieved 6 files"
 + awk '{print $2}'
 + ks pkg list --installed -o table
 + xargs
 + grep mxnet-job
 + ks pkg install kubeflow/mxnet-job
 level=info msg="Retrieved 5 files"
 + ks component list
 + xargs
 + awk '{print $1}'
 + grep mpi-operator
 + ks generate mpi-operator mpi-operator
 level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mpi-operator.jsonnet'"
 + awk '{print $1}'
 + grep mxnet-operator
 + xargs
 + ks component list
 + ks generate mxnet-operator mxnet-operator
 level=info msg="Writing component at '/root/.bai/kubeflow-ks-app/ks_app/components/mxnet-operator.jsonnet'"
 + ks apply kubeflow --component mpi-operator --verbose --dry-run
 level=debug msg="setting log verbosity" verbosity-level=1
 level=debug msg="loading application configuration from /root/.bai/kubeflow-ks-app/ks_app"
 level=debug msg="loading schema version 0.3.0"
 level=debug msg="loading overrides from /root/.bai/kubeflow-ks-app/ks_app"
 level=debug msg="loading overrides from /root/.bai/kubeflow-ks-app/ks_app"
 level=debug msg="Validating deployment at 'kubeflow' with server '[https://84f650f9d0584e4a8c6f85e24ff42fcd.sk1.us-west-2.eks.amazonaws.com]'"
 level=debug msg="Overwriting --cluster flag with 'eks_benchmark-cluster'"
 level=debug msg="Overwriting --namespace flag with 'kubeflow'"
 level=debug msg="creating ks pipeline for environment "kubeflow""
 level=debug msg="building objects" action=pipeline module-name=/
 level=debug msg="jsonnet evaluate snippet" elapsed="395.162µs" name=params.libsonnet
 level=debug msg="jsonnet evaluate snippet" elapsed=1.669682ms name=applyGlobals
 level=debug msg="jsonnet evaluate snippet" elapsed=79.860162ms name=modularize-params
 level=debug msg="jsonnet evaluate snippet" elapsed=2.299804ms name=params-for-module
 level=debug msg="jsonnet evaluate snippet" elapsed="279.929µs" name=/root/.bai/kubeflow-ks-app/ks_app/environments/kubeflow/params.libsonnet
 level=debug msg="preparing package /root/.bai/kubeflow-ks-app/ks_app/vendor/kubeflow/mpi-job@797bcb7407a589bacc35b9624120f51f36a83468->/tmp/ksvendor531599041/kubeflow/mpi-job" action=env.revendorPackages
 level=debug msg="preparing package /root/.bai/kubeflow-ks-app/ks_app/vendor/kubeflow/mxnet-job@797bcb7407a589bacc35b9624120f51f36a83468->/tmp/ksvendor531599041/kubeflow/mxnet-job" action=env.revendorPackages
 /work/baictl/drivers/aws/baidriver: line 234: 552 Killed ks apply kubeflow --component mpi-operator --verbose --dry-run
 [ERROR] Failed with exit code: 137
 + local exit_code=137
 + [[ 137 != 0 ]]
 + echo '[ERROR] Failed with exit code: 137'
 + return 1
 + return 1
 Usage: baictl [verb] [object] [options]

Chancebair commented 5 years ago

When I run the ks commands locally, I get:

ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/n6/lkbqfq0506vdyvqp6qbmyrsrmtf9fx/T/ksonnet-mergepatch148807747": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

But I'm not sure why that isn't showing up in the logs.
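
One way to narrow down the "no matches for kind" error is to check whether the cluster that the local context points at actually serves apiextensions.k8s.io/v1beta1. The sketch below assumes kubectl is installed and uses its api-versions subcommand, which prints one group/version per line; serves_api_version is a hypothetical helper name, not something from baictl.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the group/version given as $1 appears
# as an exact, whole line on stdin (the format `kubectl api-versions` prints).
serves_api_version() {
    # -F: fixed string (dots are literal), -x: whole-line match, -q: quiet
    grep -qxF -- "$1"
}

# Usage against a real cluster (assumes kubectl and a valid kubeconfig):
#   kubectl --kubeconfig="$HOME/.bai/kubeconfig" api-versions \
#       | serves_api_version apiextensions.k8s.io/v1beta1 \
#       && echo "served" || echo "not served"
```

If the version is not served, the local context may be pointing at a different cluster than the one the container uses, which would also explain why the message is absent from the container logs. This is a guess, not a confirmed diagnosis.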

Chancebair commented 5 years ago

When I run the script locally, the error does not occur. Somehow it only happens when the script runs in a container.
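
Since the failure only appears inside a container and the process is killed with exit code 137, one thing worth comparing between the two environments is the memory limit imposed by the cgroup. The sketch below reads the standard cgroup v2 and v1 limit files; container_memory_limit is a hypothetical helper, and the idea that memory is the culprit is an assumption rather than something the logs prove.

```shell
#!/bin/sh
# Hypothetical helper: print the memory limit of the current cgroup.
# $1 (optional) is the cgroup mount point; it defaults to /sys/fs/cgroup.
container_memory_limit() {
    root="${1:-/sys/fs/cgroup}"
    if [ -r "$root/memory.max" ]; then
        # cgroup v2: a byte count, or the literal string "max" for no limit
        cat "$root/memory.max"
    elif [ -r "$root/memory/memory.limit_in_bytes" ]; then
        # cgroup v1: a byte count (a very large number means no limit)
        cat "$root/memory/memory.limit_in_bytes"
    else
        echo "unknown"
    fi
}

container_memory_limit
```

Running this both locally and inside the container shows whether the container has a noticeably tighter limit than the local machine.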

Chancebair commented 5 years ago

We will no longer use ECS and are switching to the CodeBuild pipeline approach instead.