awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Custom toml file for EKS performance test for TF is not running #1025

Open TusharKanekiDey opened 4 years ago

TusharKanekiDey commented 4 years ago

I am trying to run the EKS performance test for TF 2.x using this custom toml file: tf2.1.txt

This script ran properly on May 13th, but it has not run since.

I get this error from kubectl logs for the benchmark container, despite purging a lot of idle pods:

(base) f8ffc23e5335:benchmark-ai tshdy$ kubectl logs b-55c8fda6-1b69-4a35-9198-fa23e7ae65c5-launcher-7ldbm benchmark
Error from server (BadRequest): container "benchmark" in pod "b-55c8fda6-1b69-4a35-9198-fa23e7ae65c5-launcher-7ldbm" is waiting to start: PodInitializing

I get this from kubectl logs for the kubectl-delivery container:

(base) f8ffc23e5335:benchmark-ai tshdy$ kubectl logs b-3b4b4ff7-f389-42f8-a32c-54d3aef89d30-launcher-jjlm6 kubectl-delivery
I0517 02:16:24.035807       1 server.go:53] NAMESPACE not set, use default namespace
I0517 02:16:24.035845       1 server.go:63] Scoping operator to namespace default
I0517 02:16:24.035853       1 server.go:67] API Version: v1 Version: v0.1.0 Git SHA: Not provided. Built: Not provided. Go Version: go1.13.6 Go OS/Arch: linux/amd64
I0517 02:16:24.035866       1 server.go:70] Server options: &{Kubeconfig: MasterURL: Threadiness:2 PrintVersion:false Namespace:}
W0517 02:16:24.036046       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
F0517 02:16:24.037029       1 server.go:95] Error open file/etc/mpi/hostfile: open /etc/mpi/hostfile: no such file or directory

Please let me know if any additional information, such as the describe pods output, is required.
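
The kubectl-delivery output above is in klog format, where the first character of each line is the severity (I = info, W = warning, F = fatal). A minimal sketch, assuming that line layout, for pulling out only the fatal lines when comparing logs from several launcher pods. The helper name `fatal_lines` and the regular expression are illustrative and not part of Anubis or the MPI operator:

```python
import re

# Assumed klog layout: severity letter, MMDD, time, thread id, file:line], message
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+)\s+(?P<src>[^\]\s]+)\]\s(?P<msg>.*)$"
)


def fatal_lines(log_text):
    """Return (source, message) for every fatal (F) line in klog-style text."""
    hits = []
    for line in log_text.splitlines():
        m = KLOG_LINE.match(line.strip())
        if m and m.group("sev") == "F":
            hits.append((m.group("src"), m.group("msg")))
    return hits


# Three lines copied from the kubectl-delivery log in this issue.
sample = """\
I0517 02:16:24.035807       1 server.go:53] NAMESPACE not set, use default namespace
W0517 02:16:24.036046       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
F0517 02:16:24.037029       1 server.go:95] Error open file/etc/mpi/hostfile: open /etc/mpi/hostfile: no such file or directory
"""

for src, msg in fatal_lines(sample):
    print(f"{src}: {msg}")
```

On the sample this leaves only the server.go:95 line, the missing /etc/mpi/hostfile error, which is the one fatal entry in the log.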

haohanchen-aws commented 4 years ago

Similar issue here. I used the toml to launch a 4-node distributed job. The launcher pod gets stuck at Init:CrashLoopBackOff, and I get this from kubectl logs for kubectl-delivery:

(base) a483e75de716:pub chehaoha$ kubectl logs b-4c0d1306-584c-49d6-99c4-c7825c584b7c-launcher-5pcfv kubectl-delivery
I0528 19:56:33.433654 1 server.go:53] NAMESPACE not set, use default namespace
I0528 19:56:33.433690 1 server.go:63] Scoping operator to namespace default
I0528 19:56:33.433697 1 server.go:67] [API Version: v1 Version: v0.1.0 Git SHA: Not provided. Built: Not provided. Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0528 19:56:33.433709 1 server.go:70] Server options: &{Kubeconfig: MasterURL: Threadiness:2 PrintVersion:false Namespace:}
W0528 19:56:33.433879 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0528 19:56:33.434846 1 server.go:95] Error open file[/etc/mpi/hostfile]: open /etc/mpi/hostfile: no such file or directory

haohanchen-aws commented 4 years ago

I tried to run mask-rcnn on EKS without Anubis and got a similar error:

(base) a483e75de716:eks chehaoha$ kubectl log maskrcnn-launcher-99cnb kubectl-delivery
log is DEPRECATED and will be removed in a future version. Use logs instead.
I0615 19:35:45.176433 1 server.go:53] NAMESPACE not set, use default namespace
I0615 19:35:45.176473 1 server.go:63] Scoping operator to namespace default
I0615 19:35:45.176481 1 server.go:67] [API Version: v1 Version: v0.1.0 Git SHA: Not provided. Built: Not provided. Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0615 19:35:45.176501 1 server.go:70] Server options: &{Kubeconfig: MasterURL: Threadiness:2 PrintVersion:false Namespace:}
W0615 19:35:45.176687 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0615 19:35:45.177683 1 server.go:95] Error open file[/etc/mpi/hostfile]: open /etc/mpi/hostfile: no such file or directory
(base) a4