Open TusharKanekiDey opened 4 years ago
Similar issue here. I used the toml to launch a 4-node distributed job. The launcher pod gets stuck at Init:CrashLoopBackOff, and I get this when I run kubectl logs kubectl-delivery:
(base) a483e75de716:pub chehaoha$ kubectl logs b-4c0d1306-584c-49d6-99c4-c7825c584b7c-launcher-5pcfv kubectl-delivery
I0528 19:56:33.433654 1 server.go:53] NAMESPACE not set, use default namespace
I0528 19:56:33.433690 1 server.go:63] Scoping operator to namespace default
I0528 19:56:33.433697 1 server.go:67] [API Version: v1 Version: v0.1.0 Git SHA: Not provided. Built: Not provided. Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0528 19:56:33.433709 1 server.go:70] Server options: &{Kubeconfig: MasterURL: Threadiness:2 PrintVersion:false Namespace:}
W0528 19:56:33.433879 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0528 19:56:33.434846 1 server.go:95] Error open file[/etc/mpi/hostfile]: open /etc/mpi/hostfile: no such file or directory
I also tried to run mask-rcnn on EKS without Anubis and got a similar error:
(base) a483e75de716:eks chehaoha$ kubectl log maskrcnn-launcher-99cnb kubectl-delivery
log is DEPRECATED and will be removed in a future version. Use logs instead.
I0615 19:35:45.176433 1 server.go:53] NAMESPACE not set, use default namespace
I0615 19:35:45.176473 1 server.go:63] Scoping operator to namespace default
I0615 19:35:45.176481 1 server.go:67] [API Version: v1 Version: v0.1.0 Git SHA: Not provided. Built: Not provided. Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0615 19:35:45.176501 1 server.go:70] Server options: &{Kubeconfig: MasterURL: Threadiness:2 PrintVersion:false Namespace:}
W0615 19:35:45.176687 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0615 19:35:45.177683 1 server.go:95] Error open file[/etc/mpi/hostfile]: open /etc/mpi/hostfile: no such file or directory
(base) a4
I am trying to run an EKS performance test for TF 2.x using this custom toml file: tf2.1.txt
This script ran properly on May 13th, but it has not run properly since then.
I am getting this error when I run kubectl logs benchmark, despite purging a lot of idle pods.
I am getting this when I run kubectl logs kubectl-delivery
Please let me know if any additional information, such as describe pods output, is required.
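In case it helps with triage, here is a rough sketch of how that extra information could be collected in one go. The function name collect_launcher_diagnostics is hypothetical (it is not part of any tool mentioned in this thread), and the namespace defaults to "default" only because the logs above report that namespace. The container name kubectl-delivery is taken from the commands above.

```shell
# Hypothetical helper (name is an assumption, not an existing tool):
# gathers describe output and init-container logs for a launcher pod.
collect_launcher_diagnostics() {
  pod="$1"
  namespace="${2:-default}"   # assumed default, matching the logs in this thread

  # Pod events and init-container states (shows why Init keeps crashing)
  kubectl describe pod "$pod" --namespace "$namespace"

  # Logs from the current attempt of the kubectl-delivery init container
  kubectl logs "$pod" --container kubectl-delivery --namespace "$namespace"

  # Logs from the previous crashed attempt; ignore the error if there is none
  kubectl logs "$pod" --container kubectl-delivery --namespace "$namespace" --previous || true
}
```

Example usage with the pod name from the comment above: collect_launcher_diagnostics maskrcnn-launcher-99cnb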