baidu / paddle-on-k8s-operator

Kubernetes operator for managing the lifecycle of PaddlePaddle jobs.
https://www.paddlepaddle.org.cn/
Apache License 2.0

After compiling with the latest go1.13, running the operator reports "Failed to list *v1alpha1.TrainingJob" #21

Open levinxo opened 4 years ago

levinxo commented 4 years ago

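The old image the project currently provides, https://hub.docker.com/r/tizhou86/paddle-on-k8s-operator, hard-codes the image pull policy of the etcd container in the master pod (image m3ngyang/etcd:v3.2.1) to Always, so the operator cannot fully start in a cluster without internet access.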

So I modified the code on the master branch, changing the etcd image pull policy from Always to IfNotPresent, and rebuilt:
go version: 1.13.6 linux/amd64
go build -o paddle-on-k8s-operator ./cmd/operator

The Dockerfile for the image is identical to the one the project provides:
FROM ubuntu:18.04
ADD paddle-on-k8s-operator /usr/local/bin
ENTRYPOINT ["/usr/local/bin/paddle-on-k8s-operator"]

After deploying the operator, the error log is as follows:
E0413 10:36:30.127845 1 reflector.go:205] pkg/mod/k8s.io/client-go@5.0.1+incompatible/tools/cache/reflector.go:99: Failed to list *v1alpha1.TrainingJob: the server could not find the requested resource (get trainingjobs.paddlepaddle.org)

Looking at the project source, the problem seems to arise in an asynchronous goroutine:
./cmd/operator/paddle_operator.go Line 81: go paddleInformer.Start(stopCh)

cwdsuzhou commented 4 years ago

It seems the CRD was not created successfully.
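
If the CRD is indeed missing or not yet served when the informer starts, the first list call fails in the way the log above shows. One possible way to guard against that is to poll until the resource is available before starting the informer. The following is only a minimal sketch of that idea using the Go standard library; `waitUntilServed`, `demoWait`, and the fake probe are hypothetical names introduced for illustration and are not part of this project or of client-go. In a real operator the probe would have to ask the API server whether `trainingjobs.paddlepaddle.org` is served.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotServed is returned when the resource never became available in time.
var errNotServed = errors.New("resource not served before timeout")

// waitUntilServed (hypothetical helper) calls probe every interval until it
// reports true, returns an error, or the timeout would be exceeded.
func waitUntilServed(probe func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := probe()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		// Give up if the next probe would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errNotServed
		}
		time.Sleep(interval)
	}
}

// demoWait simulates a CRD that becomes available after `failures` failed
// probes and returns how many probes were made. The probe here is a stand-in
// (an assumption) for a real query against the API server.
func demoWait(failures int) (int, error) {
	calls := 0
	probe := func() (bool, error) {
		calls++
		return calls > failures, nil
	}
	err := waitUntilServed(probe, time.Millisecond, time.Second)
	return calls, err
}

func main() {
	calls, err := demoWait(2)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Printf("resource served after %d probes; the informer could be started now\n", calls)
}
```

With a guard like this, the `go paddleInformer.Start(stopCh)` call would only run once the probe succeeds. Whether that fits this operator depends on how and where the CRD is supposed to be created, which the thread does not establish.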

levinxo commented 4 years ago

any solutions?