FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

When creating a cluster with kubefate, the job gets stuck at checkout Cluster status #882

Open gaohan1996 opened 1 year ago

gaohan1996 commented 1 year ago

minikube: 1.20.0, docker: 23.0.1, kubefate: 1.4.4, fate: 1.8.0, chartVersion: v1.8.0, centos: 7.9

Job status:

[root@localhost kubefate]# kubefate job ls
UUID                                  CREATOR  METHOD          STATUS   STARTTIME            CLUSTERID                             AGE
c8c9ccfc-6efa-4cca-b3e6-a265090d0165  admin    ClusterInstall  Running  2023-04-26 09:10:55  ec262039-6cc9-4cab-9981-cdb4c863c473  43m

[root@localhost kubefate]# kubefate job describe c8c9ccfc-6efa-4cca-b3e6-a265090d0165
UUID       c8c9ccfc-6efa-4cca-b3e6-a265090d0165
StartTime  2023-04-26 09:10:55
EndTime    0001-01-01 00:00:00
Duration   37m
Status     Running
Creator    admin
ClusterId  ec262039-6cc9-4cab-9981-cdb4c863c473
States     - update job status to Running

owlet42 commented 1 year ago

Go into the cluster and check the pod status with kubectl. In most cases the cause is a failed image download.
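A minimal sketch of the pod check suggested above. The helper name `not_running_pods` is hypothetical and is not part of kubectl or kubefate. It assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...) and prints every pod whose STATUS is neither Running nor Completed.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and
# prints "namespace/pod status" for each pod that is not Running/Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage (run on a machine with access to the cluster):
#   kubectl get pods -A | not_running_pods
```

An empty result means every pod reports Running or Completed. A status such as ImagePullBackOff or ErrImagePull would point to the image problem mentioned above.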

gaohan1996 commented 1 year ago

@owlet42 I need to do an offline deployment in a development environment, and I have already uploaded all of the required images:

REPOSITORY                            TAG            IMAGE ID      CREATED        SIZE
hub.c.163.com/federatedai/eggroll     1.8.0-release  451ed4390a89  12 months ago  2.14GB
hub.c.163.com/federatedai/fateboard   1.8.0-release  7096ddfc141f  12 months ago  193MB
hub.c.163.com/federatedai/python      1.8.0-release  1c074a73ee85  12 months ago  2.03GB
hub.c.163.com/federatedai/client      1.8.0-release  148f1ce597ba  12 months ago  1.54GB
hub.c.163.com/federatedai/mysql       8.0.28         d1dc36cf8d9e  15 months ago  519MB
hub.c.163.com/federatedai/fluentd     v1.12          b55a9763c6e7  2 years ago    47.5MB
hub.c.163.com/federatedai/mariadb     10             cbbff8572fa8  2 years ago    406MB

However, the "checkout Cluster status timeout" error still occurs.

owlet42 commented 1 year ago

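@gaohan1996 After deploying, check the status of all pods to see whether any pod keeps failing to run or cannot download its image, then look into the specific problem.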
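One way to narrow this down, as a sketch only: the helper `image_pull_failures` below is a hypothetical name, not a kubectl or kubefate command. It assumes the default `kubectl get pods -A` columns and selects pods whose STATUS contains one of the image-pull reasons ErrImagePull, ImagePullBackOff, or ErrImageNeverPull, printing the namespace and pod name so they can be passed to `kubectl describe pod`.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and
# prints "namespace pod" for pods whose STATUS indicates an image pull problem.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
image_pull_failures() {
  awk 'NR > 1 && $4 ~ /ErrImagePull|ImagePullBackOff|ErrImageNeverPull/ { print $1, $2 }'
}

# Usage (run on a machine with access to the cluster):
#   kubectl get pods -A | image_pull_failures |
#     while read -r ns pod; do kubectl describe pod "$pod" -n "$ns"; done
```

The Events section at the end of the `kubectl describe pod` output normally names the image that could not be pulled. For pods that start but keep failing, `kubectl logs <pod> -n <namespace>` is the usual next step.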