PKUHPC / SCOW

Super Computing On Web
https://www.pkuscow.com/
Mulan Permissive Software License, Version 2

[Help] Clusters are not displayed after updating #1303

Closed: zhengkang2020 closed this issue 2 weeks ago

zhengkang2020 commented 2 weeks ago

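What happened

After running ./cli compose pull && ./cli compose down && ./cli compose up -d:

1. The home page reports that no clusters are available.
2. Creating a job fails with a 404 error.
3. The cluster management page displays normally.
4. I tried to build the latest version of the adapter, but building it as described in the documentation fails: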

[root@manager scow-slurm-adapter-master]# make build
CGO_BUILD=0 GOARCH=amd64 go build -o scow-slurm-adapter-amd64
# scow-slurm-adapter/gen/go
gen/go/account_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/account_grpc.pb.go:94:41: undefined: grpc.StaticMethod
gen/go/account_grpc.pb.go:104:41: undefined: grpc.StaticMethod
gen/go/account_grpc.pb.go:114:41: undefined: grpc.StaticMethod
gen/go/account_grpc.pb.go:124:41: undefined: grpc.StaticMethod
gen/go/app_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/config_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/job_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/user_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/version_grpc.pb.go:30:16: undefined: grpc.SupportPackageIsVersion8
gen/go/account_grpc.pb.go:124:41: too many errors
make: *** [Makefile:10: build] Error 2

Did it work before?

Yes, everything worked normally before running ./cli compose pull.

Environment

- OS: RockyLinux 9.4
- Scheduler: slurm-23.11.1
- Docker: Docker version 24.0.7, build afdd53b
- Docker-compose: Docker Compose version v2.23.3
- SCOW cli: Version 1.5.2
- SCOW: master
- Adapter: the version updated on Apr 7 (I do not remember the exact version); rebuilding to update to version 1.5 fails with the errors above

zhengkang2020 commented 2 weeks ago

Item 4 is resolved!

## 1. Change the grpc version in go.mod from 1.55 to 1.64
vim go.mod

google.golang.org/grpc v1.64.0

## 2. Update the dependencies
go mod tidy

## 3. Rebuild
make build

However, with the newly built scow-slurm-adapter-amd64, the original problem is still not resolved!
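The undefined grpc.SupportPackageIsVersion8 and grpc.StaticMethod symbols suggest that the generated *_grpc.pb.go files expect a newer google.golang.org/grpc module than the one pinned in go.mod. Below is a minimal sketch of a pre-build check, not part of the adapter's tooling. It assumes go.mod pins the module as "google.golang.org/grpc vX.Y.Z", and it uses 1.64.0 as the threshold only because that version is reported to work in this thread (the real minimum may be lower). The helper names grpc_version, version_ge and check_grpc are hypothetical, and sort -V assumes GNU coreutils.

```shell
#!/usr/bin/env bash

# Print the grpc version pinned in a go.mod file, without the leading "v".
# Assumption: the module appears as "google.golang.org/grpc vX.Y.Z".
grpc_version() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "google.golang.org/grpc") {
        v = $(i + 1); sub(/^v/, "", v); print v; exit
      }
  }' "$1"
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the go.mod at $1 pins grpc at version $2 or newer.
check_grpc() {
  local found
  found="$(grpc_version "$1")"
  if [ -z "$found" ]; then
    echo "grpc not found in $1"
    return 2
  fi
  if version_ge "$found" "$2"; then
    echo "ok: grpc $found"
  else
    echo "too old: grpc $found < $2"
    return 1
  fi
}
```

For example, check_grpc go.mod 1.64.0 && make build would start the build only when the pinned version is at least the one that worked here.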

zhengkang2020 commented 2 weeks ago

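I reconfigured SCOW, and during initialization it reports: "The cluster you are currently accessing is unavailable, or no clusters are available. Please try again later or contact the administrator."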


[root@scow scow]# ./cli compose ps
INFO: Loaded plugins: []
WARN[0000] /scow/docker-compose-1718865568980.yml: `version` is obsolete
NAME                   IMAGE                                         COMMAND                  SERVICE         CREATED              STATUS              PORTS
scow-audit-db-1        mysql:8                                       "docker-entrypoint.s…"   audit-db        About a minute ago   Up About a minute   3306/tcp, 33060/tcp
scow-audit-server-1    mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        audit-server    About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-auth-1            mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        auth            About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-db-1              mysql:8                                       "docker-entrypoint.s…"   db              About a minute ago   Up About a minute   3306/tcp, 33060/tcp
scow-gateway-1         mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        gateway         About a minute ago   Up About a minute   3000/tcp, 0.0.0.0:80->80/tcp, :::80->80/tcp, 5000/tcp
scow-log-1             fluentd:v1.14.0-1.0                           "tini -- /bin/entryp…"   log             About a minute ago   Up About a minute   5140/tcp, 0.0.0.0:24224->24224/tcp, 0.0.0.0:24224->24224/udp, :::24224->24224/tcp, :::24224->24224/udp
scow-mis-server-1      mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        mis-server      About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-mis-web-1         mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        mis-web         About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-novnc-1           ghcr.io/pkuhpc/novnc-client-docker:master     "/docker-entrypoint.…"   novnc           About a minute ago   Up About a minute   80/tcp
scow-portal-server-1   mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        portal-server   About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-portal-web-1      mirrors.pku.edu.cn/pkuhpc-icode/scow:master   "./entrypoint.sh"        portal-web      About a minute ago   Up About a minute   80/tcp, 3000/tcp, 5000/tcp
scow-redis-1           redis:alpine                                  "docker-entrypoint.s…"   redis           About a minute ago   Up About a minute   6379/tcp
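
Every container in the table above reports an Up status, which points away from a crashed service. A minimal sketch for spotting rows that are not Up in such output follows. It is a text heuristic over the table layout shown above rather than a documented interface, and not_up is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Read `compose ps` table output on stdin and print the NAME of every
# container whose row does not contain an "Up ..." status.
# Lines starting with INFO, WARN or the NAME header are skipped.
not_up() {
  awk '/^(INFO|WARN|NAME)/ { next }
       NF && $0 !~ /[[:space:]]Up[[:space:]]/ { print $1 }'
}
```

Piping the output of ./cli compose ps into not_up would print nothing for the table above, since all twelve containers are Up.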
zhengkang2020 commented 2 weeks ago

The mis-server logs contain error messages:

[root@scow scow]# ./cli compose logs mis-server
INFO: Loaded plugins: []
WARN[0000] /scow/docker-compose-1718875515775.yml: `version` is obsolete
mis-server-1  |
mis-server-1  | > @scow/mis-server@1.5.2 serve
mis-server-1  | > node build/index.js
mis-server-1  |
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:44.986Z","pid":17,"hostname":"bc7a670af1a1","msg":"Hook is not configured."}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:46.011Z","pid":17,"hostname":"bc7a670af1a1","version":{"commit":"6bde35bd1fc62b5e2187123cbe09cc1227a8ef10"},"msg":"@scow/mis-server: "}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:46.012Z","pid":17,"hostname":"bc7a670af1a1","config":{"HOST":"0.0.0.0","PORT":5000,"LOG_LEVEL":"info","LOG_PRETTY":false,"SSH_PRIVATE_KEY_PATH":"/root/.ssh/id_rsa","SSH_PUBLIC_KEY_PATH":"/root/.ssh/id_rsa.pub","AUTH_URL":"","DB_PASSWORD":"must!chang3this"},"msg":"Loaded env config"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:46.981Z","pid":17,"hostname":"bc7a670af1a1","msg":"Checking if root can login to hpc-test by login node login"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:47.149Z","pid":17,"hostname":"bc7a670af1a1","msg":"Root can login to hpc-test by login node login"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:47.155Z","pid":17,"hostname":"bc7a670af1a1","msg":"Update cluster entity started."}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:47.226Z","pid":17,"hostname":"bc7a670af1a1","plugin":"price","msg":"Default Price Map: {}"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:47.226Z","pid":17,"hostname":"bc7a670af1a1","plugin":"price","msg":"Tenant specific prices {}"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.382Z","pid":17,"hostname":"bc7a670af1a1","plugin":"price","msg":"Executing on hpc01 success"}
mis-server-1  | {"level":40,"time":"2024-06-20T09:22:48.383Z","pid":17,"hostname":"bc7a670af1a1","plugin":"price","msg":"\n      The following price items are missing in platform scope: [\"hpc01.compute.normal\",\"hpc01.compute.low\",\"hpc01.compute.high\"].\n      An error will be thrown when such a job is fetched.\n    "}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.573Z","pid":17,"hostname":"bc7a670af1a1","plugin":"fetch","msg":"Fetch info started."}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.723Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"misConfig.periodicSyncStatus?.cron: 0 4 * * *"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.725Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Sync block status started."}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.729Z","pid":17,"hostname":"bc7a670af1a1","plugin":"cache","msg":"Cache clear scheduled task started."}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.785Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Current clusters list: (Cluster ID: hpc01) : ACTIVATED"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.802Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated block status in slurm of the following accounts: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.802Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated block status failed in slurm of the following accounts: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.802Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated block status in slurm of the following user account: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.803Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated block status failed in slurm of the following user account: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.819Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Current clusters list: (Cluster ID: hpc01) : ACTIVATED"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.820Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated unblock status in slurm of the following accounts: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.821Z","pid":17,"hostname":"bc7a670af1a1","plugin":"syncBlockStatus","msg":"Updated unblock status failed in slurm of the following accounts: []"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:22:48.883Z","pid":17,"hostname":"bc7a670af1a1","msg":"Listening at 5000"}
mis-server-1  | (node:17) DeprecationWarning: Calling start() is no longer necessary. It can be safely omitted.
mis-server-1  | (Use `node --trace-deprecation ...` to show where the warning was created)
mis-server-1  | {"level":30,"time":"2024-06-20T09:23:09.743Z","pid":17,"hostname":"bc7a670af1a1","req":"1","path":"/scow.server.InitService/QuerySystemInitialized","msg":"Starting request"}
mis-server-1  | {"level":30,"time":"2024-06-20T09:23:09.753Z","pid":17,"hostname":"bc7a670af1a1","req":"1","path":"/scow.server.InitService/QuerySystemInitialized","msg":"Request completed."}

Could you please take a look at where the problem is?
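
In the log above, every entry has level 30 except the missing price items message, which has level 40. Assuming the output follows the pino convention (30 info, 40 warn, 50 error, 60 fatal), that message is a warning and no entry reaches error level. A minimal sketch for filtering such output is shown here. It assumes each line carries a "service  | " prefix as above, and filter_warnings is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# Read compose log lines on stdin, strip the "service  | " prefix, and print
# only JSON entries whose "level" is 40, 50 or 60 (warn, error, fatal under
# the assumed pino convention). Lines without a prefix are dropped.
filter_warnings() {
  sed -n 's/^[^|]*| *//p' | grep '^{"level":[4-6]0,' || true
}
```

For example, ./cli compose logs mis-server | filter_warnings would reduce the output above to the single price items warning.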

zhengkang2020 commented 2 weeks ago

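The behavior described above occurs with the master branch; with the 1.5.2 branch everything works normally!

After updating the CLI to 1.6, everything works normally!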

zhengkang2020 commented 2 weeks ago

After upgrading the CLI to version 1.6, everything works normally!