PKUHPC / scow-slurm-adapter


Help needed: adapter returns "The gres set error" #15

Closed menkeyi001 closed 5 months ago

menkeyi001 commented 5 months ago

scow

Fetching the cluster info fails.
Request: http://10.192.1.87/api/dashboard/getClusterInfo?clusterId=hpc01
Response: {"code":"ADAPTER_CALL_ON_ONE_ERROR","details":"Cluster ID : hpc01 Details : Error: 5 NOT_FOUND: The gres set error.","clusterErrorsArray":[{"clusterId":"hpc01","details":{"code":5,"details":"The gres set error.","metadata":{"content-type":["application/grpc"],"grpc-status-details-bin":[{"type":"Buffer","data":[8,5,18,19,84,104,101,32,103,114,101,115,32,115,101,116,32,101,114,114,111,114,46,26,60,10,40,116,121,112,101,46,103,111,111,103,108,101,97,112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,69,114,114,111,114,73,110,102,111,18,16,10,14,71,82,69,83,95,78,79,84,95,70,79,85,78,68]}]}}}]}
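The `grpc-status-details-bin` buffer in that response is a serialized `google.rpc.Status` message, so it can be decoded to see whether it carries more detail than the plain error string. The following is a minimal sketch, not part of SCOW or the adapter: the helpers `read_varint` and `parse_fields` are hypothetical names, and the parser handles only the two protobuf wire types that appear in this buffer (varint and length-delimited).

```python
# Bytes copied from the "data" array of grpc-status-details-bin in the response.
DATA = [8,5,18,19,84,104,101,32,103,114,101,115,32,115,101,116,32,101,114,114,
        111,114,46,26,60,10,40,116,121,112,101,46,103,111,111,103,108,101,97,
        112,105,115,46,99,111,109,47,103,111,111,103,108,101,46,114,112,99,46,
        69,114,114,111,114,73,110,102,111,18,16,10,14,71,82,69,83,95,78,79,84,
        95,70,79,85,78,68]


def read_varint(buf, pos):
    """Read one protobuf varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf):
    """Parse a protobuf message into {field_number: value}.

    Hypothetical helper: supports only wire type 0 (varint) and
    wire type 2 (length-delimited), and keeps the last value per field.
    """
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[number] = value
    return fields


# google.rpc.Status: 1 = code, 2 = message, 3 = details (google.protobuf.Any)
status = parse_fields(bytes(DATA))
code = status[1]
message = status[2].decode()

# google.protobuf.Any: 1 = type_url, 2 = value (the packed message)
detail = parse_fields(status[3])
type_url = detail[1].decode()

# google.rpc.ErrorInfo: 1 = reason
reason = parse_fields(detail[2])[1].decode()

print(code, message, type_url, reason)
```

Decoded this way, the buffer holds status code 5 with the message "The gres set error." and a single `google.rpc.ErrorInfo` detail whose reason is `GRES_NOT_FOUND`. It names the error category but adds no further diagnostic text.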

Can scow-slurm-adapter be run in debug mode? The adapter's log does not show any error either.

Slurm was compiled and installed from source. The directory layout is as follows:
(base) root@slurmcontroller:/adapter# ls /etc/slurm/
bin  etc  include  lib  sbin  share
(base) root@slurmcontroller:/adapter# ls /etc/slurm/etc/
cgroup.conf  gres.conf  plugstack.conf  plugstack.conf.d  slurm.conf  slurm.conf.bak  slurmdbd.conf

Adapter configuration:
(base) root@slurmcontroller:/adapter# cat config/config.yaml 
# Slurm database configuration
mysql:
  host: 10.192.1.39
  port: 3306
  user: root
  dbname: slurm_acct_db
  password: abc@123
  clustername: cluster
  databaseencode: latin1

# Service port
service:
  port: 8972

# Slurm default QoS
slurm:
  defaultqos: normal
  slurmpath: /etc/slurm/

# Path to the module profile file
modulepath:
  path: /data/share/software/module/5.2.0/init/profile.sh

vanstriker commented 5 months ago

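It looks like something is wrong with the gres settings that Slurm uses to configure GPU resources. As a first step, check whether the Slurm cluster itself works properly.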

menkeyi001 commented 5 months ago

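> It looks like something is wrong with the gres settings that Slurm uses to configure GPU resources. As a first step, check whether the Slurm cluster itself works properly.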

The Slurm cluster schedules GPU jobs normally.

This is my gres configuration:
(base) root@slurmcontroller:~# cat /etc/slurm/etc/gres.conf
NodeName=compute Name=gpu Type=V100 File=/dev/nvidia[0-7]
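To sanity-check what that gres.conf line declares, it can be split into its `Key=Value` pairs and the bracketed device range expanded. This is only an illustrative sketch: `parse_gres_line` and `expand_device_range` are hypothetical helpers, they handle just a single simple `[start-end]` range rather than the full syntax Slurm accepts, and they are not how Slurm or the adapter reads the file.

```python
import re


def parse_gres_line(line):
    """Split one gres.conf line into a dict of Key=Value pairs."""
    return dict(token.split("=", 1) for token in line.split())


def expand_device_range(path):
    """Expand a single bracketed numeric range such as /dev/nvidia[0-7].

    Paths without a [start-end] range are returned unchanged.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", path)
    if match is None:
        return [path]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(start), int(end) + 1)]


line = "NodeName=compute Name=gpu Type=V100 File=/dev/nvidia[0-7]"
entry = parse_gres_line(line)
devices = expand_device_range(entry["File"])

print(entry["NodeName"], entry["Name"], entry["Type"], len(devices))
```

For the line above this reports node `compute` with eight `gpu` devices of type `V100`, from `/dev/nvidia0` through `/dev/nvidia7`.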

Is there any way to find out where the problem is?

I ran the test:
(base) root@slurmcontroller:~/scow-slurm-adapter# go test -v tests/config/GetClusterConfig_test.go
=== RUN TestGetClusterConfig
--- PASS: TestGetClusterConfig (0.07s)
PASS
ok command-line-arguments 0.081s

menkeyi001 commented 5 months ago

Surprisingly, after rebuilding, it runs normally.