PKUHPC / OpenSCOW

Super Computing On Web
https://www.pkuscow.com/
Mulan Permissive Software License, Version 2

[Help] When 2 GPU nodes are added and the gpu partition is selected, the options shown are still those of the cpu partition. #1445

Closed zhengkang2020 closed 3 weeks ago

zhengkang2020 commented 3 weeks ago

Is there an existing issue / discussion for this?

What happened

With 2 GPU nodes added in slurm.conf, after selecting the gpu partition when creating a new job, the options shown are those of the cpu partition; with only one GPU node kept, the gpu partition options are displayed correctly. Screenshot of selecting the gpu partition for a new job with 2 GPU nodes in slurm.conf: (screenshot)

With only one GPU node kept, the gpu partition is displayed correctly: (screenshot)

What did you expect to happen

Multiple GPU nodes should all be usable normally.

Did this work before?

Yes, it worked with a single GPU node.

Steps To Reproduce

1. Add the GPU node in slurm.conf: uncomment the NodeName=hpc-g02 line and the partition line with Nodes=hpc-g01,hpc-g02, and comment out the Nodes=hpc-g01 and Nodes=hpc-g02 partition lines; the situation above then appears.

......
NodeName=hpc-g01 NodeAddr=192.168.55.191  CPUs=192 CoresPerSocket=48 ThreadsPerCore=2 RealMemory=1547449 Sockets=2 State=UNKNOWN Gres=gpu:6
#NodeName=hpc-g02 NodeAddr=192.168.55.192  CPUs=192 CoresPerSocket=48 ThreadsPerCore=2 RealMemory=1547450 Sockets=2 State=UNKNOWN Gres=gpu:6

################################################
#                  PARTITIONS                  #
################################################
PartitionName=compute Nodes=hpc-c0[1-5] Default=YES MaxTime=INFINITE State=UP
PartitionName=gpu Nodes=hpc-g01 Default=NO MaxTime=INFINITE State=UP
#PartitionName=gpu Nodes=hpc-g01,hpc-g02 Default=NO MaxTime=INFINITE State=UP
#PartitionName=gpu Nodes=hpc-g02 Default=NO MaxTime=INFINITE State=UP

The contents of gres.conf are unchanged:

AutoDetect=off
NodeName=hpc-g01 Name=gpu  File=/dev/nvidia[0-5]
NodeName=hpc-g02 Name=gpu  File=/dev/nvidia[0-5]

There is an error in the portal-server log; I am not sure whether it is related.

scow-portal-server-1  | Error: 14 UNAVAILABLE: No connection established. Last error: connect ECONNREFUSED 172.30.0.11:5000 (2024-10-17T00:03:35.565Z)
scow-portal-server-1  |     at callErrorFromStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
scow-portal-server-1  |     at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/client.js:193:76)
scow-portal-server-1  |     at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
scow-portal-server-1  |     at Object.onReceiveStatus (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
scow-portal-server-1  |     at /app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
scow-portal-server-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
scow-portal-server-1  | for call at
scow-portal-server-1  |     at ServiceClientImpl.makeUnaryRequest (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/client.js:161:32)
scow-portal-server-1  |     at ServiceClientImpl.getClustersRuntimeInfo (/app/node_modules/.pnpm/@grpc+grpc-js@1.10.9/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
scow-portal-server-1  |     at /app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.9/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:18:13
scow-portal-server-1  |     at new Promise (<anonymous>)
scow-portal-server-1  |     at asyncClientCall (/app/node_modules/.pnpm/@ddadaal+tsgrpc-client@0.17.7_@grpc+grpc-js@1.10.9/node_modules/@ddadaal/tsgrpc-client/lib/unary.js:15:12)
scow-portal-server-1  |     at libGetCurrentActivatedClusters (/app/libs/server/build/misCommon/clustersActivation.js:37:61)
scow-portal-server-1  |     at createServer (/app/apps/portal-server/build/app.js:54:89)
scow-portal-server-1  |     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
scow-portal-server-1  |     at async main (/app/apps/portal-server/build/index.js:16:20) {
scow-portal-server-1  |   code: 14,
scow-portal-server-1  |   details: 'No connection established. Last error: connect ECONNREFUSED 172.30.0.11:5000 (2024-10-17T00:03:35.565Z)',
scow-portal-server-1  |   metadata: Metadata { internalRepr: Map(0) {}, options: {} }
scow-portal-server-1  | }
scow-portal-server-1  | 
scow-portal-server-1  | Node.js v20.16.0

Environment

- OS: Rockylinux9.4
- Scheduler: slurm-23.02.6
- Docker: Docker version 24.0.7
- Docker-compose: Docker Compose version v2.23.3
- SCOW cli: 1.6.3
- SCOW: 1.6.3
- Adapter: 1.6

Anything else?

Driver version on the newly added GPU node: NVIDIA-SMI 560.35.03, Driver Version: 560.35.03, CUDA Version: 12.6

Miracle575 commented 3 weeks ago

Please also provide the response of the getAvailablePartitionsForCluster API.

zhengkang2020 commented 3 weeks ago

General:

Request URL: https://XXXX/api/job/getAvailablePartitionsForCluster?cluster=hpc01&accountName=mytest
Request Method: GET
Status Code: 304 Not Modified
Remote Address: 192.168.55.82:443
Referrer Policy: strict-origin-when-cross-origin

Response headers: (screenshot)

Miracle575 commented 3 weeks ago

What we need is more likely the actual content of the response rather than the response headers.

zhengkang2020 commented 3 weeks ago
{
    "partitions": [
        {
            "name": "compute",
            "memMb": 2317628,
            "cores": 416,
            "gpus": 0,
            "nodes": 5,
            "qos": [
                "normal",
                "low",
                "high"
            ],
            "comment": ""
        },
        {
            "name": "gpu",
            "memMb": 3094899,
            "cores": 384,
            "gpus": 0,
            "nodes": 2,
            "qos": [
                "normal",
                "low",
                "high"
            ],
            "comment": ""
        }
    ]
}

Could there be something wrong with the configuration for the g02 node on the cluster?

Miracle575 commented 3 weeks ago

The UI display problem is caused by the API returning gpus as 0. That value depends on the data returned by Slurm. Please run scontrol show node={your node name} | grep ' Gres=' | awk -F':' '{print $NF}' on a Slurm node and provide the output.
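For reference, a minimal Go sketch of what that pipeline computes, assuming the node name hpc-g01 from this thread; it is an illustration only, not OpenSCOW or adapter code: it runs scontrol, finds the Gres= field, and takes the last ':'-separated token as the per-node GPU count.

// gpucount.go: illustration of the pipeline above, i.e.
// scontrol show node=<name> | grep ' Gres=' | awk -F':' '{print $NF}'
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// gpuCountForNode returns the GPU count advertised in a node's Gres field,
// e.g. "Gres=gpu:6" -> 6.
func gpuCountForNode(node string) (int, error) {
	out, err := exec.Command("scontrol", "show", "node="+node).Output()
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(out), "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "Gres=") {
			continue
		}
		// Take the token after the last ':' (what the awk command prints).
		parts := strings.Split(line, ":")
		return strconv.Atoi(parts[len(parts)-1])
	}
	return 0, fmt.Errorf("no Gres= line for node %s", node)
}

func main() {
	n, err := gpuCountForNode("hpc-g01") // node name from this thread
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("gpus:", n)
}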

zhengkang2020 commented 3 weeks ago
[root@hpc-g01 ~]# scontrol show node=hpc-g02 | grep ' Gres=' | awk -F':' '{print $NF}'
6
Miracle575 commented 3 weeks ago

What about hpc-g01? Is it not enabled?

zhengkang2020 commented 3 weeks ago
[root@hpc-g01 ~]# scontrol show node=hpc-g01 | grep ' Gres=' | awk -F':' '{print $NF}'
6

Both are GPU servers with 6 cards, and the query results are the same.

283713406 commented 3 weeks ago

Please run scontrol show partition={your gpu partition name} | grep -i ' Nodes=' | awk -F'=' '{print $2}' on a Slurm node and provide the output.

zhengkang2020 commented 3 weeks ago
[root@hpc-g01 ~]# scontrol show partition=gpu | grep -i ' Nodes=' | awk -F'=' '{print $2}'
hpc-g[01-02]
283713406 commented 3 weeks ago
[root@hpc-g01 ~]# scontrol show partition=gpu | grep -i ' Nodes=' | awk -F'=' '{print $2}'
hpc-g[01-02]

Could you change hpc-g[01-02] to hpcGpu[01-02]? That is, the name before the [] must not contain the - character. Try again after making the change.

zhengkang2020 commented 3 weeks ago

In that case, wouldn't I have to change the GPU hostnames? And to keep the naming consistent, would the other SCOW nodes all have to be changed too?

What is the reasoning behind changing hpc-g[01-02] to hpcGpu[01-02]?

283713406 commented 3 weeks ago

In that case, wouldn't I have to change the GPU hostnames? And to keep the naming consistent, would the other SCOW nodes all have to be changed too?

What is the reasoning behind changing hpc-g[01-02] to hpcGpu[01-02]?

Currently, the adapter code parses the node names in a partition based on the [ and - characters; a - in the prefix before the [] prevents it from parsing the correct nodes.
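To make the parsing issue concrete, here is a minimal Go sketch (not the adapter's actual code) of a bracket-aware expansion of a node list like hpc-g[01-02]: the range is split on the - inside the brackets only, so a - in the prefix such as hpc-g does not break the result.

// nodelist.go: bracket-aware expansion of a Slurm node list such as
// "hpc-g[01-02]". Illustration only, not the adapter's actual code.
package main

import (
	"fmt"
	"strings"
)

// expandNodeList handles the simple "prefix[start-end]" form and returns the
// individual node names, preserving zero padding (01, 02, ...). The key point
// is that the range is split on the '-' found inside the brackets, so a '-'
// in the prefix (e.g. "hpc-g") does not confuse the parse.
func expandNodeList(list string) ([]string, error) {
	open := strings.Index(list, "[")
	if open < 0 {
		return []string{list}, nil // plain hostname, nothing to expand
	}
	closeIdx := strings.Index(list, "]")
	if closeIdx < open {
		return nil, fmt.Errorf("unbalanced brackets in %q", list)
	}
	prefix := list[:open]                // "hpc-g" -- may legitimately contain '-'
	rangeSpec := list[open+1 : closeIdx] // "01-02" -- only this part is a range
	bounds := strings.SplitN(rangeSpec, "-", 2)
	if len(bounds) != 2 {
		return []string{prefix + rangeSpec}, nil // single index, e.g. "hpc-g[01]"
	}
	width := len(bounds[0]) // keep the zero padding of the start index
	var start, end int
	if _, err := fmt.Sscanf(bounds[0], "%d", &start); err != nil {
		return nil, err
	}
	if _, err := fmt.Sscanf(bounds[1], "%d", &end); err != nil {
		return nil, err
	}
	var nodes []string
	for i := start; i <= end; i++ {
		nodes = append(nodes, fmt.Sprintf("%s%0*d", prefix, width, i))
	}
	return nodes, nil
}

func main() {
	nodes, err := expandNodeList("hpc-g[01-02]") // node list from this issue
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(nodes) // [hpc-g01 hpc-g02]
}

Once the node names are recovered correctly, the per-node Gres counts queried earlier in this thread (6 on each node) would presumably feed into the partition's gpus field, which is the value that came back as 0 in the response above.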

zhengkang2020 commented 3 weeks ago

With only hpc-g01 it works, but then there is this problem: https://github.com/PKUHPC/OpenSCOW/issues/1000

Is that related to this issue?

283713406 commented 3 weeks ago

With only hpc-g01 it works, but then there is this problem: #1000

Is that related to this issue?

That one is unrelated; the CPU core count is obtained with this command: scontrol show partition=%s | grep TotalCPUs | awk '{print $2}' | awk -F'=' '{print $2}'
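For reference, a hedged Go sketch equivalent to that pipeline, assuming the partition name gpu from this thread; it is an illustration only, not OpenSCOW or adapter code.

// totalcpus.go: sketch of the TotalCPUs lookup quoted above, i.e.
// scontrol show partition=<name> | grep TotalCPUs | awk '{print $2}' | awk -F'=' '{print $2}'
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// totalCPUs extracts the TotalCPUs=<n> value from `scontrol show partition=<name>`.
func totalCPUs(partition string) (int, error) {
	out, err := exec.Command("scontrol", "show", "partition="+partition).Output()
	if err != nil {
		return 0, err
	}
	for _, field := range strings.Fields(string(out)) {
		if strings.HasPrefix(field, "TotalCPUs=") {
			return strconv.Atoi(strings.TrimPrefix(field, "TotalCPUs="))
		}
	}
	return 0, fmt.Errorf("TotalCPUs not found for partition %s", partition)
}

func main() {
	n, err := totalCPUs("gpu") // partition name from this thread
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("TotalCPUs:", n)
}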

zhengkang2020 commented 3 weeks ago

Changing the hostnames would cause problems for the GPU jobs that are currently running. Before making the change, is there a way to check the adapter logs to see whether it is parsing the GPU nodes?

283713406 commented 3 weeks ago

Changing the hostnames would cause problems for the GPU jobs that are currently running. Before making the change, is there a way to check the adapter logs to see whether it is parsing the GPU nodes?

Did you build the adapter yourself? You can pull the latest adapter code, rebuild the adapter, and replace it; that should solve this problem.

zhengkang2020 commented 3 weeks ago

Yes, the adapter is built by us; it is currently built from version 1.6. Has this problem been fixed recently?

283713406 commented 3 weeks ago

Yes, the adapter is built by us; it is currently built from version 1.6. Has this problem been fixed recently?

https://github.com/PKUHPC/scow-slurm-adapter/pull/23 . This PR fixes this problem; you can port it into the 1.6 code you have downloaded and rebuild.

zhengkang2020 commented 3 weeks ago

The problem is solved, thank you.