PKUHPC / OpenSCOW

Super Computing On Web
https://www.pkuscow.com/
Mulan Permissive Software License, Version 2

[Bug/Help] mis: After setting a partition QOS, partition billing no longer supports per-job-QOS settings #1461

Open · Cloudac7 opened 4 days ago

Cloudac7 commented 4 days ago

Is there an existing issue / discussion for this?

What happened

For internal management purposes, we set a partition QOS (gpu_qos) on the GPU partition via Slurm, independent of the global job QOSes (normal and long). Its main purpose is to require users to request at least 1 GPU card per GPU job, so as to keep utilization high. Because the global QOSes also apply to the CPU partition, this requirement cannot be expressed separately in the job QOSes, so in practice the strategy above is the only option.
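For context, this is roughly how such a setup is expressed (a sketch based on the configuration shown later in this issue; exact sacctmgr option names may vary slightly between Slurm versions):

# Define a partition QOS that requires at least one tesla GPU per job
sacctmgr add qos gpu_qos
sacctmgr modify qos gpu_qos set MinTRESPerJob=gres/gpu:tesla=1
# slurm.conf: attach it to the gpu partition; the job QOSes stay global
PartitionName=gpu Nodes=gpu[001-006] QOS=gpu_qos AllowQos=normal,long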

[Screenshot: SCOW job price table settings page for the GPU partition]

However, as shown in the screenshot above, after this QOS was set, SCOW no longer allows the GPU partition's billing to be configured separately for normal and long, and checking the backend shows that jobs are still being charged as before.

What did you expect to happen

Billing can be configured correctly and separately for the normal and long QOSes.

Did this work before?

Yes, it worked normally in v1.6.3.

Steps To Reproduce

Open the job price table settings page in SCOW.

Environment

- OS: KOS 5.8
- Scheduler: Slurm 21.08.8
- Docker: 26.1.3
- Docker-compose: 26.1.3
- SCOW cli: 1.6.3
- SCOW: 1.6.3
- Adapter: 1.6.0

Anything else?

No response

piccaSun commented 4 days ago

If convenient, could you please provide the QOS settings of the partitions related to the QOS above via scontrol show partition, as well as the partition assignment of each QOS via sacctmgr show qos format=Name,Partition, to help us investigate this issue further.

Cloudac7 commented 4 days ago

> If convenient, could you please provide the QOS settings of the partitions related to the QOS above via scontrol show partition, as well as the partition assignment of each QOS via sacctmgr show qos format=Name,Partition, to help us investigate this issue further.

The output of scontrol show partition is as follows:

# scontrol show partition
PartitionName=cpu
   AllowGroups=ALL AllowAccounts=ai4ecaig,ai4ecailoc,ai4ecall,ai4ecccg,ai4ecctmig,ai4ececg,ai4eceeg,ai4ecepg,ai4ecmig,ai4ecnimte,baoljgroup,bnulizdgroup,brengroup,caogroup,caoshgroup,caoxrgroup,caoxygroup,caozxgroup,cfdai,chenggroup,chengjungroup,chenhygroup,chenlingroup,chxgroup,cpddai,csygroup,dengxianming,dicpyuliang,dpikkem,duanamgroup,dwzhougroup,fangngroup,fengmingbaogroup,gonglgroup,gxpgroup,hciscgroup,houxugroup,hthiumtest,huanghlgroup,huangjlgroup,huangqlgroup,huangweigroup,huangwengroup,hujungroup,hwjgroup,jfligroup,jinyugroup,kechgroup,lichgroup,lijinggroup,lintianweigroup,liswgroup,liugkgroup,liuhygroup,liyegroup,luoyuanronggroup,luweihuagroup,lvtygroup,maruigroup,maslgroup,mengcgroup,mslgroup,nfanggroup,nfzhenggroup,pavlogroup,qgzhanggroup,qikaigroup,rjxiegroup,shuaiwanggroup,songkaixingroup,sungroup,test,test1,test2,tianxingwugroup,tianygroup,tuzhangroup,txionggroup,ustbhushuxian,wangcgroup,wangjgroup,wangslgroup,wangtinggroup,wbjgroup,wcgroup,wenyhgroup,wucxgroup,wusqgroup,xinlugroup,xmuchemcamp,xmuewccgroup,xmuldk,xuehuijiegroup,yigroup,yijgroup,yixiaodonggroup,youycgroup,yuhrgroup,yushilingroup,ywjianggroup,zenghuabingroup,zhandpgroup,zhanghcgroup,zhangqianggroup,zhangyygroup,zhangzengkaigroup,zhangzhgroup,zhangzhongnangroup,zhaohonggroup,zhaoyungroup,zhengqianggroup,zhengxhgroup,zhouweigroup,zhujungroup,zhuzzgroup,zlonggroup,zpmaogroup,ai4ecxmri,ai4ecgeely,zhanghuiminggroup,ai4ec,ai4ecspectr,sunyfgroup,lishaobingroup,huangkgroup,fugroup AllowQos=normal,long
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cu[001-389]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=24896 TotalNodes=389 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=1024 MaxMemPerCPU=4096

PartitionName=gpu
   AllowGroups=ALL AllowAccounts=ai4ecaig,ai4ecailoc,ai4ecall,ai4ecccg,ai4ecctmig,ai4ececg,ai4eceeg,ai4ecepg,ai4ecmig,ai4ecnimte,baoljgroup,bnulizdgroup,brengroup,caogroup,caoshgroup,caoxrgroup,caoxygroup,caozxgroup,cfdai,chenggroup,chengjungroup,chenhygroup,chenlingroup,chxgroup,cpddai,csygroup,dengxianming,dicpyuliang,dpikkem,duanamgroup,dwzhougroup,fangngroup,fengmingbaogroup,gonglgroup,gxpgroup,hciscgroup,houxugroup,hthiumtest,huanghlgroup,huangjlgroup,huangqlgroup,huangweigroup,huangwengroup,hujungroup,hwjgroup,jfligroup,jinyugroup,kechgroup,lichgroup,lijinggroup,lintianweigroup,liswgroup,liugkgroup,liuhygroup,liyegroup,luoyuanronggroup,luweihuagroup,lvtygroup,maruigroup,maslgroup,mengcgroup,mslgroup,nfanggroup,nfzhenggroup,pavlogroup,qgzhanggroup,qikaigroup,rjxiegroup,shuaiwanggroup,songkaixingroup,sungroup,test,test1,test2,tianxingwugroup,tianygroup,tuzhangroup,txionggroup,ustbhushuxian,wangcgroup,wangjgroup,wangslgroup,wangtinggroup,wbjgroup,wcgroup,wenyhgroup,wucxgroup,wusqgroup,xinlugroup,xmuchemcamp,xmuewccgroup,xmuldk,xuehuijiegroup,yigroup,yijgroup,yixiaodonggroup,youycgroup,yuhrgroup,yushilingroup,ywjianggroup,zenghuabingroup,zhandpgroup,zhanghcgroup,zhangqianggroup,zhangyygroup,zhangzengkaigroup,zhangzhgroup,zhangzhongnangroup,zhaohonggroup,zhaoyungroup,zhengqianggroup,zhengxhgroup,zhouweigroup,zhujungroup,zhuzzgroup,zlonggroup,zpmaogroup,ai4ecxmri,ai4ecgeely,zhanghuiminggroup,ai4ec,ai4ecspectr,sunyfgroup,lishaobingroup,huangkgroup,fugroup AllowQos=normal,long
   AllocNodes=ALL Default=NO QoS=gpu_qos
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=gpu[001-006]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=16384 MaxMemPerCPU=24576

PartitionName=fat
   AllowGroups=ALL AllowAccounts=ai4ecaig,ai4ecailoc,ai4ecall,ai4ecccg,ai4ecctmig,ai4ececg,ai4eceeg,ai4ecepg,ai4ecmig,ai4ecnimte,baoljgroup,bnulizdgroup,brengroup,caogroup,caoshgroup,caoxrgroup,caoxygroup,caozxgroup,cfdai,chenggroup,chengjungroup,chenhygroup,chenlingroup,chxgroup,cpddai,csygroup,dengxianming,dicpyuliang,dpikkem,duanamgroup,dwzhougroup,fangngroup,fengmingbaogroup,gonglgroup,gxpgroup,hciscgroup,houxugroup,hthiumtest,huanghlgroup,huangjlgroup,huangqlgroup,huangweigroup,huangwengroup,hujungroup,hwjgroup,jfligroup,jinyugroup,kechgroup,lichgroup,lijinggroup,lintianweigroup,liswgroup,liugkgroup,liuhygroup,liyegroup,luoyuanronggroup,luweihuagroup,lvtygroup,maruigroup,maslgroup,mengcgroup,mslgroup,nfanggroup,nfzhenggroup,pavlogroup,qgzhanggroup,qikaigroup,rjxiegroup,shuaiwanggroup,songkaixingroup,sungroup,test,test1,test2,tianxingwugroup,tianygroup,tuzhangroup,txionggroup,ustbhushuxian,wangcgroup,wangjgroup,wangslgroup,wangtinggroup,wbjgroup,wcgroup,wenyhgroup,wucxgroup,wusqgroup,xinlugroup,xmuchemcamp,xmuewccgroup,xmuldk,xuehuijiegroup,yigroup,yijgroup,yixiaodonggroup,youycgroup,yuhrgroup,yushilingroup,ywjianggroup,zenghuabingroup,zhandpgroup,zhanghcgroup,zhangqianggroup,zhangyygroup,zhangzengkaigroup,zhangzhgroup,zhangzhongnangroup,zhaohonggroup,zhaoyungroup,zhengqianggroup,zhengxhgroup,zhouweigroup,zhujungroup,zhuzzgroup,zlonggroup,zpmaogroup,ai4ecxmri,ai4ecgeely,zhanghuiminggroup,ai4ec,ai4ecspectr,sunyfgroup,lishaobingroup,huangkgroup,fugroup AllowQos=normal,long
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=fat[001-002]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=128 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=8192 MaxMemPerCPU=32768

PartitionName=dpcpu
   AllowGroups=dpikkem AllowAccounts=ai4ecaig,ai4ecailoc,ai4ecall,ai4ecccg,ai4ecctmig,ai4ececg,ai4eceeg,ai4ecepg,ai4ecmig,ai4ecnimte,baoljgroup,bnulizdgroup,brengroup,caogroup,caoshgroup,caoxrgroup,caoxygroup,caozxgroup,cfdai,chenggroup,chengjungroup,chenhygroup,chenlingroup,chxgroup,cpddai,csygroup,dengxianming,dicpyuliang,dpikkem,duanamgroup,dwzhougroup,fangngroup,fengmingbaogroup,gonglgroup,gxpgroup,hciscgroup,houxugroup,hthiumtest,huanghlgroup,huangjlgroup,huangqlgroup,huangweigroup,huangwengroup,hujungroup,hwjgroup,jfligroup,jinyugroup,kechgroup,lichgroup,lijinggroup,lintianweigroup,liswgroup,liugkgroup,liuhygroup,liyegroup,luoyuanronggroup,luweihuagroup,lvtygroup,maruigroup,maslgroup,mengcgroup,mslgroup,nfanggroup,nfzhenggroup,pavlogroup,qgzhanggroup,qikaigroup,rjxiegroup,shuaiwanggroup,songkaixingroup,sungroup,test,test1,test2,tianxingwugroup,tianygroup,tuzhangroup,txionggroup,ustbhushuxian,wangcgroup,wangjgroup,wangslgroup,wangtinggroup,wbjgroup,wcgroup,wenyhgroup,wucxgroup,wusqgroup,xinlugroup,xmuchemcamp,xmuewccgroup,xmuldk,xuehuijiegroup,yigroup,yijgroup,yixiaodonggroup,youycgroup,yuhrgroup,yushilingroup,ywjianggroup,zenghuabingroup,zhandpgroup,zhanghcgroup,zhangqianggroup,zhangyygroup,zhangzengkaigroup,zhangzhgroup,zhangzhongnangroup,zhaohonggroup,zhaoyungroup,zhengqianggroup,zhengxhgroup,zhouweigroup,zhujungroup,zhuzzgroup,zlonggroup,zpmaogroup,ai4ecxmri,ai4ecgeely,zhanghuiminggroup,ai4ec,ai4ecspectr,sunyfgroup,lishaobingroup,huangkgroup,fugroup AllowQos=unlimit
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cu[001-300]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=19200 TotalNodes=300 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=1024 MaxMemPerCPU=4096

PartitionName=dpgpu
   AllowGroups=dpikkem AllowAccounts=ai4ecaig,ai4ecailoc,ai4ecall,ai4ecccg,ai4ecctmig,ai4ececg,ai4eceeg,ai4ecepg,ai4ecmig,ai4ecnimte,baoljgroup,bnulizdgroup,brengroup,caogroup,caoshgroup,caoxrgroup,caoxygroup,caozxgroup,cfdai,chenggroup,chengjungroup,chenhygroup,chenlingroup,chxgroup,cpddai,csygroup,dengxianming,dicpyuliang,dpikkem,duanamgroup,dwzhougroup,fangngroup,fengmingbaogroup,gonglgroup,gxpgroup,hciscgroup,houxugroup,hthiumtest,huanghlgroup,huangjlgroup,huangqlgroup,huangweigroup,huangwengroup,hujungroup,hwjgroup,jfligroup,jinyugroup,kechgroup,lichgroup,lijinggroup,lintianweigroup,liswgroup,liugkgroup,liuhygroup,liyegroup,luoyuanronggroup,luweihuagroup,lvtygroup,maruigroup,maslgroup,mengcgroup,mslgroup,nfanggroup,nfzhenggroup,pavlogroup,qgzhanggroup,qikaigroup,rjxiegroup,shuaiwanggroup,songkaixingroup,sungroup,test,test1,test2,tianxingwugroup,tianygroup,tuzhangroup,txionggroup,ustbhushuxian,wangcgroup,wangjgroup,wangslgroup,wangtinggroup,wbjgroup,wcgroup,wenyhgroup,wucxgroup,wusqgroup,xinlugroup,xmuchemcamp,xmuewccgroup,xmuldk,xuehuijiegroup,yigroup,yijgroup,yixiaodonggroup,youycgroup,yuhrgroup,yushilingroup,ywjianggroup,zenghuabingroup,zhandpgroup,zhanghcgroup,zhangqianggroup,zhangyygroup,zhangzengkaigroup,zhangzhgroup,zhangzhongnangroup,zhaohonggroup,zhaoyungroup,zhengqianggroup,zhengxhgroup,zhouweigroup,zhujungroup,zhuzzgroup,zlonggroup,zpmaogroup,ai4ecxmri,ai4ecgeely,zhanghuiminggroup,ai4ec,ai4ecspectr,sunyfgroup,lishaobingroup,huangkgroup,fugroup AllowQos=unlimit
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=gpu[001-006]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=384 TotalNodes=6 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=16384 MaxMemPerCPU=24576

And here is the output of sacctmgr show qos format=Name,Partition (for readability I also included MaxWall and MinTRES%20):

# sacctmgr show qos format=Name,Partition,MaxWall,MinTRES%20
      Name  Partition     MaxWall              MinTRES
---------- ---------- ----------- --------------------
    normal             2-00:00:00
      long             4-00:00:00
   unlimit
   gpu_qos                            gres/gpu:tesla=1

piccaSun commented 4 days ago

Thank you for your reply. From your partition information we can see that, for PartitionName=gpu, AllowQos=normal,long combined with QoS=gpu_qos is a configuration we do not handle well; it causes errors when the partition QOS is read.

To work around your current problem, we suggest not specifying a value for QoS (i.e. leaving it as N/A) and setting AllowQos=normal,long,gpu_qos. Please confirm whether the partition QOS pricing or billing problems described above still occur once QoS is configured this way.

If you need to specify a default QoS, we will consider adding support for that later.
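For reference, this is roughly the partition layout the suggestion corresponds to, shown as a slurm.conf sketch (illustrative; other partition parameters kept as in the output above):

# slurm.conf (sketch): drop the partition QOS and add gpu_qos to AllowQos
PartitionName=gpu Nodes=gpu[001-006] AllowQos=normal,long,gpu_qos Default=NO State=UP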

Cloudac7 commented 4 days ago

> Thank you for your reply. From your partition information we can see that, for PartitionName=gpu, AllowQos=normal,long combined with QoS=gpu_qos is a configuration we do not handle well; it causes errors when the partition QOS is read.
>
> To work around your current problem, we suggest not specifying a value for QoS (i.e. leaving it as N/A) and setting AllowQos=normal,long,gpu_qos. Please confirm whether the partition QOS pricing or billing problems described above still occur once QoS is configured this way.
>
> If you need to specify a default QoS, we will consider adding support for that later.

However, as described in the introduction of this issue, the purpose of gpu_qos is to ensure that users request at least 1 GPU card when using the GPU partition; it is independent of the other two QOSes, and the CPU partition does not need this restriction. This is also how the relationship between Partition QOS and Job QOS is described in the official Slurm documentation.

Our requirement is not to specify a default QoS, but to apply a policy to this partition that is independent of the other partitions.
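For illustration, under this model a GPU job is still submitted under one of the global job QOSes, and gpu_qos only adds its per-job minimum on top (a sketch; job.sh is a placeholder):

# Job QOS "normal" governs walltime and billing; the partition QOS gpu_qos
# independently enforces its MinTRES=gres/gpu:tesla=1 requirement on the request.
sbatch -p gpu --qos=normal --gres=gpu:tesla:1 job.sh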

piccaSun commented 3 days ago

Thank you for the additional information.

First, in current OpenSCOW, when a job is submitted or an interactive app is created from the web pages on a GPU partition, at least 1 GPU card must be selected by default. Also, when defining policies for special partitions, we recommend configuring the whole partition, for example by applying a MinTRES limit directly to the GPU partition you mentioned.

Second, regarding the problem you are running into: we currently do not support specifying both AllowQos and QoS on a partition at the same time. If you want the GPU partition to use QOSes independent of the global normal and long, we suggest giving that partition its own AllowQos entries, e.g. normal-gpu-qos and long-gpu-qos, as sketched below.
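A rough sketch of that alternative (the QOS names are illustrative; the MaxWall values mirror the existing normal and long QOSes, and the MinTRES limit mirrors gpu_qos; option names may differ slightly by Slurm version):

# GPU-specific job QOSes carrying both the walltime and the minimum-GPU limits
sacctmgr add qos normal-gpu-qos
sacctmgr modify qos normal-gpu-qos set MaxWall=2-00:00:00 MinTRESPerJob=gres/gpu:tesla=1
sacctmgr add qos long-gpu-qos
sacctmgr modify qos long-gpu-qos set MaxWall=4-00:00:00 MinTRESPerJob=gres/gpu:tesla=1
# slurm.conf: expose only these QOSes on the gpu partition, with no partition QOS
PartitionName=gpu Nodes=gpu[001-006] AllowQos=normal-gpu-qos,long-gpu-qos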

Please confirm whether the above resolves your problem.

Finally, you mentioned that jobs are still being charged in the backend. Could you confirm whether those charges occurred while no price was set for any QOS of the GPU partition under tenant management or platform management? And with the partition configured as QoS=gpu_qos, AllowQos=normal,long, which QOS is recorded in the database for the GPU-partition jobs that were charged?
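One way to check which QOS Slurm actually recorded for recent GPU-partition jobs (a sketch; adjust the start date to your billing period):

sacct -a -X --partition=gpu -S 2024-06-01 --format=JobID,Partition,QOS,Account,State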

Cloudac7 commented 3 days ago

> Thank you for the additional information.
>
> First, in current OpenSCOW, when a job is submitted or an interactive app is created from the web pages on a GPU partition, at least 1 GPU card must be selected by default. Also, when defining policies for special partitions, we recommend configuring the whole partition, for example by applying a MinTRES limit directly to the GPU partition you mentioned.
>
> Second, regarding the problem you are running into: we currently do not support specifying both AllowQos and QoS on a partition at the same time. If you want the GPU partition to use QOSes independent of the global normal and long, we suggest giving that partition its own AllowQos entries, e.g. normal-gpu-qos and long-gpu-qos.
>
> Please confirm whether the above resolves your problem.
>
> Finally, you mentioned that jobs are still being charged in the backend. Could you confirm whether those charges occurred while no price was set for any QOS of the GPU partition under tenant management or platform management? And with the partition configured as QoS=gpu_qos, AllowQos=normal,long, which QOS is recorded in the database for the GPU-partition jobs that were charged?

First of all, thank you for your answer.

Regarding the first point: most of the time, users are accustomed to submitting jobs from the command line, so we need the restriction at the Slurm level. Also, Slurm does not support setting a MinTRES limit on a partition directly; it can only be done through a partition QOS, which I suspect is exactly why the partition QOS feature exists.

Creating new QOSes separate from the existing setup would require users to change their habits. From an operations standpoint we naturally want to affect users as little as possible, so a change of this kind needs further internal discussion on our side. Technically, since Slurm already provides a recommended way to express this policy, bending our setup to fit the tool may not be the best solution either.

As for the second point, we have indeed not changed any settings yet; the original prices for the normal and long QOSes on the GPU partition are still in place. We actually discovered this problem when we were about to make that change.

piccaSun commented 2 days ago

You are right that partitions do not support a MinTRES setting directly; thanks for pointing that out.

Regarding billing, our guess is that you may not have restarted OpenSCOW after changing the QOS configuration, so the original billing rules remained in the database. As a result, jobs that could still be submitted under the normal and long QOSes were still billed under the original rules.

The strategy we currently use in the Slurm adapter is: if a partition QOS is set, we assume by default that users submit jobs only with that partition QOS, so in that case billing rules can only be configured for the partition QOS. Only when no partition QOS is set do we support configuring billing rules for every QOS in AllowQos.

If the need is urgent, you can modify our open-source Slurm adapter to meet your requirements. We also greatly appreciate your valuable feedback and will discuss it further internally to support a more complete range of Slurm configurations.