Closed. ZYFno996 closed this issue 1 year ago.
The SCOW database was not created. Is the install.yaml you posted the complete configuration? If it is, it is incomplete. Try this configuration:
port: 80
basePath: /
imageTag: v1.0.0
portal:
  portMappings: {}
mis:
  dbPassword: must!chang3this
  portMappings: {}
log:
  fluentd:
    logDir: /var/log/fluentd
auth:
  portMappings: {}
audit:
  dbPassword: "must!chang3this"
gateway:
  proxyReadTimeout: 36000s
./cli compose down -v
./cli compose up -d
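The I/O issue depends on your host machine's configuration and on whether other applications have heavy I/O demands. SCOW does not perform that many reads and writes at startup or when startup fails.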
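Thanks for the reply. Error 1 above was fixed after I changed the configuration as you suggested and cleaned up the Docker volumes. Error 2 was also fixed by increasing the memory to 16 GB. There is now a new problem: scow-mis-server-1 keeps restarting automatically, with this error: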
{
  code: 13,
  details: 'cluster: Error: 14 UNAVAILABLE: No connection established',
  metadata: Metadata {
    internalRepr: Map(2) {
      'is_scow_error' => [ '1' ],
      'scow_error_code' => [ 'CLUSTEROPS_ERROR' ]
    },
    options: {}
  }
}
I have checked that scow-slurm-adapter is running normally on port 8972, the mariadb service is healthy, and scontrol show node reports every node as normal.
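For readers of the error object above: the numbers look like standard gRPC status codes, with an outer 13 (INTERNAL) wrapping an inner 14 (UNAVAILABLE) in the details string. The sketch below is only an illustration of how to read it. The helper name inner_status is hypothetical and not part of SCOW, and it assumes the details string follows the "Error: <code> <NAME>: <message>" shape seen in this log.

```python
import re
from typing import Optional, Tuple

# Standard gRPC status codes (numeric value -> name).
GRPC_STATUS = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED", 9: "FAILED_PRECONDITION",
    10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED", 13: "INTERNAL",
    14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

def inner_status(details: str) -> Optional[Tuple[int, str]]:
    """Extract the nested status from a details string (hypothetical helper).

    Assumes the 'Error: <code> <NAME>: <message>' shape seen in this thread;
    returns None when the string does not match.
    """
    match = re.search(r"Error: (\d+) ([A-Z_]+):", details)
    if match is None:
        return None
    code = int(match.group(1))
    # Prefer the name from the table; fall back to the name in the string.
    return code, GRPC_STATUS.get(code, match.group(2))

details = "cluster: Error: 14 UNAVAILABLE: No connection established"
print(GRPC_STATUS[13], "wraps", inner_status(details))
```

UNAVAILABLE with "No connection established" generally means the client never reached the server at all, which points at addressing or networking rather than at credentials.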
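This error means the adapter cannot be reached. Check three things: (1) the adapter: has it started properly? Verify the settings in config.yaml; (2) the SCOW cluster configuration file, under config/clusters on the SCOW deployment node: is the adapter configured correctly there, and does the port match the one in (1)? (3) Is there a firewall, and if so, has port 8972 been opened?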
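A quick way to cover checks (1) and (3) together is to test whether a TCP connection to the adapter port succeeds from wherever SCOW actually runs. This is a minimal sketch, not part of SCOW; can_connect is a hypothetical helper, and the host and port you pass in are whatever your own cluster configuration uses.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Example call (the address is a placeholder; use the host part of your adapterUrl):
# print(can_connect("adapter-host", 8972))
```

Running it on the host only shows that the adapter is listening there. If the SCOW services run in containers, the same check needs to succeed from inside a container for the result to be meaningful.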
(1) The adapter starts normally; both netstat and ps aux show the adapter process. Its log, server.log, is empty.
(2) The cluster configuration is at clusters/hpc01.yml:
displayName: hpc01
loginNodes:
  - name: login01
    address: login01
  #- name: login02
  #  address: login02
adapterUrl: localhost:8972
loginDesktop:
  enabled: true
  wms:
    - name: Xfce
      wm: xfce
  maxDesktops: 3
  desktopsDir: scow/desktops
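(3) The firewall is disabled.
I tried disabling the audit module in install.yaml; it made no difference and the error persists. I also tried removing the cluster configuration from config/clusters/, and the system then starts normally without errors. As soon as I put it back, the error returns, and the error log contains the string "plugin: price". Before the error, however, the log shows "msg":"Root can login to hpc01 by login node login01".
I looked at other similar issues. Other users hit the same error for reasons such as a wrong database password, but I have ruled out database problems, so I am asking for help.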
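Is the adapter port the same? Are the username and password in the adapter configuration the same?
What does the adapter's configuration file look like? Are the adapter and SCOW on the same node?
This is the error.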
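The management node manage01 hosts mariadb, ldap, slurmctld, scow-slurm-adapter, and scow. The adapter is at /root/scow-slurm-adapter and SCOW is at /root/scow. The cluster, named cluster, consists of the login node login01 and the compute node compute01.
Adapter configuration, config.yaml: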
mysql:
  host: 127.0.0.1
  port: 3306
  user: root
  dbname: slurm_acct_db
  password: "81SLURM@@rabGTjN7"
  clustername: cluster
  databaseencode: latin1
service:
  port: 8972
slurm:
  defaultqos: normal
modulepath:
  path: /data/software/module/tools/modules/init/profile.sh
Slurm configuration, slurm.conf:
# slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
################################################
# CONTROL #
################################################
ClusterName=cluster
SlurmctldHost=manage01
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm
#SlurmdUser=root
################################################
# LOGGING & OTHER PATHS #
################################################
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
################################################
# ACCOUNTING #
################################################
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage0
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
################################################
# JOBS #
################################################
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
################################################
# SCHEDULING & ALLOCATION #
################################################
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
################################################
# TIMERS #
################################################
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
################################################
# OTHER #
################################################
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SwitchType=switch/none
TaskPlugin=task/affinity
################################################
# NODES #
################################################
NodeName=manage01 NodeAddr=10.0.0.111 CPUs=4 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7785 State=UNKNOWN
NodeName=login01 NodeAddr=10.0.0.112 CPUs=2 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3753 Procs=1 State=UNKNOWN
NodeName=compute01 NodeAddr=10.0.0.113 CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=31976 Procs=1 State=UNKNOWN
################################################
# PARTITIONS #
################################################
PartitionName=compute Nodes=compute01 Default=YES MaxTime=INFINITE State=UP
Try changing the adapter address in clusters/hpc01.yml from localhost to the IP address.
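A likely explanation, assuming the SCOW services run inside Docker containers (as the ./cli compose commands and the 172.18.0.x addresses in this thread suggest): inside a container, localhost refers to the container itself, not to the host where the adapter listens. The sketch below flags an adapterUrl whose host part is a loopback address. is_loopback_adapter_url is a hypothetical helper, not part of SCOW, and it assumes the plain "host:port" form used in hpc01.yml.

```python
import ipaddress

def is_loopback_adapter_url(adapter_url: str) -> bool:
    """Return True if the host part of a 'host:port' adapterUrl is loopback."""
    host, sep, _port = adapter_url.rpartition(":")
    if not sep:
        # No port given; treat the whole string as the host.
        host = adapter_url
    host = host.strip("[]")  # tolerate bracketed IPv6 literals such as [::1]
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname such as manage01).
        return False

for url in ("localhost:8972", "127.0.0.1:8972", "10.0.0.111:8972"):
    print(url, is_loopback_adapter_url(url))
```

Here 10.0.0.111 is the NodeAddr of manage01 from the slurm.conf posted above; any address of the adapter host that is reachable from inside the containers should serve the same purpose.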
Thanks, it now starts successfully.
Hi Developer,
I tried to deploy SCOW following the docs [https://pkuhpc.github.io/SCOW/docs/deploy], but some errors occurred when I ran ./cli compose up. I would appreciate your help. Here are the details.
Envs
CentOS 7 VM on Windows Server 2019 Hyper-V. LDAP, Slurm, scow-slurm-adapter (compiled from source), scow-cli, and Docker have been installed successfully and passed their tests.
Error 1
① I edited the config files according to the deployment docs. But when I ran ./cli compose up, the docker logs showed that scow-mis-server-1 and scow-audit-server-1 had a MySQL connection error: "Error: Host '172.18.0.x' is not allowed to connect to this MySQL server".
② I ran 'mysql -h -u root -p must!chang3this' (the same password as install.yaml::mis.dbPassword) but failed to connect. Then I used docker exec -it mysql to list the databases and users.
Error 2
About 5 minutes after Docker started, the CentOS system hung, and the task manager of the physical machine (Windows Server) showed 100% I/O on the VM disk. To locate the faulty container, I used the docker start command to start them one by one, and reproduced the problem when starting scow-mis-web-1 or scow-audit-server-1. All other containers were fine.
Yaml config
install.yaml
config/mis.yaml
config/audit.yaml
config/clusters/hpc01.yaml
I have run out of ideas for getting SCOW to launch. Please help. Thanks!
Best, ZYF