PKUHPC / OpenSCOW

Super Computing On Web
https://www.pkuscow.com/
Mulan Permissive Software License, Version 2

./cli compose up error and disk io problem #923

Closed ZYFno996 closed 1 year ago

ZYFno996 commented 1 year ago

Hi Developer,

I am trying to deploy SCOW according to the docs (https://pkuhpc.github.io/SCOW/docs/deploy), but some errors occur when I run ./cli compose up. I would appreciate your help. Here are the details.

Envs

CentOS 7 VM on Windows Server 2019 Hyper-V. LDAP, Slurm, scow-slurm-adapter (compiled from source), scow-cli, and Docker have all been installed successfully and passed their tests.

Error 1

① I edited the config files according to the deploy docs, but when I run ./cli compose up, the Docker logs show that scow-mis-server-1 and scow-audit-server-1 hit a MySQL connection error: "Error: Host '172.18.0.x' is not allowed to connect to this MySQL server".

② I ran 'mysql -h -u root -p must!chang3this' (the same password as install.yaml::mis.dbPassword) but failed to connect. Then I used docker exec -it mysql to list the databases and users.

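For reference, a quick way to see which hosts the root account may connect from is to query mysql.user inside the database container. This is only a diagnostic sketch; the container name scow-db-1 is an assumption and may differ in your deployment.

# Diagnostic sketch (container name "scow-db-1" is assumed; adjust to your deployment).
# List the host patterns each MySQL account is allowed to connect from:
docker exec -it scow-db-1 mysql -uroot -p -e "SELECT user, host FROM mysql.user;"
# If root is only allowed from 'localhost', connections from the 172.18.0.x bridge
# network used by the other containers are rejected with "is not allowed to connect".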

Error 2

About 5 minutes after Docker started, the CentOS system became unresponsive, and Task Manager on the physical machine (Windows Server) showed 100% IO on the VM's disk. To locate the faulty container, I used the docker start command to bring the containers up one by one, and reproduced the problem when starting scow-mis-web-1 or scow-audit-server-1. All the other containers are fine.
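
A minimal sketch of host-side checks that can help narrow this down on the CentOS guest (iostat comes from the sysstat package and may need to be installed first):

docker stats --no-stream    # per-container CPU / memory / block IO at a glance
iostat -x 2 5               # extended device utilisation, 5 samples at 2-second intervals
free -h                     # check whether the VM is swapping because memory is low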

Yaml config

install.yaml

portal:
  basePath: /

mis:
  basePath: "/mis"
  dbPassword: "myst!chang3this"

audit:
  dbPassword: "must!chang3this"

config/mis.yaml

db:
  host: db
  port: 3306
  user: root
  dbName: scow

fetchJobs:
  periodicFetch:
    cron: "10 */10 * * * *"8b

predefinedChargingTypes:
  - 测试

config/audit.yaml

url: audit-server:5000

db:
  host: audit-db
  port: 3306
  user: root
  dbName: scow_audit

config/clusters/hpc01.yaml

displayName: hpc01

loginNodes:
  - name: login01
    address: login01

adapterUrl: localhost:8972

loginDesktop:
  enabled: true
  wms:
    - name: Xfce
      wm: xfce
  maxDesktops: 3
  desktopsDir: scow/desktops
turboVNCPath: /opt/TurboVNC

At this point I have run out of ideas for getting SCOW running. Any help would be much appreciated. Thanks!

Best, ZYF

huangjun0210 commented 1 year ago

The SCOW database has not been created. Is the install.yaml you posted your complete configuration? If it is, it is incomplete. Try the configuration below, then recreate the containers with the two commands that follow it:

port: 80
basePath: /
imageTag: v1.0.0
portal:
  portMappings: {}
mis:
  dbPassword: must!chang3this
  portMappings: {}
log:
  fluentd:
    logDir: /var/log/fluentd
auth:
  portMappings: {}
audit:
  dbPassword: "must!chang3this"
gateway:
  proxyReadTimeout: 36000s

./cli compose down -v
./cli compose up -d
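
After the containers come back up, you can confirm that the databases were created this time. This is only a sketch: the container names are assumptions based on the default compose deployment, and the password is the dbPassword from install.yaml.

# Sketch: verify the MIS and audit databases now exist (container names are assumed).
docker exec -it scow-db-1 mysql -uroot -p'must!chang3this' -e "SHOW DATABASES;"
docker exec -it scow-audit-db-1 mysql -uroot -p'must!chang3this' -e "SHOW DATABASES;"
# Expect "scow" in the first output and "scow_audit" in the second,
# matching dbName in config/mis.yaml and config/audit.yaml.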

Whether you hit the IO problem depends on your host's specs and on whether other applications have heavy IO demands; SCOW itself does not generate that much read/write activity while starting up, even when startup fails.

ZYFno996 commented 1 year ago

SCOW数据库没有生成,install.yaml你贴的的是全部的配置吗?如果是全部的配置应该是不全的,尝试下这个配置:

port: 80
basePath: /
imageTag: v1.0.0
portal:
  portMappings: {}
mis:
  dbPassword: must!chang3this
  portMappings: {}
log:
  fluentd:
    logDir: /var/log/fluentd
auth:
  portMappings: {}
audit:
  dbPassword: "must!chang3this"
gateway:
  proxyReadTimeout: 36000s
./cli compose down -v
./cli compose up -d

IO的问题取决于你宿主机的配置,以及其他应用是否有较大的IO需求,SCOW启动时或启动失败时不会有那么大量的读写操作

Thanks for the reply. Error 1 above is fixed after changing the configuration as you suggested and cleaning up the Docker volumes. Error 2 is also fixed after increasing the VM's memory to 16 GB. A new problem has now appeared: scow-mis-server-1 keeps restarting automatically, reporting the following error:

{
  code: 13,
  details: 'cluster: Error: 14 UNAVAILABLE: No connection established',
  metadata: Metadata {
    internalRepr: Map(2) {
      'is_scow_error' => [ '1' ],
      'scow_error_code' => [ 'CLUSTEROPS_ERROR' ]
    },
    options: {}
  }
}

I have checked: scow-slurm-adapter is running normally on port 8972, the mariadb service is normal, and scontrol show node reports every node as normal.

huangjun0210 commented 1 year ago

This error means the adapter cannot be reached. Check three things: (1) the adapter itself: is it running normally, and is its config.yaml correct? (2) the SCOW cluster configuration file, in the config/clusters directory on the SCOW deployment node: is the adapter address there correct, and does its port match the adapter's own configuration? (3) is there a firewall, and has port 8972 been opened?
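
A rough way to verify (1) and (3) from the command line (a sketch only; on CentOS 7 the nc command comes from the nmap-ncat package and may need to be installed):

# On the node running scow-slurm-adapter: is anything listening on 8972, and on which address?
ss -lntp | grep 8972
# From the SCOW deployment node (or any other machine on the network), test TCP reachability
# using the adapter node's real IP instead of localhost:
nc -zv <adapter-node-ip> 8972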

ZYFno996 commented 1 year ago

(1) The adapter starts normally; both netstat and ps aux can find the adapter process. Its server.log is empty.

(2) The cluster configuration file is clusters/hpc01.yml:

displayName: hpc01

loginNodes:
  - name: login01
    address: login01
  #- name: login02
  #  address: login02

adapterUrl: localhost:8972

loginDesktop:
  enabled: true
  wms:
    - name: Xfce
      wm: xfce

  maxDesktops: 3
  desktopsDir: scow/desktops

(3) The firewall is disabled.

I tried disabling the audit module in install.yaml; it made no difference and the error persisted. I also tried removing the cluster configuration from the config/clusters/ directory, and the system then started normally without errors. Putting it back immediately brings the error back, and the error log mentions "plugin: price". Before the error appears, though, the log shows "msg":"Root can login to hpc01 by login node login01".

I have looked through other similar issues; some users hit the same error for reasons such as a wrong database password, but I have already ruled out database problems. So I am asking for help.

liu-shaobo commented 1 year ago

Is the adapter port the same? Are the username and password inside the adapter configuration the same?

huangjun0210 commented 1 year ago

What does the adapter's configuration file look like? Are the adapter and SCOW on the same node?

huangjun0210 commented 1 year ago

The error is this one: [screenshot]

ZYFno996 commented 1 year ago

The management node manage01 hosts mariadb, LDAP, slurmctld, scow-slurm-adapter, and SCOW. The adapter lives at /root/scow-slurm-adapter and SCOW at /root/scow. The cluster "cluster" contains the login node login01 and the compute node compute01.

Adapter configuration config.yaml:

mysql:
  host: 127.0.0.1
  port: 3306
  user: root
  dbname: slurm_acct_db
  password: "81SLURM@@rabGTjN7"
  clustername: cluster
  databaseencode: latin1

service:
  port: 8972

slurm:
  defaultqos: normal

modulepath:
  path: /data/software/module/tools/modules/init/profile.sh
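
As a side note, a minimal sanity check for these settings (a sketch, using the values above) is to query the Slurm accounting database directly from manage01:

# Sketch: confirm the credentials in config.yaml can reach the Slurm accounting database.
mysql -h 127.0.0.1 -P 3306 -uroot -p'81SLURM@@rabGTjN7' -e "USE slurm_acct_db; SHOW TABLES;"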

Slurm configuration slurm.conf:

# slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
################################################
#                   CONTROL                    #
################################################
ClusterName=cluster
SlurmctldHost=manage01
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm
#SlurmdUser=root

################################################
#            LOGGING & OTHER PATHS             #
################################################
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

################################################
#                  ACCOUNTING                  #
################################################
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage0
AccountingStoragePass=/var/run/munge/munge.socket.2    
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd

################################################
#                      JOBS                    #
################################################
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql     
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

################################################
#           SCHEDULING & ALLOCATION            #
################################################
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

################################################
#                    TIMERS                    #
################################################
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

################################################
#                    OTHER                     #
################################################
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SwitchType=switch/none
TaskPlugin=task/affinity

################################################
#                    NODES                     #
################################################
NodeName=manage01 NodeAddr=10.0.0.111 CPUs=4  CoresPerSocket=2 ThreadsPerCore=2 RealMemory=7785 State=UNKNOWN
NodeName=login01 NodeAddr=10.0.0.112  CPUs=2 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3753 Procs=1 State=UNKNOWN
NodeName=compute01 NodeAddr=10.0.0.113  CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=31976 Procs=1 State=UNKNOWN

################################################
#                  PARTITIONS                  #
################################################
PartitionName=compute Nodes=compute01 Default=YES MaxTime=INFINITE State=UP

huangjun0210 commented 1 year ago

In clusters/hpc01.yml, try changing the adapter address from localhost to the node's IP.
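
Since the SCOW services run in Docker containers, localhost inside a container refers to the container itself rather than to manage01, so the adapter running on the host cannot be reached at localhost:8972. A sketch of the change (using manage01's address 10.0.0.111 from the slurm.conf above; substitute whatever IP the adapter actually listens on):

# config/clusters/hpc01.yaml
adapterUrl: 10.0.0.111:8972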

ZYFno996 commented 1 year ago

Thanks, it now starts successfully.