ecohealthalliance / slurm-containerized-hpc-environment

The goal of this project is to outline strategies for integrating the Simple Linux Utility for Resource Management (SLURM) into our existing High-Performance Computing (HPC) platform.

Change architecture #2

Closed espirado closed 1 year ago

espirado commented 1 year ago

The initial approach does not seem to communicate. Using the architecture below for communication within the cluster:

[image: 38642211-67a7e1a4-3da7-11e8-85a9-3394ad3c8cb6]

External view: external.pdf

espirado commented 1 year ago

Using FROM ghcr.io/ecohealthalliance/reservoir:gpu to build the base image for the controller and worker nodes. This ensures that all packages are available on the container nodes.

espirado commented 1 year ago

3

espirado commented 1 year ago

controller | sacctmgr: error: Sending PersistInit msg: Connection refused
worker01 | .
controller exited with code 1
worker02 | .
worker01 | .
worker02 | .
controller | /home/worker/.ssh/id_rsa already exists.
controller | Overwrite (y/n)?
controller | .ssh/
controller | .ssh/id_rsa.pub
controller | .ssh/id_rsa
controller | .ssh/authorized_keys
controller | .ssh/config
controller | /slurm-23.02.3
controller | munge.key
controller | MUNGE:AwQFAACj/AzNLwyIIz/xEQqtMKzd1wMERCZYHeCMGJiUWMWS3kyOXxJjSThqOfNkhIym9K1SLJjdYd+TNaJTZzrD4JMJDjC0OpNMsjdKiz1zbh+ATbEpoyU2aBkgfTQqUONvz+8=:
controller | STATUS: Success (0)
controller | ENCODE_HOST: controller.local.dev (192.168.80.2)
controller | ENCODE_TIME: 2023-07-29 02:42:16 -0400 (1690612936)
controller | DECODE_TIME: 2023-07-29 02:42:16 -0400 (1690612936)
controller | TTL: 300
controller | CIPHER: aes128 (4)
controller | MAC: sha256 (5)
controller | ZIP: none (0)
controller | UID: root (0)
controller | GID: root (0)
controller | LENGTH: 0
controller |
controller |
controller | 2023-07-29 02:42:16 Spawning 1 thread for encoding
controller | 2023-07-29 02:42:16 Processing credentials for 1 second
worker01 | .
worker02 | .
controller | 2023-07-29 02:42:17 Processed 6092 credentials in 1.000s (6089 creds/sec)
controller | cheking for slurmdbd.conf
controller | ### generate slurm.conf ###
controller | sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:database.local.dev:6819: Connection refused
controller | sacctmgr: error: Sending PersistInit msg: Connection refused

Still experiencing database connection issues from the controller host. Working on a fix.

espirado commented 1 year ago

Reservoir runs Slurm, but the database keeps shutting down, which closes the connection between the workers and the controller. Logs point to the slurmdbd and MySQL instance startup:

[2023-07-31T07:12:23.627] error: mysql_real_connect failed: 2002 Can't connect to server on 'database.local.dev' (115)
[2023-07-31T07:12:23.628] error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
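Both errors in this thread ("Connection refused" on database.local.dev:6819 and mysql_real_connect failing with 2002) are what you would expect when a service starts before the one it depends on is listening. One possible mitigation, not confirmed as the fix used in this repo, is to block in the container entrypoint until the dependency accepts TCP connections. Below is a minimal sketch; the function name wait_for_port is hypothetical, and port 3306 in the usage comments is only the MySQL default, assumed here rather than taken from the project's config.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout`
    seconds have elapsed. This only checks that something is
    listening; it does not verify that the service is fully ready.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused,
            # name resolution failure, timeout) when the port is not reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (hostnames from the logs above; 3306 is an assumed default):
#   wait_for_port("database.local.dev", 3306)   # before starting slurmdbd
#   wait_for_port("database.local.dev", 6819)   # before running sacctmgr
```

A check like this would run on the database host before slurmdbd starts, and on the controller before sacctmgr is called. An open port does not guarantee that MySQL has finished initializing, so slurmdbd's own "Trying again in 5 seconds" retry is still useful.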

espirado commented 1 year ago

We adopted a replica-based setup, running a single controller/database with multiple worker nodes.