Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

scheduler local mariadb fails db creation with hyphen or underscore in cluster name #279

Closed: themorey closed this issue 2 months ago

themorey commented 3 months ago

CC: 8.6.2 Slurm project: 3.0.7

STEPS:

ERRORS:

CC Portal:

CycleCloud Version: 8.6.2-3276
Cluster: jm_slurm_gpu (version 8.6.x)
==============================

Status: Error [Software Configuration] (retrying)
Start Time: 2024-08-29T18:02:45.519Z

Description: Unable to execute command `/opt/cycle/jetpack/system/bootstrap/azure-slurm-install/start-services.sh scheduler >> /opt/cycle/jetpack/system/bootstrap/azure-slurm-install/start-services.log 2>&1` (exit code 2)

Detail: 
STDOUT: 
STDERR:
EXCEPTION: ruby_block[defer_block_776dd7] (slurm::delayed_services line 7) had an error: Mixlib::ShellOut::ShellCommandFailed: execute[delayed_start_of_services] (slurm::delayed_services line 12) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '2'

Affected Nodes (1):
---
Node Name: scheduler
Hostname: jm-slurm-gpu-scheduler
IP Address: 10.10.0.5
Azure Resource ID: /subscriptions/1b3b5982-af09-406c-892a-e82c11b5cb9a/resourceGroups/jm_slurm_gpu-MVSGMMTCGA4DSLJVMUZDALJUHA/providers/Microsoft.Compute/virtualMachines/scheduler-GYZWCMLGGUYTCLJZMZRWGLJUGQ
Azure VM ID: 5e1d6f3d-f1c7-48df-8ab3-4849759b31a7
Cluster-Init: slurm:default:3.0.7, slurm:scheduler:3.0.7
Node ID: 1a8033ce-0afd-4979-bc73-6955d509a1da

slurmdbd.log:

[jmorey@jm-slurm-gpu-scheduler ~]$ sudo cat /var/log/slurmctld/slurmdbd.log
[2024-08-29T14:05:07.702] error: Unable to open pidfile `/var/run/slurmdbd.pid': Permission denied
[2024-08-29T14:05:07.702] Not running as root. Can't drop supplementary groups
[2024-08-29T14:05:07.709] fatal: mysql_query failed: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '-slurm-gpu_acct_db' at line 1
create database jm-slurm-gpu_acct_db

Manual attempt to create the database in MariaDB:

MariaDB [(none)]> create database jm-slurm-gpu_acct_db;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '-slurm-gpu_acct_db' at line 1

aditigaur4 commented 3 months ago

I'll send a PR to fix this. In the meantime, you can use the Database Name parameter in the CC UI, which tells Slurm the database name by setting the storageLoc parameter in slurmdbd.conf. When it is not set, the name defaults to "clustername_acct_db". The PR will fix that default case.
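
For anyone picking a value for the Database Name parameter, here is a minimal Python sketch of a pre-flight check. It is not part of the cyclecloud-slurm project: the function names are hypothetical, and the rule it applies (ASCII letters, digits, "_" and "$", and not digits only) is a deliberately conservative reading of what MariaDB accepts as an unquoted identifier.

```python
import re

# Conservative assumption: only ASCII letters, digits, "_" and "$" are
# treated as safe in an unquoted MariaDB database name.
_UNQUOTED_IDENT = re.compile(r"^[0-9A-Za-z_$]+$")


def is_safe_unquoted_db_name(name: str) -> bool:
    """Return True if `name` should work in CREATE DATABASE without quoting."""
    # An identifier made up solely of digits is also rejected.
    return bool(_UNQUOTED_IDENT.match(name)) and not name.isdigit()


def default_db_name(name_prefix: str) -> str:
    """Mimic the default described above: '<clustername>_acct_db' (hypothetical helper)."""
    return f"{name_prefix}_acct_db"


if __name__ == "__main__":
    # "jm-slurm-gpu" is the hyphenated form seen in slurmdbd.log.
    for prefix in ("jm-slurm-gpu", "jmslurmgpu"):
        db_name = default_db_name(prefix)
        verdict = "ok" if is_safe_unquoted_db_name(db_name) else "set Database Name explicitly"
        print(f"{db_name}: {verdict}")
```

If the default name fails the check, entering a name such as `jm_slurm_gpu_acct_db` in the Database Name field should avoid the 1064 error shown in the log.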


aditigaur4 commented 3 months ago

FYI, the reason we convert underscores in cluster names to hyphens is that the cluster name is used to create node names, which are then used to create hostnames, and hostnames cannot contain the "_" character. We now also use the cluster name to create a DB name, but database naming has the opposite restriction (an unquoted database name cannot contain hyphens). I have a PR that fixes this. In a future release, we will make the cluster naming convention a bit stricter to avoid this confusion.
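
To make the two opposing rules concrete, here is a rough Python sketch of the conversions described above. It is an illustration only, not the code in the PR: the helper names `to_hostname_prefix` and `to_acct_db_name` are made up, and the `_acct_db` suffix is taken from the default mentioned earlier in this thread.

```python
def to_hostname_prefix(cluster_name: str) -> str:
    """Hostnames cannot contain '_', so underscores become hyphens."""
    return cluster_name.replace("_", "-")


def to_acct_db_name(cluster_name: str) -> str:
    """Unquoted database names cannot contain '-', so hyphens become underscores."""
    return cluster_name.replace("-", "_") + "_acct_db"


if __name__ == "__main__":
    cluster = "jm_slurm_gpu"  # cluster name from the report above
    print(to_hostname_prefix(cluster))  # used for node names and hostnames
    print(to_acct_db_name(cluster))     # used for the accounting database
```

Deriving both names from the original cluster name, rather than building the database name from the already-hyphenated form, keeps each name valid in its own context.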

aditigaur4 commented 2 months ago

This is fixed in the master branch, and the workaround is described above.