Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

MySQL certificate used for SlurmDBD is changed again. #212

Closed vinil-v closed 5 months ago

vinil-v commented 6 months ago

Issue Description:

A customer's Slurm cluster failed to start when they tried to enable the Job Accounting feature in CycleCloud 8.5 with Slurm 3.0.5. The environment uses a custom OS image based on CentOS 7 and an Azure Managed MySQL Flexible server as the SlurmDBD database. The failure occurs when the start-services.sh script runs for the scheduler and exits with code 2.

Error Details:

Unable to execute command `/opt/cycle/jetpack/system/bootstrap/azure-slurm-install/start-services.sh scheduler >> /opt/cycle/jetpack/system/bootstrap/azure-slurm-install/start-services.log 2>&1` (exit code 2)

Software Configuration:

Error Output:

/opt/cycle/jetpack/system/bootstrap/azure-slurm-install/start-services.log:

[2024-02-16T03:16:16.311] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-02-16T03:16:16.311] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-02-16T03:16:16.311] error: MPI: Cannot create context for mpi/pmix_v4
...
[2024-02-16T03:16:16.313] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.

Resolution:

The issue was traced to the MySQL certificate (AzureCA.pem) provided in the cyclecloud-slurm project: comparing it with the public certificate DigiCertGlobalRootCA.crt.pem showed that the two differ. The problem was resolved by setting the SSL certificate URL (slurm.accounting.certificate_url) to https://dl.cacerts.digicert.com/DigiCertGlobalRootCA.crt.pem.
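
The comparison described above can be reproduced with a short script. This is a minimal sketch, not the procedure actually used in this investigation: it assumes each file holds exactly one PEM certificate, and the function names `pem_fingerprint` and `same_certificate` are made up for illustration. It compares SHA-256 fingerprints of the DER-encoded certificates, so differences in whitespace or line endings do not matter.

```python
import hashlib
import ssl


def pem_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM string holding ONE certificate."""
    # Strip surrounding whitespace first: PEM_cert_to_DER_cert expects the
    # text to start with the BEGIN CERTIFICATE header.
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_certificate(pem_a: str, pem_b: str) -> bool:
    """True if both PEM strings encode byte-identical certificates."""
    return pem_fingerprint(pem_a) == pem_fingerprint(pem_b)
```

To use it, read the bundled AzureCA.pem and a downloaded copy of DigiCertGlobalRootCA.crt.pem into strings (the local paths depend on your installation) and pass both to `same_certificate`. A result of `False` matches the mismatch reported here.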

Lab Test Results:

[root@ip-0ADE011A ~]# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)
[root@ip-0ADE011A ~]# sinfo --version
slurm 23.02.6
[root@ip-0ADE011A ~]# jetpack config cyclecloud.cluster_init_specs --json | egrep 'project\"|version'
            "project": "slurm",
            "version": "3.0.5"
            "project": "slurm",
            "version": "3.0.5"
[root@ip-0ADE011A ~]# jetpack config slurm.accounting.certificate_url
https://dl.cacerts.digicert.com/DigiCertGlobalRootCA.crt.pem
[root@ip-0ADE011A ~]# sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

Action Required:

Please update the certificate in the cyclecloud-slurm project to resolve this issue.

aditigaur4 commented 5 months ago

When we create a CC Slurm cluster, there is a tab in the accounting field to add a link to the SSL cert. Does adding the link https://dl.cacerts.digicert.com/DigiCertGlobalRootCA.crt.pem there work?

aditigaur4 commented 5 months ago

There is a tab in the UI to add a link:

[Screenshot 2024-02-29 100237]

vinil-v commented 5 months ago

@aditigaur4 - Confirmed, setting the URL there works as outlined in the resolution section. However, I'd like to highlight that the AzureCA.pem file within the cyclecloud-slurm project is obsolete and no longer works, so it needs to be updated. It would also be advisable to establish a mechanism for updating or rotating the MySQL certificate within the cyclecloud-slurm project.
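
One possible shape for such an update mechanism is sketched below. It is purely illustrative and is not how cyclecloud-slurm handles the certificate: the function names `fetch_url` and `refresh_certificate`, the destination path, and the idea of running it periodically are all assumptions. The sketch downloads the PEM file, checks only its PEM framing and base64 encoding (not that it is a valid X.509 certificate), and atomically replaces the local copy when the content has changed.

```python
import os
import ssl
import tempfile
import urllib.request
from typing import Callable


def fetch_url(url: str) -> str:
    """Download a PEM file over HTTPS and return its text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("ascii")


def refresh_certificate(url: str, dest_path: str,
                        fetch: Callable[[str], str] = fetch_url) -> bool:
    """Replace dest_path with the certificate at url if the content differs.

    Returns True when the file was written, False when it was already current.
    """
    new_pem = fetch(url).strip() + "\n"
    # Raises ValueError if the download is not PEM-framed, for example an
    # HTML error page. This checks framing and base64 only.
    ssl.PEM_cert_to_DER_cert(new_pem.strip())

    try:
        with open(dest_path, "r", encoding="ascii") as fh:
            if fh.read().strip() + "\n" == new_pem:
                return False
    except FileNotFoundError:
        pass  # No local copy yet, so write one.

    # Write to a temp file in the same directory, then rename it into place
    # so readers never see a half-written certificate.
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".pem.tmp")
    try:
        with os.fdopen(fd, "w", encoding="ascii") as fh:
            fh.write(new_pem)
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

Called with the DigiCert URL from the resolution section and a local destination path of your choosing, the function returns `True` only when it actually rewrote the file, which a caller could use to decide whether dependent services need a restart.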

aditigaur4 commented 5 months ago

We will keep that in mind for the CC redesign, but for now the only advisable solution is to use that URL.