Azure / az-hop

The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
https://azure.github.io/az-hop/
MIT License

`sacct` does not work on `ondemand` with cc-slurm 3.x #1857

Closed: ltalirz closed this issue 7 months ago

ltalirz commented 7 months ago

Version

1.0.40

In what area(s)?

/area administration /area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling /area monitoring /area ood /area remote-visualization /area user-management

Expected Behavior

sacct should work on the ondemand node

Actual Behavior

$ sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

Steps to Reproduce the Problem

Install az-hop with cc-slurm 3.x and Slurm 23.x, then run sacct on the ondemand node.

Solution

The problem is that /anfhome/slurm/config/accounting.conf is configured to point to localhost:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost="localhost"
AccountingStorageTRES=gres/gpu

However, slurmdbd only runs on the scheduler node (sacct works fine there).
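A quick way to confirm this from any node is to test whether something is listening on the slurmdbd port. The following is only a minimal sketch: the function name `slurmdbd_reachable` is made up for illustration, and the port 6819 is taken from the error message above. It should return False on the ondemand node for localhost and True when pointed at the scheduler node.

```python
import socket


def slurmdbd_reachable(host: str, port: int = 6819, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, not part of az-hop or Slurm. 6819 is the port
    shown in the sacct error message in this issue.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the host cannot be resolved.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# On the ondemand node this mirrors what sacct attempts with the current config.
print(slurmdbd_reachable("localhost"))
```

This only checks that the TCP port accepts connections; it does not verify that the listener is actually slurmdbd.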

To fix, change localhost to {{ scheduler.name }} in the config file. (There used to be logic for this in the slurm.conf.j2 template, but it seems this is no longer used with cc-slurm 3.x.)
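As a rough illustration of the manual fix, here is a small sketch that rewrites the AccountingStorageHost line in the text of accounting.conf. The function name and the hostname "scheduler" are assumptions for the example; the real value is whatever {{ scheduler.name }} resolves to in your deployment. This is not how az-hop or cyclecloud-slurm generate the file.

```python
import re


def point_accounting_at_scheduler(conf_text: str, scheduler_host: str) -> str:
    """Return conf_text with AccountingStorageHost set to scheduler_host.

    Hypothetical helper for illustration; all other lines are left untouched.
    """
    pattern = re.compile(
        r'^([ \t]*AccountingStorageHost[ \t]*=[ \t]*).*$', re.MULTILINE
    )
    # Keep the "key=" prefix as written and replace only the value,
    # quoted the same way as in the original file.
    return pattern.sub(lambda m: f'{m.group(1)}"{scheduler_host}"', conf_text)


original = (
    'AccountingStorageType=accounting_storage/slurmdbd\n'
    'AccountingStorageHost="localhost"\n'
    'AccountingStorageTRES=gres/gpu\n'
)
# "scheduler" is a placeholder hostname, not a value confirmed by this issue.
print(point_accounting_at_scheduler(original, "scheduler"))
```

To apply it, one would read /anfhome/slurm/config/accounting.conf, pass its contents through the function, and write the result back. Note that the file may be regenerated by the deployment tooling, so a manual edit might not persist.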

xpillons commented 7 months ago

I've opened a bug in CC: https://github.com/Azure/cyclecloud-slurm/issues/215. Working on a workaround.