aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
203 stars 85 forks

Change slurm exporter to Slinky slurm exporter #492

Open mhuguesaws opened 1 week ago

mhuguesaws commented 1 week ago

SchedMD has released a Slurm exporter as part of the Slinky project: https://github.com/SlinkyProject/slurm-exporter

nghtm commented 1 week ago

The Makefile currently references a Helm resource here, which causes the build to fail on the Slurm controller node. If Helm is a dependency of this project, we still need to evaluate whether the project is suitable for installation on a Slurm controller node.

mhuguesaws commented 1 week ago

You can use make run to run the exporter on localhost. First, make sure the Slurm REST API is installed. The following script installs and configures slurmrestd: https://github.com/aws-samples/aws-parallelcluster-post-install-scripts/blob/main/rest-api/postinstall.sh

Once done, get a token:

export METRICS_TOKEN=$(scontrol token | cut -d '=' -f 2)

Then run the exporter with either of:

make run
go run ./cmd/main.go
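For anyone who wants to script the steps above, here is a minimal sketch. It assumes that scontrol token prints a single line of the form SLURM_JWT=<token>, that slurmrestd is already running, and that the script is sourced from the root of the slurm-exporter checkout. The function names parse_token, fetch_token, and run_exporter are hypothetical helpers, not part of the exporter or of Slurm.

```shell
#!/usr/bin/env bash
# Sketch only: fetch a Slurm JWT and start the exporter locally.
# parse_token, fetch_token and run_exporter are hypothetical helper names.

# Read a line such as "SLURM_JWT=<token>" on stdin and print the token part.
parse_token() {
  cut -d '=' -f 2
}

# Ask Slurm for a token; fail if scontrol is missing or returns nothing.
fetch_token() {
  command -v scontrol >/dev/null 2>&1 || { echo "scontrol not found" >&2; return 1; }
  local token
  token="$(scontrol token | parse_token)"
  [ -n "$token" ] || { echo "empty token from scontrol" >&2; return 1; }
  printf '%s\n' "$token"
}

# Export the token under the name used in this thread, then run the exporter.
run_exporter() {
  METRICS_TOKEN="$(fetch_token)" || return 1
  export METRICS_TOKEN
  make run   # or: go run ./cmd/main.go
}
```

Source the file and call run_exporter from the repository root. Keeping the parsing in its own function makes it easy to check against a sample line without a running cluster.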