JohnTheBlindMilkman / SlurmJobMonitor

MIT License

Slurm Job Monitor

Slurm Job Monitor (SJM) is a script-like program whose sole purpose is to monitor jobs running on the SLURM scheduler at GSI. Currently the program only supports monitoring the standard DST analysis at HADES. If you wish to expand this functionality, submit a pull request (be warned: this will need a lot of code restructuring).

Dependencies

This program utilises the following:

Installation

Clone this repository and, once you're inside the directory, run:

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

and pray for successful compilation. If you wish to generate the documentation, add the -DSJM_ENABLE_DOXYGEN=ON flag before the .. (make sure you have doxygen installed on your machine).

If you want to use a debugger, because something is broken or you broke something, change -DCMAKE_BUILD_TYPE=Release to -DCMAKE_BUILD_TYPE=Debug.

Usage

For the script to work, it needs access to the output folder where SLURM saves the log files of the finished jobs. While a job is still being executed, its file exists with the .out extension instead of .log (the contents are the same). These files are what SJM looks for and where it takes its information from.
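As an illustration of the file lookup described above, here is a minimal Python sketch that splits the contents of an output folder into running (.out) and finished (.log) jobs. It is not taken from the SJM sources; the function name collect_job_files and the returned dictionary layout are assumptions made for this example.

```python
from pathlib import Path


def collect_job_files(output_dir):
    """Split SLURM output files into running (.out) and finished (.log) jobs.

    Hypothetical helper for illustration only, not part of SJM.
    """
    directory = Path(output_dir)
    files = {"running": [], "finished": []}
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        if path.suffix == ".out":
            files["running"].append(path)   # job still being executed
        elif path.suffix == ".log":
            files["finished"].append(path)  # job has ended
    return files
```

Any other file types in the folder are ignored, so the sketch only depends on the two extensions mentioned above.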

Currently there are two options for running this program:

To run the program, execute the binary file ./monintor with 3 mandatory arguments:

Important note: do not set the time between refreshes too short. The program takes some time to read all the files (especially when there are a lot of them). Moreover, if the refresh is too frequent, it messes up the remaining time calculation. At the time of writing, 30 s seems to be the smallest value it will work with.
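To see why a very short time between refreshes can upset a remaining time estimate, consider the following Python sketch. It assumes the estimate is a linear extrapolation from the progress made between two consecutive refreshes; this is an assumption for illustration, and the actual calculation in SJM may differ. The function name estimate_remaining_seconds is hypothetical.

```python
def estimate_remaining_seconds(prev_percent, curr_percent, interval_s):
    """Extrapolate the time to 100 % from the progress made in one refresh.

    ASSUMPTION: linear extrapolation; not necessarily what SJM does.
    Returns None when no progress was observed between the two refreshes.
    """
    progress = curr_percent - prev_percent
    if progress <= 0:
        # Nothing changed since the last refresh: the rate is unknown.
        return None
    return (100.0 - curr_percent) * interval_s / progress


# A job that went from 40 % to 50 % in 30 s needs about 150 s more.
print(estimate_remaining_seconds(40, 50, 30))
# With a short interval the percentage often does not change at all.
print(estimate_remaining_seconds(50, 50, 5))
```

With this kind of estimate, a short interval frequently sees zero progress, and a single-percent step dominates the result, so the number jumps around between refreshes.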

The SJM has three modes of printing the information:

Limitations

  1. Your job has to use the standard HTool analysis percentage print.
  2. Your jobs have to end with the standard "Finished DST analysis".
  3. SLURM output has to be all in one directory.
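The first two limitations exist because the monitor has to recognise progress and completion from the text of the log alone. A rough Python sketch of such a check is given below. The end marker is the string quoted above; the regular expression for the percentage print is an assumption, since the exact HTool output format is not shown here, and parse_job_log is a hypothetical name rather than an SJM function.

```python
import re

FINISHED_MARKER = "Finished DST analysis"   # end-of-job line quoted in limitation 2
PERCENT_PATTERN = re.compile(r"(\d+)\s*%")  # ASSUMED shape of the progress print


def parse_job_log(text):
    """Return (last reported percentage or None, whether the job finished)."""
    matches = PERCENT_PATTERN.findall(text)
    percent = int(matches[-1]) if matches else None  # most recent progress value
    finished = FINISHED_MARKER in text
    return percent, finished
```

A job that prints its progress in a different format, or ends with a different message, would be reported as having no progress or as never finishing, which is why the limitations above matter.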

Final Note

Please do not overuse this script. It utilises a lot of practices which any IT specialist at a large batch farm would advise against.