Change naming of output and error files of HPC jobs

marangiop commented 3 years ago

Is your feature request related to a problem? Please describe. Croupier/Cloudify automatically assigns a random ID to each job once the job starts executing through the run_jobs workflow. This is visible under the "Deployment Outputs/Capabilities" tab when you click on a Deployment. The ID consists of the string "atos" followed by six random letters and digits.

This becomes problematic when you want to inspect the .err and .out files associated with each job of the workflow. The install workflow creates a directory for a given workflow inside the target HPC cluster. When the run_jobs workflow is started, that directory is populated with a .script file for each job, such as the following:


#!/bin/bash -l

#SBATCH -N 1
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:40:00
#SBATCH -p thinnodes
#SBATCH -e atos57xwq9.err
#SBATCH -o atos57xwq9.out

# DYNAMIC VARIABLES

cd $CURRENT_WORKDIR

bash <(sed -n '10,$p' $STORE/a3_climate_model_workflow/SEASONAL_SCRIPTS/CFS_SCRIPTS/download_CFSR.in) 00

Each job is executed using the .script file that Croupier created for it inside the HPC system, and the information generated during execution is logged to two files, a .err file and a .out file (see the #SBATCH -e and -o flags inside the .script file). These files are automatically named after the ID that Croupier assigned to the job, as explained above. If your workflow contains several jobs, inspecting the log files becomes quite painful, because the names of the .err and .out files do not make clear which job in the blueprint they belong to.

Describe the solution you'd like. Croupier should assign a meaningful name to the .err and .out files. This could be the name of the individual job as written in the blueprint.

For example, if the job in the blueprint is named "job_1", then the .err and .out files created on the HPC system should be job_1.err and job_1.out.

jramosrivas commented 3 years ago

The names of the error and output files are controlled by the #SBATCH -e and #SBATCH -o options in the launch script. Those options are set in croupier here

As you can see, it uses the job_name by default if the stderr_file or stdout_file options are not present in job_options.

There are 2 possible solutions:

1. Change the default name assigned to the job by croupier.
2. Leave it to the user to use a different error and output filename by setting the appropriate job option in their deployment inputs definition.
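The fallback described above can be sketched roughly as follows. This is an illustration only, not croupier's actual code: the helper name sbatch_log_directives is hypothetical, and it assumes that stderr_file and stdout_file hold complete file names including the extension.

```python
def sbatch_log_directives(job_name, job_options):
    """Build the #SBATCH -e / -o lines for one job.

    Hypothetical helper: mirrors the behaviour described in this thread,
    where the job name is used unless the user sets stderr_file or
    stdout_file in job_options.
    """
    # Assumption: the option values are full file names (with extension).
    stderr_file = job_options.get("stderr_file", job_name + ".err")
    stdout_file = job_options.get("stdout_file", job_name + ".out")
    return [
        "#SBATCH -e " + stderr_file,
        "#SBATCH -o " + stdout_file,
    ]


# Default behaviour: file names follow the generated job name.
default_lines = sbatch_log_directives("atos57xwq9", {})

# Option 2: the user overrides the names through job_options.
custom_lines = sbatch_log_directives(
    "atos57xwq9",
    {"stderr_file": "job_1.err", "stdout_file": "job_1.out"},
)
```

With option 2 no change to croupier is needed, but every job in the deployment inputs has to carry its own file names.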

jramosrivas commented 3 years ago

To implement option 1, change this line

marangiop commented 3 years ago

Thank you for the suggestions! @jramosrivas

I did some quick debugging in PyCharm and I can see that the second part of the job name (instance_components[-1]) is based on a variable instance_components that holds the result of instance.id.split('_'). But as you can see, the content of instance.id is set somewhere else, before we reach this line inside workflows.py

marangiop commented 3 years ago

But yes, you are right. We need a way to make line 75 set self.name (i.e. the job name) to the same name as the one stated in the blueprint YAML file

marangiop commented 3 years ago

Solved.

self.name = '_'.join(instance_components[:-1])

gives the same job name as written in the blueprint yaml file
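A minimal sketch of why this works, assuming the instance ID has the form <node name>_<random suffix>. The ID job_1_57xwq9 below is made up for illustration, and the function name is hypothetical, not part of croupier.

```python
def blueprint_job_name(instance_id):
    """Recover the blueprint node name from an instance ID.

    Assumption: the ID is '<node name>_<random suffix>', so everything
    before the last underscore is the name written in the blueprint.
    """
    instance_components = instance_id.split('_')
    # Drop only the last component (the random suffix) and re-join the
    # rest, so node names that contain underscores are preserved.
    return '_'.join(instance_components[:-1])


# Made-up instance ID for a blueprint node called "job_1".
name = blueprint_job_name("job_1_57xwq9")
```

Joining all components except the last one, rather than taking only the first, keeps names such as job_1 intact even though they contain an underscore themselves.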

marangiop commented 3 years ago

This seems to work in a local tox test, but not from the Cloudify GUI (after uploading the new croupier .wgn file containing the change in line 75 of workflows.py)

marangiop commented 3 years ago

Solved by introducing the change in the permedcoe branch

ari-apc-lab / croupier

Change naming of output and error files of HPC jobs #1