MSO4SC / cloudify-hpc-plugin

Plugin to allow Cloudify to deploy and orchestrate HPC resources

JobTypes and Types layered relationships. Could it be more flexible? #74

Open victorsndvg opened 6 years ago

victorsndvg commented 6 years ago

JobTypes can currently describe how the applications are called depending on the underlying (software) layers used for running them.

Based on our experience, the layers involved are the workload manager (e.g. Slurm or Torque), the process manager (e.g. mpirun), an optional container runtime (e.g. Singularity) and, finally, the call to the software itself.

Currently, workload manager types like Slurm or Torque describe how a batch script is submitted. In addition, the SingularityJob type describes how to call a Singularity container inside the workload manager. In particular, a SingularityJob always uses mpirun as the process manager, but I think that is not always necessary (e.g. a sequential Singularity job). Finally, the particular software call is expressed through the command job option.

In my opinion, the top (workload manager) and bottom ("software call") layers of the hierarchy are mandatory. The other layers could be optional and interrelated, so that the requirements and execution mode of every application can be expressed in a more flexible way. As a suggestion, a contained_in relationship could express this kind of hierarchical structure.

As a simplified example, several combinations could be acceptable: workload manager + software, workload manager + MPI + software, workload manager + container + software, or workload manager + MPI + container + software.

This flexibility could be expressed by a blueprint like the following example:

    computational_resources:
        type: hpc.nodes.WorkloadManager
        properties:
           ...

    mpi:
        type: hpc.nodes.MPIJob
        properties:
            job_options: 
                modules:
                    - openmpi
                    ...
                flags:
                    - '--mca orte_tmpdir_base /tmp '
                    - '--mca pmix_server_usock_connections 1'
        relationships:
            - type: contained_in  #job_managed_by_wm
              target: computational_resources

    container:
        type: hpc.nodes.SingularityJob
        properties:
            job_options: 
                modules:
                    - singularity
                    ...
        relationships:
            - type: contained_in 
              target: mpi

    software:
        type: hpc.nodes.ShellJob
        properties:
            job_options: 
                command: 'flow inputfile outputfile'
        relationships:
            - type: contained_in 
              target: container
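
For illustration, assuming the layers simply compose from top to bottom, the stack above might translate into an sbatch script along these lines (a sketch only; the container image name is a placeholder, since it is not part of the snippet above):

    #!/bin/bash
    #SBATCH ...                          # resource options coming from the WorkloadManager layer

    module load openmpi singularity      # modules requested by the mpi and container nodes

    # mpi flags, then the container call, then the final software command
    mpirun --mca orte_tmpdir_base /tmp \
           --mca pmix_server_usock_connections 1 \
           singularity exec <image> flow inputfile outputfile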

Even if we consider other container solutions (Docker) or single-node parallel jobs, we can interchange the MPI and container layers through the contained_in relationships, avoiding in some cases the need to match the MPI vendor and version inside and outside the container, as sketched below.
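
A minimal sketch of what the swapped ordering could look like, reusing the node names and options from the example above (this is an illustration of the idea, not an existing type layout):

    container:
        type: hpc.nodes.SingularityJob   # or a hypothetical DockerJob type
        properties:
            job_options:
                modules:
                    - singularity
        relationships:
            - type: contained_in
              target: computational_resources

    mpi:
        type: hpc.nodes.MPIJob
        properties:
            job_options: {}              # MPI is provided inside the container image here
        relationships:
            - type: contained_in
              target: container

    software:
        type: hpc.nodes.ShellJob
        properties:
            job_options:
                command: 'flow inputfile outputfile'
        relationships:
            - type: contained_in
              target: mpi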

Something to think about.

emepetres commented 5 years ago

@victorsndvg it is not clear to me how the mpi type could exist standalone. I mean: how would it be used with an SRUN/SBATCH job, without a container?

victorsndvg commented 5 years ago

Let's take as a starting point that the workload manager and the software are mandatory. What I mean is not the MPI library, but the MPI process manager: one can use mpirun or mpiexec (from Intel MPI, OpenMPI, MPICH, etc.) or srun (from Slurm) as the process manager inside an sbatch script, or no process manager at all.

The orchestrator can take advantage of containers to provide portable workflows, but it can also use locally installed software from the HPC modules system.

A sequential job does not need to spawn processes (it does not need mpirun). The following example illustrates it:

...
#SBATCH -N 1 
#SBATCH -n 1
#SBATCH --ntasks-per-node=1

singularity exec opm.simg flow input output

A parallel MPI job needs to be launched by a process manager (e.g. mpirun). The tool to run can be containerized or not. The following example illustrates it:

...
#SBATCH -N 2
#SBATCH -n 48
#SBATCH --ntasks-per-node=24

module load gcc openmpi paraview
mpirun paraview --serve inputdata

In addition, another important point for the future is support for other container technologies. The usage of mpirun with a Docker container is different: with Singularity it is mpirun singularity ..., while with Docker it is docker run mpirun ...; see the sketch below.
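
For concreteness, a rough sketch of the two invocation patterns (image names and process counts are only placeholders taken from the examples above):

    # Singularity: the process manager runs on the host and launches the container for each rank
    mpirun -n 48 singularity exec opm.simg flow input output

    # Docker (hypothetical pattern): the process manager is invoked inside the container
    docker run <image> mpirun -n 48 flow input output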

emepetres commented 5 years ago

How would the mpi flags be attached to an sbatch script?

victorsndvg commented 5 years ago

I'm not sure I understand the question. I assume you mean the blueprint example from the first post.

This example:

    mpi:
        type: hpc.nodes.MPIJob
        properties:
            job_options: 
                modules:
                    - openmpi
                    ...
                flags:
                    - '--mca orte_tmpdir_base /tmp '
                    - '--mca pmix_server_usock_connections 1'
        relationships:
            - type: contained_in  #job_managed_by_wm
              target: computational_resources

Should be translated like this (e.g. inside an sbatch script):

module load openmpi
mpirun --mca orte_tmpdir_base /tmp --mca pmix_server_usock_connections 1 ...

This particular example avoids the dependence of the MPI containers on the /scratch directory at FT2.

Instead of adding a generic flags job option, it could be mapped to more specific options like tmpdir, etc. However, every vendor has its own flags, and it would be hard work to identify and map all of them. See the sketch below.
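
As an illustration only, such a mapping could look like this in the blueprint (the tmpdir key is hypothetical, not an existing job_options field):

    mpi:
        type: hpc.nodes.MPIJob
        properties:
            job_options:
                modules:
                    - openmpi
                # hypothetical vendor-neutral option that the plugin would translate,
                # e.g. into '--mca orte_tmpdir_base /tmp' for OpenMPI
                tmpdir: /tmp
        relationships:
            - type: contained_in
              target: computational_resources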