LLNL / UnifyFS

UnifyFS: A file system for burst buffers

Add resource manager support for Flux #798

Closed wangvsa closed 11 months ago

wangvsa commented 11 months ago

Description

Tioga uses Flux to schedule jobs and has only limited support for srun. Tioga also lacks the scontrol command, which we use to retrieve the allocated node list.

This PR adds native support for Flux. It uses flux run to launch clients and servers, and flux resource to retrieve the number of nodes and the node list. PS: flux resource returns a condensed node list, e.g., tioga[3-10, 12, 14]. The existing parse_hostfile() function can't handle this format, so I added some code to parse it manually.
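
To illustrate what parsing the condensed format involves, here is a minimal bash sketch. It is not the code from this PR: expand_nodelist is a hypothetical helper name, and the sketch assumes a single bracket group and does not preserve zero padding in node numbers.

```shell
#!/bin/bash
# Hypothetical helper (not part of UnifyFS): expand a condensed node list
# such as "tioga[3-10, 12, 14]" into one hostname per line.
# Assumptions: at most one bracket group, no zero-padded node numbers.
expand_nodelist() {
    local list="${1// /}"           # drop spaces inside the brackets
    # No brackets: a plain hostname or comma-separated hostnames.
    if [[ "$list" != *"["* ]]; then
        tr ',' '\n' <<< "$list"
        return
    fi
    local prefix="${list%%\[*}"     # text before '['
    local body="${list#*\[}"        # text after '['
    body="${body%%\]*}"             # text before ']'
    local part lo hi i
    local IFS=','                   # split the bracket body on commas
    for part in $body; do
        if [[ "$part" == *-* ]]; then
            # A range such as 3-10: emit every node in it.
            lo="${part%-*}"
            hi="${part#*-}"
            for ((i = 10#$lo; i <= 10#$hi; i++)); do
                echo "${prefix}${i}"
            done
        else
            # A single node number such as 12.
            echo "${prefix}${part}"
        fi
    done
}
```

Calling expand_nodelist "tioga[3-10, 12, 14]" yields ten hostnames, tioga3 through tioga10 followed by tioga12 and tioga14, so the node count can be taken from the number of output lines.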

How Has This Been Tested?

Tested on Tioga with 1, 2, and 4 nodes. Also tested unifyfs-ls, unifyfs-stage, and the stage-in/out features.

TODO

Unlike Slurm, where SLURM_JOBID can be used to detect a Slurm allocation, Flux only sets environment variables such as FLUX_JOB_ID inside each flux run job (a Flux job is similar to a Slurm step). When unifyfs is executed at the batch level, those variables have not been set yet.

A short flux script example:

#!/bin/bash
# flux: -N4 -n256 -t 5m
# flux: --job-name="UnifyFS"
# flux: --queue="pdebug"

export UNIFYFS_LOGIO_SPILL_DIR=/tmp
export UNIFYFS_LOG_DIR=`pwd`
export UNIFYFS_LOG_VERBOSITY=3
export UNIFYFS_MOUNTPOINT=/unifyfs

# Here FLUX_JOB_ID and FLUX_JOB_NNODES are not set

unifyfs start -d --stage-in=`pwd`/manifest-in.txt --share-dir=`pwd`
flux run -n128 -N4 -c1 $UNIFYFS_DIR/libexec/write-static -m /unifyfs -f myTestFile
flux run -n128 -N4 -c1 $UNIFYFS_DIR/libexec/write-gotcha -f workflowTestFile
flux run -n128 -N4 -c1 $UNIFYFS_DIR/libexec/read-gotcha  -f workflowTestFile
flux run -n16 -N4 -c1 unifyfs-stage --parallel --status-file=/tmp/stage-out-status.txt `pwd`/manifest-out.txt 
unifyfs-ls 
unifyfs terminate --cleanup

As a result, I currently use FLUX_EXEC_PATH to determine whether the system uses the Flux scheduler. I feel this is not optimal, but I couldn't figure out a better way.
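
The detection logic described above can be sketched in bash as follows. This only illustrates the idea and is not the code in this PR: detect_resource_manager is a hypothetical name, and checking Flux before Slurm is an assumption.

```shell
#!/bin/bash
# Hypothetical sketch (not the PR's implementation): guess the resource
# manager from environment variables that are visible at batch level.
detect_resource_manager() {
    if [[ -n "${FLUX_EXEC_PATH:-}" ]]; then
        # FLUX_EXEC_PATH is set, so assume a Flux-scheduled system.
        echo "flux"
    elif [[ -n "${SLURM_JOBID:-}" ]]; then
        # SLURM_JOBID identifies a Slurm allocation.
        echo "slurm"
    else
        echo "unknown"
    fi
}
```

A launcher could branch on the result, for example by using flux run when the function prints flux and srun when it prints slurm.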

adammoody commented 11 months ago

Thanks, @wangvsa