geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

sbatch_conditional.bash #459

Closed mirzaees closed 3 years ago

mirzaees commented 3 years ago

Hi @Ovec8hkin

I am opening this issue to discuss what we need from sbatch_conditional.bash.

You have seen our regular run files in the run_files folder, each containing several tasks. We create jobs for these run files (batch files) and then submit them with your submit_jobs.bash. Currently it works great and takes care of everything, including resubmission after failure, the number of active jobs, and so on.

The problem is that several things are hard-coded, which makes it work only for this particular setup of folders and jobs. Even the run file names are hard-coded; see for example 'step_io_load_list' in submit_jobs.bash.

What I need is some added capability to work with a single new batch file containing several tasks. The name of the batch file could be arbitrary. We need to be able to submit the jobs corresponding to a batch file. Those jobs are written beforehand (meaning there is no need to think about memory, walltime, ...) and are ready to be submitted.

For example: I have a run file named: run_arbitrary_name

There are multiple jobs created for this as:

run_arbitrary_name_0.job 
run_arbitrary_name_1.job
run_arbitrary_name_2.job 
...

We need a script like sbatch_conditional.bash, or submit_jobs.bash itself, to be able to run these jobs. The input options would be as follows:

sbatch_conditional.bash --pattern run_arbitrary_name --step_max_tasks 1000 --total_max_tasks 3000

step_max_tasks and total_max_tasks should have defaults and be optional. The script looks for the jobs matching the given pattern, submits them, and waits for them to finish. All the checking for the number of tasks, failures, etc. would be the same as before.

You have written a similar function in job_submission.py: if I call job_submission.py with a batch file, it can submit it as one or several jobs depending on the number of tasks. The only difference here would be that I don't want you to write the jobs, only to find and submit them.

Also, the working directory should be wherever run_arbitrary_name (the pattern) exists, not depending on $SCRATCH or $PROJECTNAME.
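
To make the intent concrete, here is a rough sketch of the behavior I have in mind (option names as above; using squeue job counts as a stand-in for the task checks is just my assumption, the real checks would stay as they are in submit_jobs.bash):

```bash
#!/usr/bin/env bash
# Sketch only: submit pre-written <pattern>_*.job files, throttling on queue
# counts. Option names follow the proposal above; using squeue job counts as
# a stand-in for task counts is an assumption.

step_max_tasks=1000    # default, optional override via --step_max_tasks
total_max_tasks=3000   # default, optional override via --total_max_tasks

while [[ $# -gt 0 ]]; do
    case $1 in
        --pattern)         pattern=$2;         shift 2 ;;
        --step_max_tasks)  step_max_tasks=$2;  shift 2 ;;
        --total_max_tasks) total_max_tasks=$2; shift 2 ;;
        *) echo "unknown option: $1" >&2; exit 1 ;;
    esac
done

# work where the run file lives, independent of $SCRATCH or $PROJECTNAME
cd "$(dirname "$pattern")" || exit 1
name=$(basename "$pattern")

for job_file in "${name}"_*.job; do
    # wait while this step, or the user's queue overall, is too full
    while (( $(squeue -u "$USER" -h -o "%j" | grep -c "^${name}") >= step_max_tasks )) ||
          (( $(squeue -u "$USER" -h | wc -l) >= total_max_tasks )); do
        sleep 30
    done
    sbatch "$job_file"
done
```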

falkamelung commented 3 years ago

The step_io_load_list assignment then would be in submit_jobs.bash, I suppose.

Please also implement reading of the io_load from `job_defaults.cfg`. For now let's have both: reading from job_defaults.cfg and hardwired. Once everything works I will remove the hardwired lines.

------------------------------------------------------------------------------------------------------
name                               c_walltime  s_walltime  c_memory s_memory  num_threads   io_load
-----------------------------------------------------------------------------------------------------
default                              02:00:00        0       3000       0         2            1

# topsStack

unpack_topo_reference                    0      00:01:00     4000       0         8           0.2
unpack_secondary_slc                     0      00:00:10     4000       0         2            1
average_baseline                         0      00:00:10     1000       0         2            1
extract_burst_overlaps                   0      00:00:10     4000       0         2            1
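
A lookup of the io_load column could be as simple as this sketch (assuming the whitespace-separated layout above, with io_load as the last column):

```bash
# Sketch: read the io_load column for a step from job_defaults.cfg, falling
# back to the "default" row. Assumes the whitespace-separated layout above,
# with io_load as the last column.
get_io_load() {
    local step=$1 cfg=$2 val
    val=$(awk -v s="$step" '$1 == s {print $NF}' "$cfg")
    [[ -z $val ]] && val=$(awk '$1 == "default" {print $NF}' "$cfg")
    echo "$val"
}

# e.g.: IO_LOAD=$(get_io_load unpack_topo_reference minsar/defaults/job_defaults.cfg)
```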

With this, in the final minsar package all parameters will be specified in 3 files: platforms_defaults.cfg, queues.cfg and job_defaults.cfg. I would suggest creating a reader read_config.bash (the code already exists as utils/read_platform_defaults.bash) and using it in all scripts (bash and python). Use caps for the variables that get assigned in *.cfg files (e.g. MAX_JOBS_PER_QUEUE and TOTAL_MAX_TASKS).

read_config.bash should skip the assignment if the variable already exists as an environment variable. That allows trying different values without modifying a *.cfg file.

(In job_submission.py we use get_config_defaults; any suggestion for a common name?)
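
A minimal sketch of what read_config.bash could look like, assuming the file reduces to one header row plus one row per platform (queues.cfg, with several rows per platform, would need the queue name as an additional key):

```bash
#!/usr/bin/env bash
# Sketch of read_config.bash: export each column of the row matching
# $PLATFORM_NAME, named after the upper-case header row. A variable that
# already exists in the environment is left untouched, so values can be
# overridden without editing the *.cfg file.

cfg_file=$1

read -r -a keys   < <(grep -v '^#' "$cfg_file" | head -1)
read -r -a values < <(awk -v p="$PLATFORM_NAME" '$1 == p {print; exit}' "$cfg_file")

for i in "${!keys[@]}"; do
    var=${keys[$i]}
    # skip assignment if the variable is already set in the environment
    if [[ -z ${!var+x} ]]; then
        export "$var=${values[$i]}"
    fi
done
```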

 cat ${HOME}/accounts/suggestion_platforms_defaults.cfg
###################################################################################################
echo "exporting environment variables using ~/accounts/platforms_defaults.cfg ..."
###################################################################################################
# set environment variables. For example for PLATFORM_NAME stampede2 do `export JOBSCHEDULER=SLURM`
###################################################################################################
PLATFORM_NAME JOBSCHEDULER QUEUENAME     JOB_SUBMISSION_SCHEME      JOBSHEDULER_PROJECTNAME  SCRATCHDIR                              WORKDIR
stampede2        SLURM     skx-normal launcher_multiTask_singleNode     TG-EAR200012         ${SCRATCH}                             ~/insarlab
frontera         SLURM       normal   launcher_multiTask_singleNode       EAR20013           ${SCRATCH}                             ~/insarlab
comet            SLURM       compute           singleTask                 EAR20013        /oasis/scratch/comet/$USER/temp_project   ~/insarlab
deqing_server    PBS         batch             singleTask               TG-EAR180012         ${SCRATCH}                             ~/insarlab
eos              PBS         batch             singleTask                    NONE         /scratch/insarlab/${USER_PREFERRED}       ~/insarlab
jetstream        NONE        NONE                 NONE                       NONE           /data/HDF5EOS                           ~/insarlab
mac              NONE        NONE                 NONE                       NONE          ~/insarlab/scratch                       ~/insarlab
cat minsar/defaults/queues.cfg 
PLATFORM_NAME  QUEUENAME     CPUS_PER_NODE  THREADS_PER_CORE  MEM_PER_NODE  MAX_JOBS_PER_WORKFLOW  MAX_JOBS_PER_QUEUE  WALLTIME_FACTOR
stampede2      skx-normal         48                 2             192000            12                    25                  1
stampede2      skx-dev            48                 2             192000            1                     25                  1
stampede2      normal             48                 4             96000             50                    25                  1
stampede2      development        48                 4             96000             1                     25                  1
frontera       normal             56                 1             192000            12                   100                  1
frontera       development        56                 1             192000            1                    100                  1
frontera       flex               56                 1             192000            12                   100                  1
frontera       nvdimm             48                 1             2100000           8                    100                  1
Ovec8hkin commented 3 years ago

> The step_io_load_list assignment then would be in submit_jobs.bash, I suppose. Please also implement reading of the io_load from `job_defaults.cfg`. [...]

Make a new issue please. Stop adding unrelated tasks to current issues.

Ovec8hkin commented 3 years ago

@mirzaees I wrote a generalized function that I just committed. Try it out. It is just a submission script, as I can't really generalize the "wait until finished" functionality. If that is important for your purposes, you can write a wrapper around the new sbatch_conditional function that does what you want.
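
Such a wrapper could be as simple as polling the queue after submission, along these lines (assuming SLURM and that the submitted job names start with the run file pattern):

```bash
#!/usr/bin/env bash
# Sketch of a wait-until-finished wrapper around the new sbatch_conditional.
# Assumes SLURM and that submitted job names start with the given pattern.

pattern=$1

sbatch_conditional.bash "$pattern" || exit 1

# poll the queue until no job whose name starts with the pattern remains
while squeue -u "$USER" -h -o "%j" | grep -q "^$(basename "$pattern")"; do
    sleep 60
done
echo "all ${pattern} jobs finished"
```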

mirzaees commented 3 years ago

Hi @Ovec8hkin , that works nicely, thank you!

I just don't know why you kept 'run_01' as an argument; we could use the step name itself to find patterns. Then, if a job name starts with 'run_01', the command would be: sbatch_conditional.bash --step_name run_01_unpack_topo_reference

Ovec8hkin commented 3 years ago

It's because of how Falk has defined the step names for submit_jobs.bash. Because they are independent of the "run*_" notation at the beginning, you need to pass the step name separately to properly look up tasks using that step name (since Falk claims that step names don't always run in the same order, depending on the workflow). In general, if you don't have a separate naming convention for step names, you won't need the --step_name option and can just run: sbatch_conditional.bash run_01_unpack_topo_reference.

mirzaees commented 3 years ago

That is right, thank you!

falkamelung commented 3 years ago

I have not understood this yet either. I would have expected to submit one job as sbatch_conditional.bash run_01_unpack_topo_reference_0.job or as sbatch_conditional.bash run_02_*_1.job. I suspect there is a reason for this?

An alternative would be to submit multiple jobs as sbatch_conditional.bash run_02_*.job. I don't remember whether we decided anything about that.

Josh, will you be available tomorrow afternoon? We should meet. Otherwise I will try to get on the same page with Sara (afternoon, Sara).

Ovec8hkin commented 3 years ago

@falkamelung The syntax you're using above is incorrect. You don't pass a glob of files to sbatch_conditional, just a run file pattern, i.e. "run_02" (distinct from "run_02*"). The script handles finding the proper files. The extra --step_name parameter is for manually defining the name of the processing step being used. In most cases, --step_name is not necessary and the script will default back to using the initially provided pattern. But for submit_jobs.bash we have to pass a manual --step_name due to the naming conventions.
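
In other words (illustrative invocations only):

```bash
sbatch_conditional.bash run_02        # correct: a pattern; the script finds run_02_*.job itself
sbatch_conditional.bash run_02_*.job  # wrong: the shell expands the glob into a list of files
```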

falkamelung commented 3 years ago

done