geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

submission options #232

Closed: falkamelung closed this issue 4 years ago

falkamelung commented 4 years ago

JOB_SUBMISSION_SCHEME= RUNFILETYPE_METHOD_SCHEDULER_QUEUENAME

JOB_SUBMISSION_SCHEME=singlefile_serial_LSF_general             #  pegasus default
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_general
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_parallel
JOB_SUBMISSION_SCHEME=multifile_parallel_LSF_parallel            # triton default
JOB_SUBMISSION_SCHEME=singlefile_launcher_LSF_parallel                   
JOB_SUBMISSION_SCHEME=singlefile_launcher_SLURM_skx-normal      #  stampede2 default
JOB_SUBMISSION_SCHEME=multifile_launcher_SLURM_skx-normal
JOB_SUBMISSION_SCHEME=singlefile_serial_SLURM_shared                   # Comet optional
JOB_SUBMISSION_SCHEME=singlefile_parallel_SLURM_compute            # Comet default

JOB_SUBMISSION_SCHEME=singlefile_serial_PBS_general              # deqing default

(JOB_SUBMISSION_SCHEME=multifilemultiscratch_parallel_SLURM_large-shared           # Comet: use the local SSD as scratch (Comet has 1.4 TB SSDs; not sure the queue name is correct); when all jobs have completed, move the merged dir to the common scratch.)
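For reference, a minimal sketch of how job_submission.py could split such a scheme string into its four components (the helper name and the default value are assumptions, not existing code):

```python
import os

def parse_job_submission_scheme(scheme=None):
    """Split RUNFILETYPE_METHOD_SCHEDULER_QUEUENAME into its four components.

    maxsplit=3 keeps any underscores inside the queue name intact.
    """
    scheme = scheme or os.getenv('JOB_SUBMISSION_SCHEME', 'singlefile_launcher_SLURM_skx-normal')
    run_file_type, method, scheduler, queue_name = scheme.split('_', 3)
    return run_file_type, method, scheduler, queue_name

# parse_job_submission_scheme('multifile_parallel_LSF_parallel')
# -> ('multifile', 'parallel', 'LSF', 'parallel')
```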

Use (for discussion); a sketch of how these could interact follows the list:

NUMBER_OF_CORES_PER_NODE                 # hardware, specify in platforms.bash
NUMBER_OF_THREADS_PER_CORE
NUMBER_OF_SIMULTANEOUS_TASKS             # because of IO concerns
MAX_NUMBER_OF_SIMULTANEOUS_TASKS = 1000
MAX_NUMBER_OF_NODES = 5                  # on small systems we can't get many nodes
MAX_WALLTIME = 2:00:00                   # on Stampede2 there is a limit for the development queue
HYPER_THREADING=True, False              # if False, OMP_NUM_THREADS=1 everywhere; if True, use OMP_NUM_THREADS from job_defaults.cfg

NUMBER_OF_NODES_PER_MULTI-FILE-JOB       # to be optimized on each platform
NUMBER_OF_NODES_PER_SINGLE-FILE-JOB      # stampede default: as much as you need (not applicable for METHOD=serial)
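A sketch of how these caps could interact (the helper name, the default values, and the return convention are assumptions; variable names follow the list above):

```python
import math
import os

def plan_resources(number_of_tasks):
    """Apply the platform caps listed above (assumed defaults)."""
    cores_per_node = int(os.getenv('NUMBER_OF_CORES_PER_NODE', 48))
    max_tasks = int(os.getenv('MAX_NUMBER_OF_SIMULTANEOUS_TASKS', 1000))
    max_nodes = int(os.getenv('MAX_NUMBER_OF_NODES', 5))
    hyper_threading = os.getenv('HYPER_THREADING', 'False') == 'True'

    simultaneous_tasks = min(number_of_tasks, max_tasks)               # capped because of IO concerns
    nodes = min(math.ceil(simultaneous_tasks / cores_per_node), max_nodes)
    omp_num_threads = None if hyper_threading else 1                   # None: take it from job_defaults.cfg
    return nodes, simultaneous_tasks, omp_num_threads
```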

submit_launcher_jobs (currently: submit_with_launcher)
submit_serial_jobs (currently: submit_jobs_individually)
submit_parallel_jobs

* Each function has two parts: one writing the job file(s) and one doing the job submission and waiting until the job(s) have completed. The second part remains unchanged. For the first part:

write_job_file(run_file_type=singlefile, method=serial)
(currently, for method=serial we have write_batch_job_files, and for method=launcher the lines of submit_with_launcher up to 'job_number = submit_single_job...')

* `write_job_file` has two parts: one to get the scheduler-specific lines, and one to get the job lines:

get_job_header_lines()
get_job_file_lines(run_file_list)

These functions should use the environment variables, or we add them all to the PARAMS variable (preferred) (PARAMS.method, PARAMS.number_of_node_per_code). For len(run_file_list)==1 we deal with one run file (singlefile, as currently); see the sketch below.
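A structural sketch of that split, assuming PARAMS is a simple namespace carrying method, queue, and walltime (the bodies are placeholders and SLURM-only; none of this is existing code):

```python
from types import SimpleNamespace

def get_job_header_lines(params):
    """Scheduler-specific directives (#SBATCH, #BSUB, #PBS, ...) built from PARAMS."""
    return [f'#SBATCH -p {params.queue}', f'#SBATCH -t {params.walltime}']   # SLURM-only sketch

def get_job_file_lines(run_file_list, params):
    """Task lines; len(run_file_list) == 1 is the single-file case as currently."""
    lines = []
    for run_file in run_file_list:
        with open(run_file) as f:
            tasks = [t.strip() for t in f if t.strip()]
        if params.method == 'parallel':
            lines += [t + ' &' for t in tasks] + ['wait']   # background the tasks, then wait
        else:                                               # serial, or launcher (launcher reads the task list itself)
            lines += tasks
    return lines

def write_job_file(run_file_list, params):
    """First part of the submit_* functions: write the job file(s); submission/waiting is unchanged."""
    job_file = run_file_list[0] + '.job'
    lines = ['#! /bin/bash'] + get_job_header_lines(params) + get_job_file_lines(run_file_list, params)
    with open(job_file, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return job_file

params = SimpleNamespace(method='parallel', queue='compute', walltime='02:00:00')
# write_job_file(['run_files/run_2_average_baseline'], params)
```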

* Disk Space issues

We also need the ability to NOT initiate processing if there is not enough space, i.e. to calculate the needed disk space and, if disk_space_required > SCRATCH_DISK_SPACE, not start the job but instead request to process fewer images and/or increase the number of looks:
SCRATCH_DISK_SPACE
DISK_SPACE_PER_BURST

disk_space_required=number_images * number_connections * DISK_SPACE_PER_BURST * number_bursts
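A hedged sketch of that check (the SCRATCHDIR variable name and the DISK_SPACE_PER_BURST default are assumptions; the formula is the one above):

```python
import os
import shutil

def enough_scratch_space(number_images, number_connections, number_bursts):
    """Return True if the estimated footprint fits into the free scratch space."""
    disk_space_per_burst = float(os.getenv('DISK_SPACE_PER_BURST', 2e9))   # bytes, assumed default
    scratch_dir = os.getenv('SCRATCHDIR', '/scratch')                      # assumed variable name

    disk_space_required = number_images * number_connections * disk_space_per_burst * number_bursts
    scratch_disk_space = shutil.disk_usage(scratch_dir).free

    if disk_space_required > scratch_disk_space:
        print(f'Need {disk_space_required / 1e12:.1f} TB but only '
              f'{scratch_disk_space / 1e12:.1f} TB free: '
              'process fewer images and/or increase the number of looks.')
        return False
    return True
```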

* Other useful info:
https://www.chpc.utah.edu/documentation/software/serial-jobs.php
https://support.nesi.org.nz/hc/en-gb/articles/360000690275-Parallel-Execution
https://waterprogramming.wordpress.com/2015/04/06/pbs-batch-serial-job-submission-with-slot-limit/
https://www.wm.edu/offices/it/services/hpc/using/jobs/multi-serial/index.php

* Preferably, job submission is done fully behind the scenes, i.e. job_submission.py reads the job-submission-related environment variables and creates the job scripts accordingly. I think it should also read job_defaults.cfg, whose values are overwritten by command-line arguments.
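A minimal sketch of that precedence (the file name comes from the text above; the specific option and flag names are assumptions):

```python
import argparse
import configparser
import os

def load_job_params(argv=None):
    """Precedence sketch: job_defaults.cfg < job-submission environment variables < command line."""
    cfg = configparser.ConfigParser()
    cfg.read('job_defaults.cfg')
    params = dict(cfg.defaults())                                   # lowest precedence

    for key in ('JOB_SUBMISSION_SCHEME', 'QUEUENAME', 'MAX_WALLTIME'):
        if os.getenv(key):
            params[key.lower()] = os.getenv(key)                    # environment overrides the cfg

    parser = argparse.ArgumentParser()
    parser.add_argument('--queue')
    parser.add_argument('--walltime')
    args, _ = parser.parse_known_args(argv)
    params.update({k: v for k, v in vars(args).items() if v is not None})   # command line wins
    return params
```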

* We used to have another way to submit jobs on Pegasus. Here is the code I found:

cat /nethome/sxh733/development/falk/rsmas_insar/sources/roipac/INT_SCR/split_jobs.py
cat /nethome/sxh733/development/falkold_keep/rsmas_insar/sources/roipac/INT_SCR/split_igramJobs_filt.py
cat /nethome/sxh733/development/rsmas_insar/sources/roipac/INT_SCR/runJobs.py

falkamelung commented 4 years ago

I tried the parallel queue. It works fine! (see below) The jobs need an ‘&' at the end of the line.

So I suggest checking for the existence of the environment variable below and doing the sixteen-jobs-per-node submission only if the variable exists, as indicated by NEW below. Once this is confirmed to be working we change the rest, which is obviously a more significant step.

If we do it this way, it can be in master without affecting anything (unless the variable is set)

-rw-rw--w-+ 1 dwg11         0 Mar  8 13:29 run_9_16parallel_parallel_23348462.e
-rw-rw--w-+ 1 dwg11     71601 Mar  8 13:29 run_9_16parallel_parallel_23348462.o
-rw-rw--w-+ 1 dwg11         0 Mar  8 14:16 run_9_5parallel_parallel_23348482.e
-rw-rw--w-+ 1 dwg11      6283 Mar  8 14:16 run_9_5parallel_parallel_23348482.o
-rw-rw--w-+ 1 dwg11      2266 Mar  8 14:42 run_9_10parallel_parallel.job
-rw-rw--w-+ 1 dwg11      2285 Mar  8 14:46 run_9_16parallel_parallel.job
-rw-rw--w-+ 1 dwg11         0 Mar  8 17:57 run_9_10parallel_parallel_23348605.e
-rw-rw--w-+ 1 dwg11     46207 Mar  8 17:57 run_9_10parallel_parallel_23348605.o
-rw-rw--w-+ 1 dwg11         0 Mar  8 19:41 run_9_16parallel_parallel_allthreads_.23348606.e
-rw-rw--w-+ 1 dwg11     71638 Mar  8 19:41 run_9_16parallel_parallel_allthreads_23348606.o
if os.getenv('JOBSCHEDULER') in ['SLURM', 'sge']:
    submit_job_with_launcher(batch_file=batch_file, out_dir=out_dir, memory=maxmemory, walltime=wall_time,
                             number_of_threads=num_threads, queue=queue)
else:
    if os.getenv('JOB_SUBMISSION_SCHEME'):                                                     # NEW
        jobs = submit_sixteen_jobs_per_node(batch_file=batch_file, out_dir=out_dir, ...)       # NEW
    else:                                                                                      # NEW
        jobs = submit_jobs_individually(batch_file=batch_file, out_dir=out_dir, memory=maxmemory,
                                        walltime=wall_time, queue=queue)
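For discussion, a sketch of what submit_sixteen_jobs_per_node could look like. Only the function name comes from the snippet above; the chunking, the '&' per task, and the trailing 'wait' follow the parallel-queue test described in this comment, and everything else (arguments, file naming) is assumed:

```python
import os

def submit_sixteen_jobs_per_node(batch_file, out_dir, tasks_per_node=16):
    """Write one job file per chunk of 16 tasks, each task backgrounded with '&', ending in 'wait'."""
    with open(batch_file) as f:
        tasks = [t.strip() for t in f if t.strip()]

    job_files = []
    for i in range(0, len(tasks), tasks_per_node):
        chunk = tasks[i:i + tasks_per_node]
        job_file = os.path.join(out_dir, f'{os.path.basename(batch_file)}_{i // tasks_per_node}.job')
        with open(job_file, 'w') as f:
            f.write('#! /bin/bash\n')
            f.write('\n'.join(task + ' &' for task in chunk))
            f.write('\nwait\n')     # without 'wait' the batch script exits before the backgrounded tasks finish
        job_files.append(job_file)
    return job_files                # scheduler headers and the actual submission stay in the existing code
```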
falkamelung commented 4 years ago

Minor other issue to fix in job_submission.py: on Stampede (probably also Pegasus) it gives the same message whether a job is pending or running. It would be preferred to say '... job run_2_unpack_slave_slc.job pending' before it is running.

run_2_unpack_slave_slc.job submitted as SLURM job #5393944
Waiting for job run_2_unpack_slave_slc.job output file after 0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 1.0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 2.0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 3.0 minutes
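A sketch of how the waiting loop could report the actual state instead of a generic message (SLURM only; squeue's %t field reports PD for pending and R for running; the helper name is an assumption):

```python
import subprocess

def slurm_job_state(job_number):
    """Return the SLURM state code: 'PD' pending, 'R' running, '' once the job left the queue."""
    result = subprocess.run(['squeue', '-h', '-j', str(job_number), '-o', '%t'],
                            capture_output=True, text=True)
    return result.stdout.strip()

# in the waiting loop (sketch):
# state = slurm_job_state(5393944)
# status = {'PD': 'pending', 'R': 'running'}.get(state, 'finished or unknown')
# print(f'Waiting for job run_2_unpack_slave_slc.job ({status}) after {minutes} minutes')
```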
falkamelung commented 4 years ago

Answer from the Comet helpdesk. He talks about a queue with a 1.4 TB SSD disk. This is probably very fast, but after processing, the different /merged dirs would need to be combined into a common scratch.

Thanks for the info. I think 240 should be ok depending on the IO pattern (if its big block reads we should be fine, if its a lot of small block IO we might see issues). I think we can try it out and watch the load. Maybe do it on Monday so I can ask our storage folks to keep an eye out.  

So from what you note the requirement is really 10TB? That is right at our per-user limit. I can set your limit slightly higher so you can do one case comfortably. You mentioned job sizes 3-4 times bigger. Is that compared to the 10TB (i.e are we talking 40TB?). That might be an issue to handle on the IO space side (I would have to get people to clean up a lot to create space and its certainly not comfortable at present). 

We do control allocations on Comet to keep wait times low. Our typical expansion factors are below 1.5 so my guess is if you ask for 10 nodes for 2 hours the wait time might be a few hours (I would expect it to be less than a day). 

In short, I think if you need 10TB and can run with 240 cores, we can give it a shot and see how things play out. If you need more like 40TB, we definitely need to plan things given how full the filesystem is running right now.

Mahidhar

p.s: All of this might change in the short run if we get a lot of COVID-19 research related simulations. As you may have seen XSEDE resources are being offered for urgent simulations. I have not seen large requests yet but that could change. This might be at all sites though.

On 2020-03-21 19:12:02, famelung@rsmas.miami.edu wrote:
Hi  Mahidar,
The ‘wait’ at the end does the job. Thank you! It would be more than 5
  tasks. It would be e.g. 240 tasks distributed on 10 nodes.

I hear you regarding the 1.4 TB SSDs. It would require significant
  changes to our code. And I am not sure it  would work. So this is
  not an option for now.

1. A typical job downloads 2TB of data and writes 8TB of temporary
  files. The final result is only 10-50 GB. We would like to run jobs
  3-4 times bigger.   Is there enough space on the Lustre filesystem?

2. How many IO-heavy jobs could I run simultaneously?  Would 240 work?
  I could limit the number of simultaneous tasks. This issue could be
  a reason to stay on Stampede2 where a few hundreds are no problem.

3. For now I am trying to find out whether it is worth moving to your
  system. Maybe you can give me some ideas about pending times? If I
  ask for 5 nodes-4 hours, 10 nodes-2 hours and 20 nodes-1 hour (hours is
  walltime), how long would average pending times be? On Stampede it
  takes forever (occasionally a day, and as one workflow consists
  of multiple jobs it may take several days). I see on XSEDE that
  Comet is at lower capacity, which is why I am trying.

4. In the shared queue I would be sharing nodes with other users,
  correct? This is an option but on the Miami HPC system it
  occasionally causes problems.

We install our own python which works for me on Comet.

Thank you!
Falk

On Mar 21, 2020, at 6:50 PM, Mahidhar Tatineni via RT <help@xsede.org> wrote:

Sat Mar 21 17:50:56 2020: Request 132476 was acted upon.
Transaction: Correspondence added by mahidhar
     Queue: 0-SDSC
   Subject: Comet - scratch disk and job submission questions
     Owner: mahidhar
Requestors: famelung@rsmas.miami.edu
    Status: open
Ticket URL: https://portal.xsede.org/group/xup/tickets/-/tickets/132476

Falk,

A few things to note (details on the queues are in our user guide):

[1] We have a "shared" partition that allows you to ask for *one* core
  specifically and it will land on a shared node. So one thing you
  can do is just submit 200 jobs (or an array job with 200 tasks) to
  this partition. So you would change:

#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

and then just have the serial job in the script.

[2] If you want to bundle on the "compute" partition, note that you
  are charged the whole node whether you use all 24 cores or not
  (unlike the "shared" partition where you are only charged for the
  cores requested).  So your second test would have been charged for
  24 cores even though you are only using 5 tasks.

[3] I think in your second test, you need a "wait" statement at the
  end. Otherwise, the script will background all the tasks and just
  exit. Also, does your code bind to specific cores or will it just
  float and pick up any free core. If there is some binding going on,
  you will need to make sure you are picking up free cores and not
  binding everything to core 0.

Going back to the inputs - what is the size of your largest dataset?
  When you mentioned TBs of data - is that for one of your datasets
  or the total for multiple datasets you anticipate? I ask because if
  you have 200 jobs using the same dataset and if its smaller than
  1.4TB we can still use the local SSD (in fact it is preferred). We
  have about a rack of nodes that have 1.4TB of SSD per node. You can
  have a tarred copy in Lustre and then untar into local scratch on a
  node and just have all 24 tasks use it from there.

Mahidhar

On 2020-03-21 17:17:39, famelung@rsmas.miami.edu wrote:
Hello Mahidhar,
Thank you so much for your quick response.  Re [2], unfortunately yes.
 Copying over to a local SSD is not really possible. Re [1]: I need
 the data for just a very short time, until one workflow is finished.
 Only, we don't have good cleaning scripts; I probably need to write
 them. So for testing, using
 /oasis/scratch/comet/famelung/temp_project seems the easiest.

But I don’t get my jobs to run. We have serial jobs and would like to
 run at least 200 of them at the same time. I tried two things. (1)
 A launcher job as we run on stampede2. Unfortunately it throws
 errors. (2) a simple 1-node job with 5 tasks. It did not produce
 results.  For the latter, if I don’t add an ‘&’ at the end they run
 but one after the other.

I must be doing something wrong. Any suggestions are appreciated.
Thank you
Falk

#################
1. Launcher job plus  error messages
#################
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[457] cat run_1_unpack_slc_topo_master.job
#! /bin/bash
#SBATCH -J run_1_unpack_slc_topo_master
#SBATCH -A TG-EAR180014
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 6
#SBATCH -o /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master_%J.o
#SBATCH -e /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master_%J.e
#SBATCH -p compute
#SBATCH -t 02:00:00

module load launcher
#export OMP_NUM_THREADS=16
export LAUNCHER_WORKDIR=/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files
export LAUNCHER_JOB_FILE=/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master

$LAUNCHER_DIR/paramrun
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[458] cat run_1_unpack_slc_topo_master_32216411.o
Launcher: Setup complete.

------------- SUMMARY ---------------
 Number of hosts:    1
 Working directory:
 /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files
 Processes per host: 6
 Total processes:    6
 Total jobs:         5
 Scheduling method:  block

-------------------------------------
Launcher: Starting parallel tasks...
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[459] head run_1_unpack_slc_topo_master_32216411.e
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'launcher'
using /tmp/launcher.32216411.hostlist.GU4Isct0 to get hosts
starting job on comet-01-46
ssh: threads.c:353: krb5int_key_register: Assertion `destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun: line 4: 24338 Aborted                 (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion `destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun: line 4: 24349 Aborted                 (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion `destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun: line 4: 24356 Aborted                 (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion `destructors_set[keynum] == 0' failed.

#################
Regular job that runs only very briefly but does not produce anything (it does run fine without '&', but then one task after the other)
#################

cat q1.job
#! /bin/bash
#SBATCH -J run_2_average_baseline
#SBATCH -A TG-EAR180014
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -o /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_2_average_baseline_%J.o
#SBATCH -e /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_2_average_baseline_%J.e
#SBATCH -p compute
#SBATCH -t 00:06:00

SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160629 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160723 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160804 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160816 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160828 &

//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[471] sbatch < q1.job
Submitted batch job 32216795
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[472] squeue -u ${USER}
           JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        32216795   compute run_2_av famelung  R       0:04      1 comet-11-33

On Mar 19, 2020, at 9:17 PM, Mahidhar Tatineni via RT <help@xsede.org> wrote:

Thu Mar 19 20:17:08 2020: Request 132476 was acted upon.
Transaction: Correspondence added by mahidhar
     Queue: 0-SDSC
   Subject: Comet - scratch disk and job submission questions
     Owner: mahidhar
Requestors: famelung@rsmas.miami.edu
    Status: open
Ticket URL: https://portal.xsede.org/group/xup/tickets/-/tickets/132476

Falk,

Answers:

[1] We don't auto-purge this location but since the filesystem is
 quite full we do tell users to periodically clean up. How long do
 you anticipate needing the data? If its a few weeks to a month it
 should be ok (ideally we want clean up after a maximum of 3 months)

[2] Does every serial job need access to all the data? If its a subset
 of the dataset, I suggest copying to local SSD on each node. If you
 can do this, you should be able to scale up pretty easily. If not,
 1000 will be a problem I think. I would suggest doing fewer nodes
 but for longer duration.

[3] Depending on the answer to [2], we might be able to run in the
 "shared" partition instead of using launcher. Either option is ok.
 Depending on the IO load we might have to choose the smaller number
 of nodes with longer run times.

Mahidhar

On 2020-03-19 19:58:51, famelung@rsmas.miami.edu wrote:
Hi,
I just started to examine whether Comet is appropriate for our data
processing jobs. We need a few TB of scratch space for our jobs, in
part for raw data which are downloaded at the beginning of the
workflow. I was thinking of using
/oasis/scratch/comet/$USER/temp_project

Here are a couple of questions:
1. Will this disk be purged, how frequently, and is disk usage free or
charged to my account?
2. Our data processing consists of IO-intensive serial jobs. How many
processes can safely access this disk at the same time? (For
Stampede I was told about 1000 simultaneous processes are fine,
which is a surprisingly high number.)
3. Our processing workflow consists of 10 steps, each consisting of,
say, 240 serial jobs each requiring 30 minutes walltime. A new
processing step is started when the previous step is completed (I
plan to use launcher for serial job submission, which we use on
Stampede). For each step we could request 10 nodes and 30 minutes
walltime, 5 nodes with 60 minutes walltime, or 2 nodes with 150
minutes walltime (running 120 jobs sequentially on each node).
What would you suggest? Asking for a large number of nodes with
short walltime or for a small number of nodes but longer wall
times?
Thank you
Falk
Ovec8hkin commented 4 years ago
JOB_SUBMISSION_SCHEME=singlefile_serial_LSF_general             #  pegasus default
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_general
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_parallel
JOB_SUBMISSION_SCHEME=multifile_parallel_LSF_parallel            # triton default
JOB_SUBMISSION_SCHEME=singlefile_launcher_LSF_parallel                   
JOB_SUBMISSION_SCHEME=singlefile_launcher_SLURM_skx-normal      #  stampede2 default
JOB_SUBMISSION_SCHEME=multifile_launcher_SLURM_skx-normal
JOB_SUBMISSION_SCHEME=singlefile_serial_SLURM_shared                   # Comet optional
JOB_SUBMISSION_SCHEME=singlefile_parallel_SLURM_compute            # Comet default

JOB_SUBMISSION_SCHEME=singlefile_serial_PBS_general              # deqing

Speaking of the above syntax, I would break this up to be more readable. I would specify the following distinct variables:

CLUSTER_TYPE = SLURM/LSF/PBS
LAUNCHER = serial/parallel/launcher
QUEUE = #whatever queue is being used
NUM_FILES = singlefile/multifile

These could be specified in the platforms_defaults.bash file for the appropriate systems.
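For discussion, a sketch of how both conventions could coexist, so that a single export still switches schemes while the separate variables stay readable (the variable names follow the list above; the fallback logic is an assumption):

```python
import os

def get_submission_settings():
    """Prefer the four separate variables; fall back to the combined JOB_SUBMISSION_SCHEME string."""
    parts = (os.getenv('JOB_SUBMISSION_SCHEME') or '___').split('_', 3)
    return {
        'num_files':    os.getenv('NUM_FILES', parts[0]),      # singlefile / multifile
        'launcher':     os.getenv('LAUNCHER', parts[1]),       # serial / parallel / launcher
        'cluster_type': os.getenv('CLUSTER_TYPE', parts[2]),   # SLURM / LSF / PBS
        'queue':        os.getenv('QUEUE', parts[3]),
    }
```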

falkamelung commented 4 years ago

I thought this sort-of should be done in the code. The advantage of having one environment variable is that if wait times are too long, you can switch schemes just by changing one variable. But I am open as long as switching is easy.

Ovec8hkin commented 4 years ago

Quick to change may be a benefit of the long form you suggested initially, but it's difficult for other people to know what all of the options in JOB_SUBMISSION_SCHEME actually mean and are doing. It's easier to understand if you make the different parts separate environment variables.

falkamelung commented 4 years ago

Sara implemented most of this.