Closed by falkamelung 4 years ago
I tried the parallel queue. It works fine (see below)! The jobs need an '&' at the end of each line.
So I suggest checking for the existence of the environment variable below and doing the sixteen-jobs-per-node submission only if the variable exists, as indicated by NEW below. Once this is confirmed to be working we change the rest, which is obviously a more significant step.
If we do it this way, it can be in master without affecting anything (unless the variable is set).
-rw-rw--w-+ 1 dwg11 0 Mar 8 13:29 run_9_16parallel_parallel_23348462.e
-rw-rw--w-+ 1 dwg11 71601 Mar 8 13:29 run_9_16parallel_parallel_23348462.o
-rw-rw--w-+ 1 dwg11 0 Mar 8 14:16 run_9_5parallel_parallel_23348482.e
-rw-rw--w-+ 1 dwg11 6283 Mar 8 14:16 run_9_5parallel_parallel_23348482.o
-rw-rw--w-+ 1 dwg11 2266 Mar 8 14:42 run_9_10parallel_parallel.job
-rw-rw--w-+ 1 dwg11 2285 Mar 8 14:46 run_9_16parallel_parallel.job
-rw-rw--w-+ 1 dwg11 0 Mar 8 17:57 run_9_10parallel_parallel_23348605.e
-rw-rw--w-+ 1 dwg11 46207 Mar 8 17:57 run_9_10parallel_parallel_23348605.o
-rw-rw--w-+ 1 dwg11 0 Mar 8 19:41 run_9_16parallel_parallel_allthreads_.23348606.e
-rw-rw--w-+ 1 dwg11 71638 Mar 8 19:41 run_9_16parallel_parallel_allthreads_23348606.o
if os.getenv('JOBSCHEDULER') in ['SLURM', 'sge']:
    submit_job_with_launcher(batch_file=batch_file, out_dir=out_dir, memory=maxmemory,
                             walltime=wall_time, number_of_threads=num_threads, queue=queue)
else:
    if os.getenv('JOB_SUBMISSION_SCHEME'):                                                # NEW
        jobs = submit_sixteen_jobs_per_node(batch_file=batch_file, out_dir=out_dir,       # NEW
                                            memory=maxmemory, walltime=wall_time,         # NEW
                                            queue=queue)                                  # NEW
    else:                                                                                 # NEW
        jobs = submit_jobs_individually(batch_file=batch_file, out_dir=out_dir, memory=maxmemory,
                                        walltime=wall_time, queue=queue)
Minor other issues to fix in job_submission.py:
On Stampede (probably also Pegasus) it gives the same message whether a job is pending or running. It would be preferable to say e.g. ...job run_2_unpack_slave_slc.job pending
before it is running.
run_2_unpack_slave_slc.job submitted as SLURM job #5393944
Waiting for job run_2_unpack_slave_slc.job output file after 0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 1.0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 2.0 minutes
Waiting for job run_2_unpack_slave_slc.job output file after 3.0 minutes
Answer from COMET Helpdesk. He talks about a queue with a 1.4 TB SSD disk. This is probably very fast, but after processing, different /merged dirs need to be combined to a common scratch.
Thanks for the info. I think 240 should be ok depending on the IO pattern (if it's big block reads we should be fine, if it's a lot of small block IO we might see issues). I think we can try it out and watch the load. Maybe do it on Monday so I can ask our storage folks to keep an eye out.
So from what you note the requirement is really 10TB? That is right at our per-user limit. I can set your limit slightly higher so you can do one case comfortably. You mentioned job sizes 3-4 times bigger. Is that compared to the 10TB (i.e. are we talking 40TB?). That might be an issue to handle on the IO space side (I would have to get people to clean up a lot to create space and it's certainly not comfortable at present).
We do control allocations on Comet to keep wait times low. Our typical expansion factors are below 1.5 so my guess is if you ask for 10 nodes for 2 hours the wait time might be a few hours (I would expect it to be less than a day).
In short, I think if you need 10TB and can run with 240 cores, we can give it a shot and see how things play out. If you need more like 40TB, we definitely need to plan things given how full the filesystem is running right now.
Mahidhar
p.s: All of this might change in the short run if we get a lot of COVID-19 research related simulations. As you may have seen XSEDE resources are being offered for urgent simulations. I have not seen large requests yet but that could change. This might be at all sites though.
On 2020-03-21 19:12:02, famelung@rsmas.miami.edu wrote:
Hi Mahidhar,
The ‘wait’ at the end does the job. Thank you! It would be more than 5
tasks. It would be e.g. 240 tasks distributed on 10 nodes.
I hear you regarding the 1.4 TB SSDs. It would require significant
changes to our code. And I am not sure it would work. So this is
not an option for now.
1. A typical job downloads 2TB of data and writes 8TB of temporary
files. The final result is only 10-50 GB. We would like to run jobs
3-4 times bigger. Is there enough space on the Lustre filesystem?
2. How many IO-heavy jobs could I run simultaneously? Would 240 work?
I could limit the number of simultaneous tasks. This issue could be
a reason to stay on Stampede2, where a few hundred are no problem.
3. For now I am trying to find out whether it is worth moving to your
system. Maybe you can give me some idea of pending times? If I
ask for 5 nodes-4 hours, 10 nodes-2 hours, or 20 nodes-1 hour (hours is
walltime), how long would average pending times be? On Stampede it
takes forever (occasionally a day, and as one workflow consists
of multiple jobs it may take several days). I see on XSEDE that
Comet is at lower capacity, which is why I am trying.
4. In the shared queue I would be sharing nodes with other users,
correct? This is an option but on the Miami HPC system it
occasionally causes problems.
We install our own python which works for me on Comet.
Thank you!
Falk
On Mar 21, 2020, at 6:50 PM, Mahidhar Tatineni via RT
<help@xsede.org> wrote:
Sat Mar 21 17:50:56 2020: Request 132476 was acted upon.
Transaction: Correspondence added by mahidhar
Queue: 0-SDSC
Subject: Comet - scratch disk and job submission questions
Owner: mahidhar
Requestors: famelung@rsmas.miami.edu
Status: open
Ticket URL: https://portal.xsede.org/group/xup/tickets/-/tickets/132476
Falk,
A few things to note (details on the queues are in our user guide):
[1] We have a "shared" partition that allows you to ask for *one* core
specifically and it will land on a shared node. So one thing you
can do is just submit 200 jobs (or an array job with 200 tasks) to
this partition. So you would change:
#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
and then just have the serial job in the script.
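Mahidhar's suggestion [1] could look like the following array-job sketch. Only the three directives he lists come from his reply; the job name, walltime, array size, and the config-list mechanism are illustrative assumptions.

```shell
#! /bin/bash
# Hypothetical array job for the Comet "shared" partition: 200 one-core
# serial tasks, each charged only for the single core it requests.
#SBATCH -J serial_array
#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:30:00
#SBATCH --array=1-200

# Pick the config file for this array index from a list (one path per line);
# config_list.txt and SentinelWrapper.py usage are illustrative here.
CONFIG=$(sed -n "${SLURM_ARRAY_TASK_ID}p" config_list.txt)
SentinelWrapper.py -c "$CONFIG"
```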
[2] If you want to bundle on the "compute" partition, note that you
are charged for the whole node whether you use all 24 cores or not
(unlike the "shared" partition where you are only charged for the
cores requested). So your second test would have been charged for
24 cores even though you are only using 5 tasks.
[3] I think in your second test, you need a "wait" statement at the
end. Otherwise, the script will background all the tasks and just
exit. Also, does your code bind to specific cores or will it just
float and pick up any free core? If there is some binding going on,
you will need to make sure you are picking up free cores and not
binding everything to core 0.
Going back to the inputs - what is the size of your largest dataset?
When you mentioned TBs of data - is that for one of your datasets
or the total for multiple datasets you anticipate? I ask because if
you have 200 jobs using the same dataset and if it's smaller than
1.4TB we can still use the local SSD (in fact it is preferred). We
have about a rack of nodes that have 1.4TB of SSD per node. You can
have a tarred copy in Lustre and then untar into local scratch on a
node and just have all 24 tasks use it from there.
Mahidhar
On 2020-03-21 17:17:39, famelung@rsmas.miami.edu wrote:
Hello Mahidhar,
Thank you so much for your quick response. Re [2], unfortunately yes.
Copying over to a local SSD is not really possible. Re [1]: I need
the data for just a very short time, until one workflow is finished.
Only, we don't have good cleaning scripts; I probably need to write
them. So for testing, using
/oasis/scratch/comet/famelung/temp_project seems the easiest.
But I can't get my jobs to run. We have serial jobs and would like to
run at least 200 of them at the same time. I tried two things: (1)
a launcher job as we run on Stampede2; unfortunately it throws
errors. (2) A simple 1-node job with 5 tasks; it did not produce
results. For the latter, if I don't add an '&' at the end they run,
but one after the other.
I must be doing something wrong. Any suggestions are appreciated.
Thank you
Falk
#################
1. Launcher job plus error messages
#################
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[457]
cat run_1_unpack_slc_topo_master.job
#! /bin/bash
#SBATCH -J run_1_unpack_slc_topo_master
#SBATCH -A TG-EAR180014
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 6
#SBATCH -o /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master_%J.o
#SBATCH -e /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master_%J.e
#SBATCH -p compute
#SBATCH -t 02:00:00
module load launcher
#export OMP_NUM_THREADS=16
export LAUNCHER_WORKDIR=/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files
export LAUNCHER_JOB_FILE=/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_1_unpack_slc_topo_master
$LAUNCHER_DIR/paramrun
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[458]
cat run_1_unpack_slc_topo_master_32216411.o
Launcher: Setup complete.
------------- SUMMARY ---------------
Number of hosts: 1
Working directory:
/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files
Processes per host: 6
Total processes: 6
Total jobs: 5
Scheduling method: block
-------------------------------------
Launcher: Starting parallel tasks...
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[459]
head run_1_unpack_slc_topo_master_32216411.e
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for
'launcher'
using /tmp/launcher.32216411.hostlist.GU4Isct0 to get hosts
starting job on comet-01-46
ssh: threads.c:353: krb5int_key_register: Assertion
`destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun:
line 4: 24338 Aborted (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion
`destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun:
line 4: 24349 Aborted (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion
`destructors_set[keynum] == 0' failed.
/home/famelung/test/operations/rsmas_insar/3rdparty/launcher/paramrun:
line 4: 24356 Aborted (core dumped) "$@"
ssh: threads.c:353: krb5int_key_register: Assertion
`destructors_set[keynum] == 0' failed.
#################
Regular job that runs very briefly but does not produce anything (it
does run well without '&', but one task after the other)
#################
cat q1.job
#! /bin/bash
#SBATCH -J run_2_average_baseline
#SBATCH -A TG-EAR180014
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -o /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_2_average_baseline_%J.o
#SBATCH -e /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files/run_2_average_baseline_%J.e
#SBATCH -p compute
#SBATCH -t 00:06:00
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160629 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160723 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160804 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160816 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160828 &
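Combining the two fixes discussed in this thread (an '&' on each task line plus a final 'wait'), the task section of q1.job would end like this sketch:

```shell
# Each wrapper call is backgrounded; 'wait' keeps the batch script alive
# until all tasks have finished (without it, the script exits immediately
# and SLURM ends the job before the backgrounded tasks complete).
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160629 &
SentinelWrapper.py -c /oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/configs/config_slave_20160723 &
# ... remaining config_slave_* tasks, each with a trailing '&' ...
wait
```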
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[471]
sbatch < q1.job
Submitted batch job 32216795
//comet-ln2/oasis/scratch/comet/famelung/temp_project/unittestGalapagosSenDT128/run_files[472]
squeue -u ${USER}
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
32216795 compute run_2_av famelung R 0:04 1 comet-11-33
On Mar 19, 2020, at 9:17 PM, Mahidhar Tatineni via RT
<help@xsede.org> wrote:
Thu Mar 19 20:17:08 2020: Request 132476 was acted upon.
Transaction: Correspondence added by mahidhar
Queue: 0-SDSC
Subject: Comet - scratch disk and job submission questions
Owner: mahidhar
Requestors: famelung@rsmas.miami.edu
Status: open
Ticket URL: https://portal.xsede.org/group/xup/tickets/-/tickets/132476
Falk,
Answers:
[1] We don't auto-purge this location, but since the filesystem is
quite full we do tell users to periodically clean up. How long do
you anticipate needing the data? If it's a few weeks to a month it
should be ok (ideally we want cleanup after a maximum of 3 months).
[2] Does every serial job need access to all the data? If it's a subset
of the dataset, I suggest copying to local SSD on each node. If you
can do this, you should be able to scale up pretty easily. If not,
1000 will be a problem I think. I would suggest doing fewer nodes
but for longer duration.
[3] Depending on the answer to [2], we might be able to run in the
"shared" partition instead of using launcher. Either option is ok.
Depending on the IO load we might have to choose the smaller number
of nodes with longer run times.
Mahidhar
On 2020-03-19 19:58:51, famelung@rsmas.miami.edu wrote:
Hi,
I just started to examine whether Comet is appropriate for our data
processing jobs. We need a few TB of scratch space for our jobs, in
part for raw data which are downloaded at the beginning of the
workflow. I was thinking to use
/oasis/scratch/comet/$USER/temp_project
Here are a couple of questions:
1. Will this disk be purged, how frequently, and is disk usage free or
charged to my account?
2. Our data processing consists of IO-intensive serial jobs. How many
processes can safely access this disk at the same time? (For
Stampede I was told about 1000 simultaneous processes are fine,
which is a surprisingly high number.)
3. Our processing workflow consists of 10 steps, each consisting of,
say, 240 serial jobs each requiring 30 minutes walltime. A new
processing step is started when the previous step is completed (I
plan to use launcher for serial job submission, which we use on
Stampede). For each step we could request 10 nodes and 30 minutes
walltime, 5 nodes with 60 minutes walltime, or 2 nodes with 150
minutes walltime (running 120 jobs sequentially on each
node). What would you suggest: asking for a large number of nodes
with short walltimes, or for a small number of nodes but longer
walltimes?
Thank you
Falk
JOB_SUBMISSION_SCHEME=singlefile_serial_LSF_general # pegasus default
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_general
JOB_SUBMISSION_SCHEME=singlefile_parallel_LSF_parallel
JOB_SUBMISSION_SCHEME=multifile_parallel_LSF_parallel # triton default
JOB_SUBMISSION_SCHEME=singlefile_launcher_LSF_parallel
JOB_SUBMISSION_SCHEME=singlefile_launcher_SLURM_skx-normal # stampede2 default
JOB_SUBMISSION_SCHEME=multifile_launcher_SLURM_skx-normal
JOB_SUBMISSION_SCHEME=singlefile_serial_SLURM_shared # Comet optional
JOB_SUBMISSION_SCHEME=singlefile_parallel_SLURM_compute # Comet default
JOB_SUBMISSION_SCHEME=singlefile_serial_PBS_general # deqing
Speaking of the above syntax, I would break this up to be more readable. I would specify the following distinct variables:
CLUSTER_TYPE = SLURM/LSF/PBS
LAUNCHER = serial/parallel/launcher
QUEUE = #whatever queue is being used
NUM_FILES = singlefile/multifile
These could be specified in the platforms_defaults.bash
file for the appropriate systems.
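For example, the Comet defaults could be expressed this way in platforms_defaults.bash. The variable names are the ones proposed above, and the values mirror the singlefile_parallel_SLURM_compute scheme; none of this exists in the code yet.

```shell
# Hypothetical platforms_defaults.bash entries for Comet, splitting
# JOB_SUBMISSION_SCHEME=singlefile_parallel_SLURM_compute into parts:
export CLUSTER_TYPE=SLURM
export LAUNCHER=parallel
export QUEUE=compute
export NUM_FILES=singlefile
```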
I thought this sort of thing should be done in the code. The advantage of having one environment variable is that if wait times are too long, you can switch schemes just by changing one variable. But I am open as long as switching is easy.
Quick switching may be a benefit of the long form you suggested initially, but it's difficult for other people to know what all of the options in JOB_SUBMISSION_SCHEME actually mean and are doing. It's easier to understand if you make the different parts separate environment variables.
Sara implemented most of this.
JOB_SUBMISSION_SCHEME=RUNFILETYPE_METHOD_SCHEDULER_QUEUENAME
These environment variables should be read at the beginning of job_submission.py
and will be available as PARAMS.scheduler, PARAMS.queue, PARAMS.number_of_cores_per_node, etc. (Sara, we have PARAMS already; shall we use something else?)
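As a sketch, the parsing at the top of job_submission.py could look like this. SchemeParams and parse_scheme are illustrative names, not the actual implementation.

```python
import os
from collections import namedtuple

# Hypothetical container mirroring the proposed PARAMS attributes.
SchemeParams = namedtuple('SchemeParams', ['run_file_type', 'method', 'scheduler', 'queue'])

def parse_scheme(scheme=None):
    """Split JOB_SUBMISSION_SCHEME (RUNFILETYPE_METHOD_SCHEDULER_QUEUENAME)
    into its four parts. Split only three times so that queue names
    containing underscores (e.g. a hypothetical 'general_long') survive."""
    scheme = scheme or os.getenv('JOB_SUBMISSION_SCHEME', '')
    run_file_type, method, scheduler, queue = scheme.split('_', 3)
    return SchemeParams(run_file_type, method, scheduler, queue)

params = parse_scheme('singlefile_launcher_SLURM_skx-normal')
```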
We need these functions (run from execute_runfiles.py):
submit_launcher_jobs (currently: submit_with_launcher)
submit_serial_jobs (currently: submit_jobs_individually)
submit_parallel_jobs
write_job_file (run_file_type=singlefile, method=serial)
(currently we have, for method=serial, write_batch_job_files and, for method=launcher, the lines of submit_with_launcher until 'job_number = submit_single_job...')
get_job_header_lines()
get_job_file_lines(run_file_list)
cat /nethome/sxh733/development/falk/rsmas_insar/sources/roipac/INT_SCR/split_jobs.py
cat /nethome/sxh733/development/falkold_keep/rsmas_insar/sources/roipac/INT_SCR/split_igramJobs_filt.py
cat /nethome/sxh733/development/rsmas_insar/sources/roipac/INT_SCR/runJobs.py