So the requested options are: --max_jobs, --sleep, --job_name, and --max_nodes?
That is true. We could have a submit_jobs_limits.cfg in the /defaults directory, in a format similar to https://github.com/geodesymiami/rsmas_insar/blob/master/minsar/defaults/job_defaults.cfg. It will only have a few entries, but it can grow as needed. Eventually the cfg files will be merged. You could write the .cfg reader such that it is easy to switch to a format similar to:
------------------------------------------------------------------------------------------------------
name                     c_walltime  s_walltime  c_memory  s_memory  num_threads  max_tasks
------------------------------------------------------------------------------------------------------
default                  02:00:00    0           3000      0         2            -
# topsStack
unpack_topo_reference    0           00:01:00    4000      0         8            -
unpack_secondary_slc     0           00:00:10    4000      0         2            500
average_baseline         0           00:00:10    1000      0         2            -
extract_burst_overlaps   0           00:00:10    4000      0         2            800
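A minimal sketch of a reader for a file in this format (the file name, location, and step name below are just placeholders taken from this discussion, not existing code):

# read max_tasks (column 7) for one step from a whitespace-delimited defaults file
cfg_file="$RSMASINSAR_HOME/minsar/defaults/submit_jobs_limits.cfg"   # assumed location
step="unpack_secondary_slc"
# skip comment (#) and separator (---) lines, then match the step name in column 1
max_tasks=$(grep -v -e '^#' -e '^-' "$cfg_file" | awk -v s="$step" '$1 == s {print $7}')
echo "max_tasks for $step: ${max_tasks:--}"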
@falkamelung There is no easy way to get a listing of active jobs by name from squeue. squeue -n can list jobs that exactly match the given name, but it doesn't support pattern matching (so I can't do squeue -n *pairs_misreg*). Please determine how to get the number of active nodes for ONLY jobs matching the desired naming pattern.
Additionally, please elaborate on the tasks computation. What is the maximum number, and how do I compute the current number of active tasks?
Did you see https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-757153866?
You grep the file name and then run wc -l. For nodes you could do something similar, but we'd better do tasks, and I would think it works.
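For illustration, a minimal sketch of that grep-the-filename plus wc -l idea, assuming the squeue job name matches the run file name and all run files live in the local run_files directory (both assumptions; see the scontrol discussion further down):

# count tasks of running/pending jobs whose name contains the step name
step="fullBurst_geo2rdr"
num_active_tasks=0
for job_name in $(squeue -u "$USER" --noheader --format="%j" | grep "$step"); do
    run_file="run_files/${job_name%.job}"        # one task per line in the run file
    [ -f "$run_file" ] && num_active_tasks=$(( num_active_tasks + $(wc -l < "$run_file") ))
done
echo "$step: number of running/pending tasks is $num_active_tasks"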
Yes I saw the previous comment.
Is the number of tasks independent of the number of nodes utilized? What is the maximum number of tasks that can be submitted? How do I check the number of currently running tasks (reading all of the files isn't very efficient and seems overcomplicated)? Why can't we use nodes anymore?
The number of tasks running in each job varies depending on the memory requirement. Sometimes all 48 cores are used and sometimes only 10. TACC asked me to limit the number of jobs, but it is the number of simultaneous tasks that matters in terms of IO load.
The max_tasks depends on the run_0* step. For now just use 500.
Yes, it is not efficient, but I don't know how else to do it. Another option would be to store all the relevant job parameters in one file which is read by submit_jobs.bash, but this sounds more complicated.
We could generate a file with the number of tasks and read it at the beginning of submit_jobs.bash:
wc -l run_*_{0,1,2,3,4,5,6,,8,9} | sort -k2,2
wc: run_*_: No such file or directory
1 run_01_unpack_topo_reference_0
37 run_02_unpack_secondary_slc_0
37 run_02_unpack_secondary_slc_1
37 run_02_unpack_secondary_slc_2
34 run_02_unpack_secondary_slc_3
37 run_03_average_baseline_0
37 run_03_average_baseline_1
37 run_03_average_baseline_2
34 run_03_average_baseline_3
21 run_04_fullBurst_geo2rdr_0
21 run_04_fullBurst_geo2rdr_1
21 run_04_fullBurst_geo2rdr_2
21 run_04_fullBurst_geo2rdr_3
21 run_04_fullBurst_geo2rdr_4
21 run_04_fullBurst_geo2rdr_5
19 run_04_fullBurst_geo2rdr_6
21 run_05_fullBurst_resample_0
21 run_05_fullBurst_resample_1
21 run_05_fullBurst_resample_2
21 run_05_fullBurst_resample_3
21 run_05_fullBurst_resample_4
21 run_05_fullBurst_resample_5
19 run_05_fullBurst_resample_6
1 run_06_extract_stack_valid_region_0
38 run_07_merge_reference_secondary_slc_0
38 run_07_merge_reference_secondary_slc_1
38 run_07_merge_reference_secondary_slc_2
38 run_07_merge_reference_secondary_slc_3
48 run_08_generate_burst_igram_0
48 run_08_generate_burst_igram_1
48 run_08_generate_burst_igram_2
48 run_08_generate_burst_igram_3
48 run_08_generate_burst_igram_4
48 run_08_generate_burst_igram_5
48 run_08_generate_burst_igram_6
48 run_08_generate_burst_igram_8
48 run_08_generate_burst_igram_9
48 run_09_merge_burst_igram_0
48 run_09_merge_burst_igram_1
48 run_09_merge_burst_igram_2
48 run_09_merge_burst_igram_3
48 run_09_merge_burst_igram_4
48 run_09_merge_burst_igram_5
48 run_09_merge_burst_igram_6
48 run_09_merge_burst_igram_8
48 run_09_merge_burst_igram_9
48 run_10_filter_coherence_0
48 run_10_filter_coherence_1
48 run_10_filter_coherence_2
48 run_10_filter_coherence_3
48 run_10_filter_coherence_4
48 run_10_filter_coherence_5
48 run_10_filter_coherence_6
48 run_10_filter_coherence_8
48 run_10_filter_coherence_9
48 run_11_unwrap_0
48 run_11_unwrap_1
48 run_11_unwrap_2
48 run_11_unwrap_3
48 run_11_unwrap_4
48 run_11_unwrap_5
48 run_11_unwrap_6
48 run_11_unwrap_8
48 run_11_unwrap_9
Ok. 1 and 2 are done (not pushed yet).
How do I check the number of active wget processes running?
re 3: What is a valid command line call to ssara_federated_query.bash? I don't know what the command line options are. Send examples for both "unittestGalapagosDT128" and "KilaueaSenAT124".
For Kilauea:
ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download
unittestGalapagos does not do any download, but $SAMPLESDIR/GalapagosSenDT128.template does (it is the same as unittestGalapagos but with download):
minsarApp.bash $SAMPLESDIR/GalapagosSenDT128.template
The download command is in ssara_command.txt
Ok. Is it simple enough to run ps aux | grep "wget" | wc -l to determine the number of active wget processes?
I think so. You have to remove the timeout processes for counting.
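Something along these lines might do (a sketch only; the limit of 5 comes from the discussion above):

# list this user's processes, keep only real wget commands: "[w]get" keeps the grep
# itself out of the count, and "grep -v timeout" drops the timeout wrapper processes
num_wget=$(ps -u "$USER" -o command= | grep "[w]get" | grep -v "timeout" | wc -l)
if [ "$num_wget" -ge 5 ]; then
    echo "$num_wget wget processes already running; waiting before starting new downloads"
fi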
Is there a reason we don't do something like this:
for u in "${urls[@]}"; do
    timeout $timeout wget --continue --user $user --password $passwd $u
done
but instead run the wget command through xargs? I know initially it was because xargs allows us to run wget in parallel, but it seems like a simple for loop would let us do the same thing (it would just spawn additional, independent wget processes). I don't think we can use the xargs syntax in conjunction with wget limits, as the xargs syntax submits all of the wget requests at once and just waits for them to become active.
I think so. The xargs command worked, but my impression was that it does not give the flexibility we need.
Ok. I will translate to the above for loop syntax then. Do we need to check the exit status of each and every wget process for failure? That becomes complicated when lots of processes are running in parallel.
Hmmm.... The wget failures were always a problem, but lately downloads have worked fine. I am not sure whether this is because the data source was stable or because of our code; I suspect the former. I am not positive that our code works. Even when downloading worked fine, I see messages like:
20210114:13-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download
20210114:13-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download
20210114:13-01 * Something went wrong. Exit code was 123. Trying again with second timeout
So, we probably can get rid of the exit_code checking, but can we have another way to check whether it was successful? Maybe the --spider option can be used for this: https://stackoverflow.com/questions/6264726/can-i-use-wget-to-check-but-not-download
It could also be that the exit code checking was linked to the timeout command.
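A possible spot-check along those lines, assuming $url is one of the download URLs and the file is saved under its basename (both assumptions):

# ask the server for the file size without downloading (--spider) and compare
# it with what is on disk; a mismatch suggests the download should be redone
remote_size=$(wget --spider --server-response --user "$user" --password "$passwd" "$url" 2>&1 | \
    awk '/Content-Length/ {print $2}' | tail -1)
local_size=$(stat -c %s "$(basename "$url")" 2>/dev/null || echo 0)
if [ "$remote_size" != "$local_size" ]; then
    echo "$(basename "$url"): local size $local_size != remote size $remote_size, re-downloading"
fi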
We also have a script check_downloads.py --delete which I use when things don't work. This could be run inside ssara_federated_query.bash. I was also considering running this in minsarApp.bash.
I probably need to take back that the exit_code checking is not working. I do see some downloads where ssara was run again. I might have done that manually, but I don't think so.
cat KokoxiliBigSenAT41/SLC/log
20210111:112010 * ssara_federated_query.py --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --kml --maxResults=20000
20210111:11-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
20210111:14-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210111:14-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210111:14-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
20210111:21-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210112:05-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
cat ManyiBigSenAT12/SLC/log
20210110:223210 * ssara_federated_query.py --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --kml --maxResults=20000
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with second timeout
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download
20210111:08-01 * Something went wrong. Exit code was 123. Trying again with second timeout
Proper checking for download success and re-downloading remains a key feature of our software.
I found a way to link the wget PID to each individual download file, so I should be able to check whether each PID exited cleanly or not, and if not redo the failed download.
Ok. I think I got it figured out now (some tests are currently running to check). What I ended up doing is batching the list of download URLs into 5-element groups and passing each group of 5 to wget through xargs to run in parallel (5 at a time). This ensures that only 5 wget processes are running at any one time, and that the processes run in parallel. Once one group of five finishes, it yields an exit code indicating success or failure of the whole group, which I can then use to rerun if necessary. Once a group of 5 finishes completely, without errors, it moves on to the next batch.
This is the easiest way of doing this while keeping parallel download capability and the ability to rerun on failure. However, this does not guarantee that 5 wget processes are always running at any given time, as a whole batch of 5 is started at once but may finish at different times, and a new set of wget processes isn't submitted until all 5 have finished.
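Roughly, the batching described above might look like this (a sketch only; the variable names are illustrative and not the actual ssara_federated_query.bash code):

batch_size=5
for ((i = 0; i < ${#urls[@]}; i += batch_size)); do
    batch=("${urls[@]:i:batch_size}")
    # -n 1 -P 5 runs one wget per URL, up to 5 in parallel; xargs returns a
    # non-zero exit code (typically 123) if any wget in the batch failed
    until printf '%s\n' "${batch[@]}" | \
        xargs -n 1 -P "$batch_size" timeout "$timeout" wget --continue --user "$user" --password "$passwd"; do
        echo "a download in this batch failed; re-running the batch (wget --continue resumes partial files)"
    done
done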
1, 2, 3 are finished and pushed. Please verify proper outputs.
Great! Thank you so much!
A few items, as I will be working with this script for a long time and expanding on it:
- Can you make it a bit more verbose? The 18 - 500 - 6 is probably something about the number of tasks. It should show something like: average_baseline: number of running/pending tasks 200
- "Maximum number of nodes for step (fullBurst_geo2rdr) is 400" is probably meant to be "tasks"
- When it can't submit because there are too many tasks running/pending, can you show something similar to the MAX_JOBS message? Similar to:
echo "Couldnt submit job (${file}), because there are $MAX_JOBS_PER_QUEUE active jobs right now. Waiting 5 minutes to submit next job."
- When the step_tasks are set, can you do it with line continuations? I forgot the bash syntax, something like:
declare -A step_tasks=( \
    [unpack_topo_reference]=500 \
    [unpack_secondary_slc]=500 \
    [average_baseline]=500 \
    [extract_burst_overlaps]=500 \
    [overlap_geo2rdr]=70 \
    [filter_coherence]=500 \
    …
    [unwrap]=500 \
)
- We actually should use step_max_task_list (or step_max_task_array), and for a single element step_max_tasks.
- Let's use python-style variable naming: runningtasks -> running_tasks
- Let's try to use very similar variable names for the tasks as for the jobs: running_tasks -> num_active_tasks (tasks for pending jobs are also counted)
wc -l /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr
6 /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr
18 - 500 - 6
Maximum number of nodes for step (average_baseline) is 500
Number of running/pending jobs: 20
Jobs submitted: 7131633
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'RUNNING '
7131633 is complete
Job7131633 is finished.
check_job_outputs.py /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_03_average_baseline_0.job
This is the Open Source version of ISCE.
Some of the workflows depend on a separate licensed package.
To obtain the licensed package, please make a request for ISCE
through the website: https://download.jpl.nasa.gov/ops/request/index.cfm.
Alternatively, if you are a member, or can become a member of WinSAR
you may be able to obtain access to a version of the licensed sofware at
https://winsar.unavco.org/software/isce
checking: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_03_average_baseline_0.job
no error found
Jobfiles to run: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_6: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_6: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
24 - 400 - 6
Maximum number of nodes for step (fullBurst_geo2rdr) is 400
Number of running/pending jobs: 20
Jobs submitted: 7131665
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is ''
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
The above all sound like logging updates which should be easy to implement. Is this an indication that the script works as intended?
No, I don't think it works. I don't think it properly gets the number of active tasks, but this is hard to tell with the current logging. These messages look like they result from a bug:
Jobfiles to run: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
Once the messaging/logging is more meaningful I can say whether it works. How about you first fix these items and then I'll try again? If there are remaining issues I can help you run multiple workflows simultaneously, if you have not done that yet.
I just want to share this one. To test the new feature I need to have jobs running like shown here. When I try to submit new jobs it should show the number of active tasks. To test, I will temporarily reduce step_max_tasks and see whether it properly calculates the number of active tasks.
sjobs
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7132424 skx-normal run_10_f tg851601 CG 5:27 5 c486-[042,062,091,103,113]
7132425 skx-normal run_10_f tg851601 CG 5:17 5 c487-024,c488-[001,011,062,092]
7132426 skx-normal run_10_f tg851601 CG 5:22 5 c489-014,c491-[103-104],c506-[104,111]
7132427 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132428 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132429 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132430 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132431 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132432 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132433 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132434 skx-normal run_10_f tg851601 PD 0:00 5 (Priority)
7132775 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7132776 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7132777 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7132778 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7132983 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7133116 skx-normal run_08_g tg851601 PD 0:00 3 (Priority)
7132423 skx-normal run_10_f tg851601 R 5:27 5 c484-072,c485-[061,114],c486-[011,023]
And once this works we have to address https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-759107761, which is to limit the total load on the system.
I do have a plan now. For this we need to get the number of all active tasks. Would that work?
We define for each step an io_load. The steps with heavy IO get a 1, and those with little IO get a 0 or 0.1:
io_load_step1 = 1
io_load_step2 = 0.2
io_load_step4 = 1
io_load_step5 = 1
io_load_step6 = 0.5
io_load_step7 = 0.01
Before submitting a new job we calculate total_load:
total_load = num_active_tasks_step1*io_load_step1 + num_active_tasks_step2*io_load_step2 + num_active_tasks_step3*io_load_step3
We then keep total_load below a prescribed max_total_load. For example, for max_total_load=1000 we can have 400 step4 tasks and 600 step5 tasks running. Alternatively, we can have 500 step5 and 1000 step6 tasks running.
This should solve the problem. Once we have implemented this I will work with TACC to get numbers for io_load for each step.
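As a sketch, the check before submitting could look like this (the io_load numbers and step names below are placeholders until the TACC numbers exist, and num_active_tasks per step is assumed to be computed elsewhere as discussed above):

declare -A io_load=( [unpack_secondary_slc]=1 [average_baseline]=0.2 [fullBurst_geo2rdr]=1 [unwrap]=0.5 )
max_total_load=1000

total_load=0
for s in "${!io_load[@]}"; do
    tasks=${num_active_tasks[$s]:-0}        # assumed to be filled in elsewhere
    # bash integer arithmetic cannot handle 0.2 etc., so use bc for the weighted sum
    total_load=$(echo "$total_load + $tasks * ${io_load[$s]}" | bc -l)
done

if (( $(echo "$total_load >= $max_total_load" | bc -l) )); then
    echo "total IO load $total_load exceeds max_total_load=$max_total_load; waiting before submitting next job"
fi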
Make the above a separate issue.
I pushed logging updates just now. Please verify things are working. It looks like I may have made a small error in the tasks computation earlier, but it should be fine now. The logging I see checks out with the by-hand calculations I did.
It is still not working.
When I submitted there were 1 running and 1 pending run_04 jobs (with 23 and 19 tasks). It should say that there are 42 running/pending tasks, but it says 0 running/pending tasks:
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1035] sjobs
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7134453 skx-dev run_04_f tg851601 PD 0:00 1 (Priority)
7133741 skx-normal run_04_f tg851601 PD 0:00 1 (Priority)
7134008 skx-normal run_06_e tg851601 PD 0:00 1 (Priority)
7134133 skx-normal smallbas tg851601 PD 0:00 1 (Priority)
7133740 skx-normal run_04_f tg851601 R 1:04:11 1 c506-132
submit_jobs.bash $PWD --start 4
Jobfiles to run: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
fullBurst_geo2rdr: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job: 6 additional tasks
fullBurst_geo2rdr: 6 total tasks (maximum 500)
Number of running/pending jobs: 4
Jobs submitted: 7134461
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1038] scontrolddp 7133740
Command=/scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4.job
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1039] wc -l /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4
23 /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4
scontrolddp 7133741
Command=/scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5.job
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1042] wc -l /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5
19 /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5
I can't replicate that behavior. How do you generate more than 1 run file for each job step for the unittestGalapagos dataset?
Just run Kilauea if the data has downloaded? Or make a copy of unittestGalapagos to qqunittestGalapagos and run that so you have something running? I can start it for you, but it would have to be now as I have to leave in 20 minutes.
I don't get an issue running Kilauea. The task computations are correct. Are you processing multiple datasets concurrently, maybe? I haven't tried that.
I had 3 workflows running
Ok. I see the problem I think.
Can confirm I know what the problem is. The task computation assumed that all actively running jobs had run files in the same directory, which meant that I didn't have to look up job info through scontrol. I'll probably get to fixing this either late tonight or tomorrow afternoon.
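A sketch of that fix, resolving each active job's run file through scontrol instead of assuming a single directory (the Command= parsing follows the scontrol output shown above; everything else is illustrative, not the final submit_jobs.bash code):

step="fullBurst_geo2rdr"
num_active_tasks=0
while read -r jobid jobname; do
    [[ "$jobname" != *"$step"* ]] && continue
    # scontrol reports the submitted script as "Command=/path/to/run_..._N.job"
    job_script=$(scontrol show job "$jobid" | grep -oP 'Command=\K\S+')
    run_file="${job_script%.job}"            # the run file next to the .job file
    [ -f "$run_file" ] && num_active_tasks=$(( num_active_tasks + $(wc -l < "$run_file") ))
done < <(squeue -u "$USER" --noheader --format="%i %j")
echo "$step: number of running/pending tasks is $num_active_tasks"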
This should be fixed now.
Hi @Ovec8hkin, unfortunately there is still a bug. It happens when it reaches the 25 job limit (I think; unfortunately I don't have the full buffer):
/work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 47 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 71 total tasks (maximum )
Number of running/pending jobs: 2
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 48 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 1 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 25 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit n
There is a problem if many jobs are submitted (17 in this case). Just submitting 3 jobs (run_02) worked fine
submit_jobs.bash /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48 --start 8
Jobfiles to run: /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_11.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_12.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_13.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_14.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_15.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_16.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_0.job: 22 additional tasks
generate_burst_igram: 22 total tasks (maximum 500)
Number of running/pending jobs: 0
/work/05861/tg851601/stampede2/test/codedev2/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 22 (maximum )
Just curious, does this only happen when attempting to run steps that have more than 10 job files associated with them? If so, I know what the problem is.
That could be. unittestGalapagos, which has only one job file, works fine. I also tried others with 3 and 4 job files, and those worked fine as well.
Fixed this.
done
The first three items are relatively urgent.
[x] 1. submit_jobs.bash: re-running fails when there are 25 jobs in the queue (25 has been replaced by $MAX_JOBS_PER_QUEUE). We need to implement waiting for a free slot.
[x] 2. submit_jobs.bash: We are allowed to use only 10 nodes for the pairs_misreg step (run_07) because of too high IO load. So before submitting a job for this step (check for pairs_misreg and not for run_07) it needs to check how many nodes are used (jobs may use multiple nodes). In the future there will likely be similar restrictions for other steps, so please write it such that this can be easily added. (See the sketch after this list.)
[x] 3. ssara_federated_query.bash: Before starting the wget jobs it needs to check how many wget jobs are running. We are not allowed to run more than 5. I know that this is sort-of implemented in minsar_wrapper.bash, but this is not enough as we don't always use it. My account got suspended twice because of too many wget jobs.
[x] 4. Need a script for this: https://github.com/geodesymiami/rsmas_insar/issues/449
[ ] 5. check_services.py script: https://github.com/geodesymiami/rsmas_insar/issues/425
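For item 2, one possible way to count the nodes used by jobs matching a naming pattern (a sketch; the step name is illustrative):

# sum the node counts (%D) of this user's running/pending jobs whose name matches the step
step="pairs_misreg"
num_active_nodes=$(squeue -u "$USER" --noheader --format="%j %D" | \
    awk -v s="$step" '$1 ~ s {sum += $2} END {print sum + 0}')
echo "$step: number of active nodes is $num_active_nodes"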