geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

Items for Josh #450

Closed falkamelung closed 3 years ago

falkamelung commented 3 years ago

The first three items are relatively urgent.

Ovec8hkin commented 3 years ago

So the requested options are: --max_jobs, --sleep, --job_name, and --max_nodes?

falkamelung commented 3 years ago

That is true. We could have a submit_jobs_limits.cfg in the /defaults directory, in a format similar to https://github.com/geodesymiami/rsmas_insar/blob/master/minsar/defaults/job_defaults.cfg

It will only have a few entries, but it can grow as needed. Eventually the cfg files will be merged. You could write the .cfg reader such that it will be easy to switch to a format similar to:

------------------------------------------------------------------------------------------------------
name                               c_walltime  s_walltime  c_memory s_memory  num_threads   max_tasks
-----------------------------------------------------------------------------------------------------
default                              02:00:00        0       3000       0         2            -

# topsStack

unpack_topo_reference                    0      00:01:00     4000       0         8            -
unpack_secondary_slc                     0      00:00:10     4000       0         2            500
average_baseline                         0      00:00:10     1000       0         2            -
extract_burst_overlaps                   0      00:00:10     4000       0         2            800
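
A possible way to read one row of such a table in bash, just as a sketch (the cfg file name follows the suggestion above and is a placeholder; column 7 is max_tasks per the header):

# hypothetical parsing example; the cfg path is a placeholder
cfg="minsar/defaults/submit_jobs_limits.cfg"
step="unpack_secondary_slc"
# comment and separator lines never match the step name in column 1
max_tasks=$(awk -v s="$step" '$1 == s {print $7}' "$cfg")
echo "${step}: max_tasks=${max_tasks}"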
Ovec8hkin commented 3 years ago

@falkamelung There is no easy way to get a listing of active jobs by name from squeue. squeue -n can list jobs that exactly match the given name, but it doesn't support pattern matching (so I can't do squeue -n *pairs_misreg*). Please determine how to get the number of active nodes for ONLY jobs matching the desired naming pattern.
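
The closest I can get is listing job names via squeue's --format option and grepping for the pattern (untested sketch):

# count this user's active jobs whose name contains the pattern,
# since squeue -n only matches exact names
pattern="pairs_misreg"
num_jobs=$(squeue -u "$USER" --noheader --format="%j" | grep -c "$pattern")
echo "active jobs matching '${pattern}': ${num_jobs}"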

Additionally, please elaborate on the tasks computation. What is the maximum number, how do I compute the current number of active tasks?

falkamelung commented 3 years ago

Did you see https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-757153866 ?

You grep the file name and then run wc -l. For nodes you could do something similar, but we better do tasks, and I would think it works.
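
Roughly what I mean, as a sketch (step name and run_files path are just examples):

# each line of a run file is one task, so summing line counts gives the
# task count for a step
step="average_baseline"
num_tasks=0
for f in run_files/run_*_"${step}"_*; do
    [[ "$f" == *.job ]] && continue   # skip the corresponding .job files
    [[ -f "$f" ]] && num_tasks=$(( num_tasks + $(wc -l < "$f") ))
done
echo "${step}: ${num_tasks} tasks"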

Ovec8hkin commented 3 years ago

Yes I saw the previous comment.

Is the number of tasks independent of the number of nodes utilized? What is the maximum number of tasks that can be submitted? How do I check the number of currently running tasks (reading all of the files isn't very efficient and seems overcomplicated)? Why can't we use nodes anymore?

falkamelung commented 3 years ago

The number of tasks running in each job varies depending on the memory requirement. Sometimes all 48 cores are used and sometimes only 10. TACC asked me to limit the number of jobs, but it is the number of simultaneous tasks that matters in terms of IO load.

The max_tasks depends on the run_0* step. For now just use 500.

Yes, it is not efficient. But I don't know how else to do it. Another option would be to store all the relevant job parameters in one file which is read by submit_jobs.bash, but this sounds more complicated.

falkamelung commented 3 years ago

We could generate a file with the number of tasks and read it at the beginning of submit_jobs.bash:

wc -l  run_*_{0,1,2,3,4,5,6,,8,9} | sort -k2,2
wc: run_*_: No such file or directory
     1 run_01_unpack_topo_reference_0
    37 run_02_unpack_secondary_slc_0
    37 run_02_unpack_secondary_slc_1
    37 run_02_unpack_secondary_slc_2
    34 run_02_unpack_secondary_slc_3
    37 run_03_average_baseline_0
    37 run_03_average_baseline_1
    37 run_03_average_baseline_2
    34 run_03_average_baseline_3
    21 run_04_fullBurst_geo2rdr_0
    21 run_04_fullBurst_geo2rdr_1
    21 run_04_fullBurst_geo2rdr_2
    21 run_04_fullBurst_geo2rdr_3
    21 run_04_fullBurst_geo2rdr_4
    21 run_04_fullBurst_geo2rdr_5
    19 run_04_fullBurst_geo2rdr_6
    21 run_05_fullBurst_resample_0
    21 run_05_fullBurst_resample_1
    21 run_05_fullBurst_resample_2
    21 run_05_fullBurst_resample_3
    21 run_05_fullBurst_resample_4
    21 run_05_fullBurst_resample_5
    19 run_05_fullBurst_resample_6
     1 run_06_extract_stack_valid_region_0
    38 run_07_merge_reference_secondary_slc_0
    38 run_07_merge_reference_secondary_slc_1
    38 run_07_merge_reference_secondary_slc_2
    38 run_07_merge_reference_secondary_slc_3
    48 run_08_generate_burst_igram_0
    48 run_08_generate_burst_igram_1
    48 run_08_generate_burst_igram_2
    48 run_08_generate_burst_igram_3
    48 run_08_generate_burst_igram_4
    48 run_08_generate_burst_igram_5
    48 run_08_generate_burst_igram_6
    48 run_08_generate_burst_igram_8
    48 run_08_generate_burst_igram_9
    48 run_09_merge_burst_igram_0
    48 run_09_merge_burst_igram_1
    48 run_09_merge_burst_igram_2
    48 run_09_merge_burst_igram_3
    48 run_09_merge_burst_igram_4
    48 run_09_merge_burst_igram_5
    48 run_09_merge_burst_igram_6
    48 run_09_merge_burst_igram_8
    48 run_09_merge_burst_igram_9
    48 run_10_filter_coherence_0
    48 run_10_filter_coherence_1
    48 run_10_filter_coherence_2
    48 run_10_filter_coherence_3
    48 run_10_filter_coherence_4
    48 run_10_filter_coherence_5
    48 run_10_filter_coherence_6
    48 run_10_filter_coherence_8
    48 run_10_filter_coherence_9
    48 run_11_unwrap_0
    48 run_11_unwrap_1
    48 run_11_unwrap_2
    48 run_11_unwrap_3
    48 run_11_unwrap_4
    48 run_11_unwrap_5
    48 run_11_unwrap_6
    48 run_11_unwrap_8
    48 run_11_unwrap_9
Ovec8hkin commented 3 years ago

Ok. 1 and 2 are done (not pushed yet).

How do I check the number of active wget processes running?

Ovec8hkin commented 3 years ago

re 3: What is a valid command line call to ssara_federated_query.bash? I don't know what the command line options are. Send examples for both "unittestGalapagosDT128" and "KilaueaSenAT124".

falkamelung commented 3 years ago

For Kilauea:

ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download

unittestGalapagos does not do any download, but $SAMPLESDIR/GalapagosSenDT128.template does (it is the same as unittestGalapagos but with download)

minsarApp.bash $SAMPLESDIR/GalapagosSenDT128.template

The download command is in ssara_command.txt

Ovec8hkin commented 3 years ago

Ok. Is it simple enough to run ps aux | grep "wget" | wc -l to determine the number of active wget processes?

falkamelung commented 3 years ago

I think so. You have to remove the timeout processes for counting.
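
Something along these lines, I would guess (just a sketch):

# "[w]get" keeps the grep itself out of the match; the timeout wrapper
# processes (whose command line also contains "wget") are excluded explicitly
num_wget=$(ps aux | grep "[w]get" | grep -vc "timeout")
echo "active wget downloads: ${num_wget}"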

Ovec8hkin commented 3 years ago

Is there a reason we don't do something like this:

for u in "${urls[@]}"; do
     timeout "$timeout" wget --continue --user "$user" --password "$passwd" "$u"
done

but instead run the wget command through xargs? I know initially it was because xargs allows us to run wget in parallel, but it seems like a simple for loop would let us do the same thing (it would just spawn additional, independent wget processes). I don't think we can use the xargs syntax in conjunction with wget limits, as the xargs syntax submits all of the wget requests at once and just waits for them to become active.

falkamelung commented 3 years ago

I think so. The xargs command worked, but my impression was that it does not give the flexibility we need.

Ovec8hkin commented 3 years ago

Ok. I will translate to the above for-loop syntax then. Do we need to check the exit status of each and every wget process for failure? That becomes complicated when lots of processes are running in parallel.

falkamelung commented 3 years ago

Hmmm.... The wget failures were always a problem, but lately downloads have worked fine. I am not sure whether this is because the data source was stable or because of our code; I suspect the former. I am not positive that our code works. Even when downloading worked fine, I see messages like:

20210114:13-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download 
20210114:13-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=124 --intersectsWith='Polygon((-155.80 19.00, -155.80 20.00, -154.50 20.00, -154.50 19.00, -155.80 19.00))' --end=2019-12-31 --parallel=5 --print --download 
20210114:13-01 * Something went wrong. Exit code was 123. Trying again with  second timeout

So, we probably can get rid of the exit_code checking, but can we have another way to check whether it was successful? Maybe the --spider option can be used for this: https://stackoverflow.com/questions/6264726/can-i-use-wget-to-check-but-not-download
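
Something like this, perhaps (a sketch; the URL is a placeholder):

# --spider checks the URL without downloading; the exit status then tells
# whether the remote file is reachable
url="https://example.com/some_granule.zip"   # placeholder
if wget --spider --quiet --user "$user" --password "$passwd" "$url"; then
    echo "reachable: $url"
else
    echo "NOT reachable: $url"
fi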

Also could be that the exit code checking was linked to the timeout command.

We also have a script check_downloads.py --delete which I use when things don't work. This could be run inside ssara_federated_query.bash. I was also considering running it in minsarApp.bash.

falkamelung commented 3 years ago

I probably need to take back that the exit_code checking is not working. I do see some downloads where ssara was run again. I might have done that manually, but I don't think so.

cat KokoxiliBigSenAT41/SLC/log
20210111:112010 * ssara_federated_query.py --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --kml --maxResults=20000
20210111:11-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
20210111:14-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210111:14-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210111:14-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
20210111:21-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210112:05-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=41 --intersectsWith='Polygon((87.00 29.50, 87.00 42.50, 98.00 42.50, 98.00 29.50, 87.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
cat ManyiBigSenAT12/SLC/log
20210110:223210 * ssara_federated_query.py --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --kml --maxResults=20000
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210110:22-01 * Something went wrong. Exit code was 123. Trying again with  second timeout
20210110:22-01 * ssara_federated_query.bash --platform=SENTINEL-1A,SENTINEL-1B --relativeOrbit=12 --intersectsWith='Polygon((80.00 29.50, 80.00 41.50, 94.00 41.50, 94.00 29.50, 80.00 29.50))' --start=2014-01-01 --parallel=5 --print --download 
20210111:08-01 * Something went wrong. Exit code was 123. Trying again with  second timeout

Proper checking for download success and re-downloading remains a key feature of our software.

Ovec8hkin commented 3 years ago

I found a way to link the wget PID to each individual download file, so I should be able to check whether each PID exited cleanly or not, and if not redo the failed download.
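
Roughly like this (a sketch; it starts everything at once, so limiting to 5 parallel downloads would need extra bookkeeping):

# start each wget in the background, remember which URL belongs to which PID,
# then wait on every PID and collect the failed URLs for a rerun
declare -A pid_to_url
for u in "${urls[@]}"; do
    timeout "$timeout" wget --continue --user "$user" --password "$passwd" "$u" &
    pid_to_url[$!]="$u"
done
failed_urls=()
for pid in "${!pid_to_url[@]}"; do
    wait "$pid" || failed_urls+=("${pid_to_url[$pid]}")
done
echo "failed downloads: ${failed_urls[*]:-none}"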

Ovec8hkin commented 3 years ago

Ok. I think I got it figured out now (some tests are currently running to check). What I ended up doing is batching the list of download URLs into 5-element groups and passing each group of 5 to wget through xargs to run in parallel (5 at a time). This ensures that only 5 wget processes are running at any one time, and that the processes are running in parallel. Once one group of five finishes, it yields an exit code indicating success or failure of the whole group, which I can then use to rerun if necessary. Once a group of 5 finishes completely, without errors, it moves on to the next batch.

This is the easiest way of doing this while keeping parallel download capability and the ability to rerun on failure. However, it does not guarantee that 5 wget processes are always running at any given time: a whole batch of 5 is started at once but may finish at different times, and a new set of wget processes isn't submitted until all 5 are finished.
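
In sketch form, the batching looks roughly like this (variable names are illustrative):

# xargs runs up to 5 wget processes in parallel per batch; a nonzero xargs
# exit status (e.g. 123) means at least one download in the batch failed,
# so the whole batch is retried (--continue skips already-finished files)
batch_size=5
for ((i = 0; i < ${#urls[@]}; i += batch_size)); do
    batch=("${urls[@]:i:batch_size}")
    until printf '%s\n' "${batch[@]}" | \
          xargs -n 1 -P "$batch_size" wget --continue --user "$user" --password "$passwd"; do
        echo "batch starting at index $i failed; retrying"
        sleep 60
    done
done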

Ovec8hkin commented 3 years ago

1, 2, 3 are finished and pushed. Please verify proper outputs.

falkamelung commented 3 years ago

Great! Thank you so much !

A few items, as I will be working with this script for a long time and expanding on it:

  1. Can you make it a bit more verbose? The 18 - 500 - 6 is probably something about the number of tasks. It should show something like: average_baseline: number of running/pending tasks 200

  2. "Maximum number of nodes for step (fullBurst_geo2rdr) is 400" should probably say "tasks"

  3. When it can't submit because there are too many tasks running/pending, can you show a message similar to the MAX_JOBS one? Similar to: echo "Couldnt submit job (${file}), because there are $MAX_JOBS_PER_QUEUE active jobs right now. Waiting 5 minutes to submit next job."

  4. When step_tasks is set, can you use line continuations? I forgot the bash syntax; something like:

declare -A step_tasks=(      \
[unpack_topo_reference]=500  \
[unpack_secondary_slc]=500   \
[average_baseline]=500       \
[extract_burst_overlaps]=500 \
[overlap_geo2rdr]=70         \
[filter_coherence]=500       \
…
[unwrap]=500                 \
)

We actually should use step_max_task_list (or step_max_task_array), and for a single element step_maxtasks.

  5. Let’s use Python-style variable naming: runningtasks —> running_tasks

  6. Let’s try to use very similar variable names for the tasks as for the jobs:

running_tasks —> num_active_tasks (tasks for pending jobs are also counted)

  7. The grep of job_tasks does not seem to work. In the example below it should find:
    wc -l /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr
    6 /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr
18 - 500 - 6
Maximum number of nodes for step (average_baseline) is 500
Number of running/pending jobs: 20
Jobs submitted: 7131633
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_03_average_baseline_0.job, 7131633 is not finished yet. Current state is 'RUNNING '
7131633 is complete
Job7131633 is finished.
check_job_outputs.py  /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_03_average_baseline_0.job
This is the Open Source version of ISCE.
Some of the workflows depend on a separate licensed package.
To obtain the licensed package, please make a request for ISCE
through the website: https://download.jpl.nasa.gov/ops/request/index.cfm.
Alternatively, if you are a member, or can become a member of WinSAR
you may be able to obtain access to a version of the licensed sofware at
https://winsar.unavco.org/software/isce
checking:  /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_03_average_baseline_0.job
no error found
Jobfiles to run: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_6: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_6: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_1: No such file or directory
24 - 400 - 6
Maximum number of nodes for step (fullBurst_geo2rdr) is 400
Number of running/pending jobs: 20
Jobs submitted: 7131665
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is ''
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
qqunittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7131665 is not finished yet. Current state is 'PENDING '
Ovec8hkin commented 3 years ago

The above all sound like logging updates, which should be easy to implement. Is this an indication that the script works as intended?

falkamelung commented 3 years ago

No, I don't think it works. I don't think it properly gets the number of active tasks, but this is hard to tell with the current logging. These messages look like they result from a bug:

Jobfiles to run: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_2: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_3: No such file or directory
cat: /scratch/05861/tg851601/qqunittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory

Once the messaging/logging is more meaningful I can say whether it works. How about you first fix these items and then I try again? If there are remaining issues, I can help you run multiple workflows simultaneously if you have not done that yet.

falkamelung commented 3 years ago

I just want to share this one. To test the new feature I need to have jobs running, as shown here. When I try to submit new jobs it should show the number of active tasks. To test, I will temporarily reduce step_max_tasks and see whether it properly calculates the number of active tasks.

 sjobs
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7132424  skx-normal run_10_f tg851601 CG       5:27      5 c486-[042,062,091,103,113]
           7132425  skx-normal run_10_f tg851601 CG       5:17      5 c487-024,c488-[001,011,062,092]
           7132426  skx-normal run_10_f tg851601 CG       5:22      5 c489-014,c491-[103-104],c506-[104,111]
           7132427  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132428  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132429  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132430  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132431  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132432  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132433  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132434  skx-normal run_10_f tg851601 PD       0:00      5 (Priority)
           7132775  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7132776  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7132777  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7132778  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7132983  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7133116  skx-normal run_08_g tg851601 PD       0:00      3 (Priority)
           7132423  skx-normal run_10_f tg851601  R       5:27      5 c484-072,c485-[061,114],c486-[011,023]
falkamelung commented 3 years ago

And once this works we have to address https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-759107761, which is to limit the total load on the system.

I do have a plan now. For this we need to get the number of all active tasks. Would that work?

We define an io_load for each step. The steps with heavy IO get a 1 and those with little IO get a 0 or 0.1:

io_load_step1 = 1
io_load_step2 = 0.2
io_load_step4 = 1
io_load_step5 = 1
io_load_step6 = 0.5
io_load_step7 = 0.01

Before submitting a new job we calculate total_load:

total_load = num_active_tasks_step1*io_load_step1 +  num_active_tasks_step2*io_load_step2 + num_active_tasks_step3*io_load_step3

We then keep total_load below a prescribed max_total_load. For example, for max_total_load=1000 we can have 400 step4 tasks and 600 step5 tasks running. Alternatively we can have 500 step5 and 1000 step6 tasks running.

This should solve the problem. Once we have implemented this I will work with TACC to get numbers for io_load for each step.
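
In pseudo-bash the check could look like this (a sketch; num_active_tasks_for_step is a hypothetical helper, and the io_load values are only the example numbers above):

# keep a weighted sum of active tasks below max_total_load before submitting
declare -A io_load=( [step1]=1 [step2]=0.2 [step4]=1 [step5]=1 [step6]=0.5 [step7]=0.01 )
max_total_load=1000
total_load=0
for step in "${!io_load[@]}"; do
    num_tasks=$(num_active_tasks_for_step "$step")   # hypothetical helper
    total_load=$(echo "$total_load + $num_tasks * ${io_load[$step]}" | bc -l)
done
if (( $(echo "$total_load < $max_total_load" | bc -l) )); then
    echo "total_load=$total_load, OK to submit"
else
    echo "total_load=$total_load exceeds $max_total_load, waiting"
fi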

Ovec8hkin commented 3 years ago

Make the above a separate issue.

Ovec8hkin commented 3 years ago

I pushed logging updates just now. Please verify things are working. It looks like I may have made a small error in the tasks computation earlier, but it should be fine now. The logging I see checks out with the by-hand calculations I did.

falkamelung commented 3 years ago

It is still not working.

When I submitted, there were 1 running and 1 pending run_04 job (with 23 and 19 tasks). It should say that there are 42 running/pending tasks, but it says 0 running/pending tasks.

//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1035] sjobs
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7134453     skx-dev run_04_f tg851601 PD       0:00      1 (Priority)
           7133741  skx-normal run_04_f tg851601 PD       0:00      1 (Priority)
           7134008  skx-normal run_06_e tg851601 PD       0:00      1 (Priority)
           7134133  skx-normal smallbas tg851601 PD       0:00      1 (Priority)
           7133740  skx-normal run_04_f tg851601  R    1:04:11      1 c506-132

submit_jobs.bash $PWD --start 4

Jobfiles to run: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job
cat: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_5: No such file or directory
cat: /scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_4: No such file or directory
fullBurst_geo2rdr: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/unittestGalapagosSenDT128/run_files/run_04_fullBurst_geo2rdr_0.job: 6 additional tasks
fullBurst_geo2rdr: 6 total tasks (maximum 500)
Number of running/pending jobs: 4
Jobs submitted: 7134461
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
unittestGalapagosSenDT128 run_04_fullBurst_geo2rdr_0.job, 7134461 is not finished yet. Current state is 'PENDING '
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1038] scontrolddp 7133740
   Command=/scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4.job
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1039] wc -l /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4
23 /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_4
scontrolddp 7133741
   Command=/scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5.job
//login3/scratch/05861/tg851601/unittestGalapagosSenDT128[1042] wc -l /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5
19 /scratch/05861/tg851601/KokoxiliBigChunk33SenDT48/run_files/run_04_fullBurst_geo2rdr_5
Ovec8hkin commented 3 years ago

I can't replicate that behavior. How do you generate more than 1 run file for each job step for the unittestGalapagos dataset?

falkamelung commented 3 years ago

Just run Kilauea if the data has downloaded? Or make a copy of unittestGalapagos to qqunittestGalapagos and run that so you have something running? I can start it for you, but it would have to be now as I have to leave in 20 minutes.

Ovec8hkin commented 3 years ago

I don't get an issue running Kilauea. The task computations are correct. Are you processing multiple datasets concurrently, maybe? I haven't tried that.

falkamelung commented 3 years ago

I had 3 workflows running

Ovec8hkin commented 3 years ago

Ok. I see the problem I think.

Ovec8hkin commented 3 years ago

Can confirm I know what the problem is. The task computation assumed that all actively running jobs had run files in the same directory, which meant that I didn't have to look up job info through scontrol. I'll probably get to fixing this either late tonight or tomorrow afternoon.
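
The fix will be roughly along these lines (a sketch; assumes GNU grep and standard squeue/scontrol output):

# for every active job, ask scontrol for the submitted .job file, derive the
# corresponding run file, and sum its line count, so tasks from other working
# directories are counted too
num_active_tasks=0
for jobid in $(squeue -u "$USER" --noheader --format="%i"); do
    jobfile=$(scontrol show job "$jobid" | grep -oP 'Command=\K\S+')
    runfile="${jobfile%.job}"    # one task per line in the run file
    [[ -f "$runfile" ]] && num_active_tasks=$(( num_active_tasks + $(wc -l < "$runfile") ))
done
echo "running/pending tasks across all workflows: ${num_active_tasks}"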

Ovec8hkin commented 3 years ago

This should be fixed now.

falkamelung commented 3 years ago

Hi @Ovec8hkin, unfortunately there is still a bug. It happens when it reaches the 25-job limit (I think; unfortunately I don't have the full buffer).

work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 47 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 71 total tasks (maximum )
Number of running/pending jobs: 2
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 48 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/code/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 1 (maximum )
/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job: 24 additional tasks
: 25 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBig37SenAT70/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit n
falkamelung commented 3 years ago

There is a problem if many jobs are submitted (17 in this case). Just submitting 3 jobs (run_02) worked fine.

submit_jobs.bash /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48 --start 8
Jobfiles to run: /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_11.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_12.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_13.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_14.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_15.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_16.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/wd2KokoxiliBigChunk37SenDT48/run_files/run_08_generate_burst_igram_0.job: 22 additional tasks
generate_burst_igram: 22 total tasks (maximum 500)
Number of running/pending jobs: 0
/work/05861/tg851601/stampede2/test/codedev2/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
ç: number of running/pending tasks is 22 (maximum )
Ovec8hkin commented 3 years ago

Just curious, does this only happen when attempting to run steps that have more than 10 job files associated with them? If so, I know what the problem is.
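
If so, the step-name parsing probably chokes on two-digit file numbers. Purely as an illustration (not the actual submit_jobs.bash code):

# recover the step name from a run-file name in a way that also handles
# two-digit suffixes like _10, so the step_max_task_list lookup never gets
# an empty key
for f in run_08_generate_burst_igram_9.job run_08_generate_burst_igram_10.job; do
    base=$(basename "$f" .job)
    step=$(echo "$base" | sed -E 's/^run_[0-9]+_//; s/_[0-9]+$//')
    echo "$f -> $step"    # -> generate_burst_igram
done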

falkamelung commented 3 years ago

This could be. unittestGalapagos, which has only one job file, works fine. I also tried others with 3 and 4 job files that worked fine as well.

Ovec8hkin commented 3 years ago

Fixed this.

falkamelung commented 3 years ago

done