geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

run_workflow.bash: TIMEOUT rerunning not working for smallbaseline.job #504

Open falkamelung opened 2 years ago

falkamelung commented 2 years ago

When a smallbaseline.job times out, it stalls instead of rerunning. The same applies to insarmaps.job. You can reproduce this by changing the walltime in smallbaseline.job to 00:01:00 and running:

minsarApp.bash /work2/05861/tg851601/stampede2/code/rsmas_insar/samples/unittestGalapagosSenDT128.template --start mintpy
20211003:00-10 * minsarApp.bash /work2/05861/tg851601/stampede2/code/rsmas_insar/samples/unittestGalapagosSenDT128.template --start mintpy 
copy_to_tmp  is switched ON
Flags for processing steps:
download dem jobfiles ifgrams mintpy minopy upload insarmaps
    0     0      0      0        1     0      1       1
/scratch/05861/tg851601/unittestGalapagosSenDT128
download_dir: /scratch/05861/tg851601/unittestGalapagosSenDT128/SLC
Running.... run_workflow.bash /scratch/05861/tg851601/unittestGalapagosSenDT128 --append --dostep mintpy --tmp
This is the Open Source version of ISCE.
Some of the workflows depend on a separate licensed package.
To obtain the licensed package, please make a request for ISCE
through the website: https://download.jpl.nasa.gov/ops/request/index.cfm.
Alternatively, if you are a member, or can become a member of WinSAR
you may be able to obtain access to a version of the licensed sofware at
https://winsar.unavco.org/software/isce
checking *.e, *.o from /scratch/05861/tg851601/unittestGalapagosSenDT128/insarmaps.job
no error found

Started at: 2021-10-03 00:25:11
Jobfiles to run:
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
|                      |       | Step      | Total     | Step      |         |                                                                        |  
|                      | Extra | active    | active    | processed | Active  |                                                                        |  
| File Name            | tasks | tasks     | tasks     | jobs      | jobs    | Message                                                                |  
---------------------------------------------------------------------------------------------------------------------------------------------------------
| smallbasel...per.job | 1     | 1/400     | 20/500    | 1/1       | 0/1     | Submitted: 8543444                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 8543444
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 0:01:50.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:02:12

Ovec8hkin commented 2 years ago

I am looking at this. It's a bug in submit_jobs.bash, which is not properly getting the run_file from the *.job filename for insarmaps and smallbaseline_wrapper.
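For illustration only, here is a minimal sketch of deriving a run_file base name from a *.job filename. It is not the actual submit_jobs.bash code, and the function name `run_file_from_job` is hypothetical. It strips the directory and the `.job` suffix with parameter expansion, so a full path and a bare filename give the same result.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from submit_jobs.bash.
# Derives the run_file base name from a *.job path.
run_file_from_job() {
    local job_file="$1"
    local name="${job_file##*/}"   # drop any leading directory
    printf '%s\n' "${name%.job}"   # drop the trailing .job extension
}

# Works for a full path in the project directory and for a bare filename.
run_file_from_job /scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
run_file_from_job insarmaps.job
```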

Ovec8hkin commented 2 years ago

I am presently unable to get smallbaseline_wrapper.job to time out. I have lowered the walltime to 0:00:05 and it still completes in time.

falkamelung commented 2 years ago

You probably have to remove the mintpy directory.

Ovec8hkin commented 2 years ago

@falkamelung

Should the following files all be present in the project directory:

insarmaps.job
insarmaps
smallbaseline_wrapper.job
smallbaseline_wrapper

In the same way that each *.job file in the run_files/ directory has a dedicated file with no extension alongside it?

falkamelung commented 2 years ago

No. The files without the *.job extension should not be there. I don't know how you got them.

Ovec8hkin commented 2 years ago

That's the thing. I don't have them. I was making sure that is how it is supposed to be.

falkamelung commented 2 years ago

There is no run_file associated with these two jobs. That is why they are in the project dir. They are produced directly by create_runfiles.py and don't come from ISCE.

falkamelung commented 2 years ago

I sometimes remove the project dir and just run again:

minsarApp.bash $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem

falkamelung commented 2 years ago

And I say queuedev or, on Stampede, export QUEUENAME=skx_dev so that it goes fast.

Ovec8hkin commented 2 years ago

I'm pretty sure I fixed this. It was a bug in sbatch_conditional.bash, where I wasn't properly handling full file paths for those two job files.
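As a rough illustration of that kind of path handling (not the actual fix in sbatch_conditional.bash), a script can accept a job file given either as a full path or as a bare name and normalize it to an absolute path. The function name `resolve_job_path` and its optional second argument are assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from sbatch_conditional.bash.
# Accepts a job file given as a full path or as a relative name and
# prints an absolute path, resolving relative names against a base directory.
resolve_job_path() {
    local job_file="$1"
    local base_dir="${2:-$PWD}"    # assumed default: current directory
    case "$job_file" in
        /*) printf '%s\n' "$job_file" ;;                  # already a full path
        *)  printf '%s\n' "${base_dir%/}/$job_file" ;;    # prepend the base directory
    esac
}

resolve_job_path insarmaps.job /scratch/05861/tg851601/unittestGalapagosSenDT128
```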

falkamelung commented 2 years ago

Sorry, I was away from my desk. Below is what I get. It says it resubmits, but it does not.

Jobfiles to run:
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
|                      |       | Step      | Total     | Step      |         |                                                                        |  
|                      | Extra | active    | active    | processed | Active  |                                                                        |  
| File Name            | tasks | tasks     | tasks     | jobs      | jobs    | Message                                                                |  
---------------------------------------------------------------------------------------------------------------------------------------------------------
| smallbasel...per.job | 1     | 0/1200    | 6/2000    | 1/1       | 0/1     | Submitted: 8609835                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 8609835
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 0:00:55.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:06

falkamelung commented 2 years ago

When I just run run_workflow.bash, the same thing happens:

run_workflow.bash /scratch/05861/tg851601/unittestGalapagosSenDT128 --start mintpy
Started at: 2021-10-16 16:53:39
Jobfiles to run:
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
|                      |       | Step      | Total     | Step      |         |                                                                        |  
|                      | Extra | active    | active    | processed | Active  |                                                                        |  
| File Name            | tasks | tasks     | tasks     | jobs      | jobs    | Message                                                                |  
---------------------------------------------------------------------------------------------------------------------------------------------------------
| smallbasel...per.job | 1     | 0/1200    | 6/2000    | 1/1       | 0/1     | Submitted: 8609848                                                     |
---------------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 8609848
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 00:00:55.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:06

Ovec8hkin commented 2 years ago

I'm still testing. Will push tonight.

Ovec8hkin commented 2 years ago

Pushed

falkamelung commented 2 years ago

Great! It works. Thank you! The only thing is that it says a strange job state was encountered. This is a bit scary. Is this by design or a bug?

Jobs submitted: 8610595
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 00:01:19.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:34
Resubmitted as jobumber: 8610603.
Strange job state: ----------, encountered.
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 0 RUNNING , 0 PENDING , 1 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING   .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 1 COMPLETED , 0 RUNNING , 0 PENDING , 0 WAITING   .
check_job_outputs.py  /scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job --tmp

Ovec8hkin commented 2 years ago

This was originally a line meant for debugging purposes, but I have retained it so I can verify whether anything wacky occurs. A job very briefly has no state associated with it immediately after being submitted, and this is what gets caught when a job is resubmitted after a TIMEOUT. It's consistent and not something to be concerned about.
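A minimal sketch of how such a transient state could be tolerated, written for illustration rather than copied from the repository: the function name `classify_job_state` is hypothetical, and mapping an empty or dashed state to WAITING is an assumption based on the status line that follows the message in the log above.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not the project's actual code.
# Maps a raw scheduler state string to one of the categories
# printed in the status lines (COMPLETED, RUNNING, PENDING, WAITING).
classify_job_state() {
    local state="$1"
    case "$state" in
        COMPLETED) echo COMPLETED ;;
        RUNNING)   echo RUNNING ;;
        PENDING)   echo PENDING ;;
        TIMEOUT)   echo TIMEOUT ;;
        ""|-*)
            # Assumption: a freshly (re)submitted job may briefly report no
            # state; log it to stderr and count the job as WAITING.
            echo "Strange job state: $state, encountered." >&2
            echo WAITING ;;
        *)  echo UNKNOWN ;;
    esac
}

classify_job_state "----------"
```

Writing the diagnostic to stderr keeps it visible without polluting the value that a caller captures from stdout.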