Open falkamelung opened 3 years ago
I am looking at this. It's a bug in submit_jobs.bash: it is not properly getting the run_file from the *.job filename for insarmaps and smallbaseline_wrapper.
I am presently unable to get smallbaseline_wrapper.job to time out. I have lowered the wall time to 0:00:05 and it still completes in time.
You probably have to remove the mintpy directory.
@falkamelung
Should the following files all be present in the project directory:
insarmaps.job
insarmaps
smallbaseline_wrapper.job
smallbaseline_wrapper
In the same way that each *.job file in the run_files/ directory has a dedicated file with no extension alongside it?
No. The files without the .job extension should not be there. I don't know how you got this.
That's the thing. I don't have this. I was making sure that was how it is supposed to be.
There is no run_file associated with these two jobs. That is why they are in the project dir. They are produced directly by create_runfiles.py and don't come from ISCE.
I sometimes remove the project dir and just run again:
minsarApp.bash $SAMPLESDIR/unittestGalapagosSenDT128.template --start dem
And I say queuedev or, on stampede, export QUEUENAME=skx_dev so that it goes fast.
I'm pretty sure I fixed this. It was a bug in sbatch_conditional.bash where I wasn't properly handling full file paths for those two job files.
Sorry, I was away from my desk. Below is what I get. It says it resubmits, but it does not.
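For anyone following along, here is a minimal Python sketch of the path handling discussed above. It is not the actual code in submit_jobs.bash or sbatch_conditional.bash. The function names and the example file name run_01_example_0.job are hypothetical, and the only rule it encodes is the one stated in this thread: a *.job file inside run_files/ has a companion file with no extension, while the two project-dir jobs have none.

```python
from pathlib import Path
from typing import Optional

JOB_SUFFIX = ".job"


def job_step_name(job_file: str) -> str:
    """Return the step name for a job file given as a bare name or a full path."""
    name = Path(job_file).name  # drop any leading directories
    return name[: -len(JOB_SUFFIX)] if name.endswith(JOB_SUFFIX) else name


def companion_run_file(job_file: str) -> Optional[Path]:
    """Return the extension-less run_file next to a job in run_files/, else None.

    Assumption: only jobs whose parent directory is named run_files have one.
    """
    path = Path(job_file)
    if path.parent.name != "run_files":
        return None  # e.g. smallbaseline_wrapper.job and insarmaps.job
    return path.with_suffix("")


project = "/scratch/05861/tg851601/unittestGalapagosSenDT128"
print(job_step_name(f"{project}/smallbaseline_wrapper.job"))
print(companion_run_file(f"{project}/smallbaseline_wrapper.job"))
print(companion_run_file(f"{project}/run_files/run_01_example_0.job"))
```

Taking the base name first means a bare file name and a full path resolve to the same step name.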
Jobfiles to run:
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
| | | Step | Total | Step | | |
| | Extra | active | active | processed | Active | |
| File Name | tasks | tasks | tasks | jobs | jobs | Message |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| smallbasel...per.job | 1 | 0/1200 | 6/2000 | 1/1 | 0/1 | Submitted: 8609835 |
---------------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 8609835
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 0:00:55.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:06
When I just run run_workflow.bash, the same thing happens:
run_workflow.bash /scratch/05861/tg851601/unittestGalapagosSenDT128 --start mintpy
Started at: 2021-10-16 16:53:39
Jobfiles to run:
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
| | | Step | Total | Step | | |
| | Extra | active | active | processed | Active | |
| File Name | tasks | tasks | tasks | jobs | jobs | Message |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| smallbasel...per.job | 1 | 0/1200 | 6/2000 | 1/1 | 0/1 | Submitted: 8609848 |
---------------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 8609848
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 00:00:55.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:06
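As a side note, the resubmission messages in this thread (0:00:55 to 00:01:06, and in a later run 00:01:19 to 00:01:34) are consistent with the walltime being multiplied by 1.2 and truncated to whole seconds. That factor is inferred from the logs only and is not confirmed by the scripts. A small Python sketch under that assumption, with hypothetical function names:

```python
def parse_walltime(walltime: str) -> int:
    """Convert H:MM:SS or HH:MM:SS to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_walltime(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def extended_walltime(walltime: str) -> str:
    """Return the walltime for a resubmission.

    Assumption: the new walltime is 1.2x the old one, truncated to whole
    seconds. Integer arithmetic (x * 6 // 5) avoids float rounding.
    """
    return format_walltime(parse_walltime(walltime) * 6 // 5)


print(extended_walltime("0:00:55"))
print(extended_walltime("00:01:19"))
```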
I'm still testing. Will push tonight.
Pushed
Great! It works. Thank you! Only it says "strange job state encountered". This is a bit scary. Is this by design or a bug?
Jobs submitted: 8610595
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job timedout with walltime of 00:01:19.
Resubmitting file (/scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job) with new walltime of 00:01:34
Resubmitted as jobumber: 8610603.
Strange job state: ----------, encountered.
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 0 RUNNING , 0 PENDING , 1 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 0 COMPLETED , 1 RUNNING , 0 PENDING , 0 WAITING .
unittestGalapagosSenDT128, smallbaseline_wrapper, 1 jobs : 1 COMPLETED , 0 RUNNING , 0 PENDING , 0 WAITING .
check_job_outputs.py /scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job --tmp
This was originally a line meant for debugging purposes, but I have retained it so I can verify if anything wacky occurs. A job very briefly has no state associated with it immediately after being submitted, and this is what gets caught when a job is resubmitted after TIMEOUT. It's consistent and not something to be concerned about.
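To illustrate the idea, here is a rough Python sketch of tolerating that transient empty state while polling. It is not the code in the repository. The function names are hypothetical, and counting a blank or dashes-only state as WAITING is an assumption based on the status line printed right after the message in the log above.

```python
from collections import Counter
from typing import Iterable

KNOWN_STATES = {"COMPLETED", "RUNNING", "PENDING", "TIMEOUT"}


def normalize_state(raw: str) -> str:
    """Map a raw scheduler state string to a state used in the status line."""
    state = raw.strip().upper()
    # Right after (re)submission a job may report no state yet, which can
    # show up as an empty string or a run of dashes. Assumption: count it
    # as WAITING and look again on the next poll.
    if not state or set(state) == {"-"}:
        return "WAITING"
    return state if state in KNOWN_STATES else "UNKNOWN"


def summarize(raw_states: Iterable[str]) -> Counter:
    """Tally normalized states for one polling round."""
    return Counter(normalize_state(raw) for raw in raw_states)


counts = summarize(["----------"])
print(counts["COMPLETED"], counts["RUNNING"], counts["PENDING"], counts["WAITING"])
```

With this handling a stateless job does not stop the polling loop. It is reported as waiting until the scheduler assigns a real state.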
When a smallbaseline.job times out it stalls instead of rerunning. The same applies to insarmaps.job. You can reproduce this by changing the walltime in smallbaseline.job to 00:01:00 and running: