Closed: falkamelung closed this issue 3 years ago.
This is simply akin to summing the total number of active tasks across all job steps and all projects, correct?
Yes.
Are you ever going to hit 1500 tasks if you also have the MAX_JOBS_PER_QUEUE check? I think that either the MAX_JOBS check or the stepwise max task check will catch an obscene number of tasks before a global maximum will.
In testing, the most tasks I can get to submit at once (with both the MAX_JOBS and MAX_TASKS_STEP checks) is <400.
Also, I recommend you make a file somewhere that contains all of the step restrictions (number of tasks, IO load constants, etc.). It'll be much more sustainable to source them from a file than to have them hardcoded.
Yes, we will. One dataset may have 20,000 tasks simultaneously. On Stampede we never reach this because jobs sit pending, but on Frontera it may happen that you have tens of jobs running simultaneously (for example, right after somebody else's big job finishes).
The MAX_JOBS_PER_QUEUE is a system restriction. The standard on Frontera is 100 jobs. If each job runs 1000 tasks we end up with 100,000 tasks simultaneously....
During the TEXASCALE days you are allowed to run only if your code can run more than 2000 jobs simultaneously....
Ok. Implemented and pushed.
Regarding specifying the task limits, did you see the comment https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-756381093 ? This is how I think we should specify the load for each task. The MAX_TASKS_TOTAL will be specified in https://github.com/geodesymiami/rsmas_insar/blob/master/minsar/defaults/queues.cfg
But I first have to use it to be sure that it works as I envision it.
Yes, that is fine, though I think such a file should only contain data needed for limiting job submissions (so no walltime stuff). I would need you to write the file with the appropriate step names, task maximums, and IO loads.
Yes, will do. It probably will be the same file. There are different max_tasks for different queues as they use different hardware.
We should use the script that you created for platform_defaults.
I can't do exactly the same thing, as we wouldn't want a unique environment variable for each step. I can probably use read_platform_defaults as a base, though. Likely I'll read the file into an associative array in bash and use it like that.
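Something along these lines would work (a minimal sketch only; the file name step_resources.cfg and its two-column "step max_tasks" format are illustrative assumptions, not the actual repository layout):
#!/bin/bash
# Hypothetical config file "step_resources.cfg" with lines such as:
#   unpack_secondary_slc 500
#   filter_coherence     500
declare -A step_max_task_list

# Read "stepname max_tasks" pairs, skipping blank lines and comments
while read -r step max_tasks; do
    [[ -z "$step" || "$step" == \#* ]] && continue
    step_max_task_list[$step]=$max_tasks
done < step_resources.cfg

echo "max tasks for unpack_secondary_slc: ${step_max_task_list[unpack_secondary_slc]}"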
Hi @Ovec8hkin, it is not fully correct yet. It currently only checks against the per-step maximum (500 in this case). I did not see total_max_tasks being set anywhere.
Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_0.job /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_1.job /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_2.job
unpack_secondary_slc: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_0.job: 45 additional tasks
unpack_secondary_slc: 45 total tasks (maximum 500)
Number of running/pending jobs: 3
unpack_secondary_slc: number of running/pending tasks is 45 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_1.job: 45 additional tasks
unpack_secondary_slc: 90 total tasks (maximum 500)
Number of running/pending jobs: 4
unpack_secondary_slc: number of running/pending tasks is 90 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_2.job: 43 additional tasks
unpack_secondary_slc: 133 total tasks (maximum 500)
Number of running/pending jobs: 5
Jobs submitted: 7147543 7147544 7147545
KokoxiliBigChunk30SenDT48 run_02_unpack_secondary_slc_2.job, 7147543 is not finished yet. Current state is 'PENDING '
However, it should check both step_max_tasks (the maximum number of tasks per step) and total_max_tasks (the total number of tasks across all steps). So we should have step_num_tasks <= step_max_tasks and total_num_tasks <= total_max_tasks.
So, if we have 400 tasks each running for step1, step2 and step3, we have total_num_tasks=1200 and can only submit 300 additional tasks (assuming total_max_tasks=1500).
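In pseudo-bash, the intended double check would look something like this (a sketch only; how step_num_tasks and total_num_tasks are computed is left out here):
# step_num_tasks:  running/pending tasks for this step plus the candidate job
# total_num_tasks: running/pending tasks across all steps plus the candidate job
# Both limits must hold before the job is submitted.
if [[ $step_num_tasks -le $step_max_tasks ]] && [[ $total_num_tasks -le $total_max_tasks ]]; then
    sbatch "$file"
else
    echo "Task limit reached; waiting before trying again."
fi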
When you commit again, can you set all step_max_task_list entries to 500? Then the modifiable parameters are:
total_max_task = 1500
step_max_task_list=(
[unpack_topo_reference]=500
[unpack_secondary_slc]=500
[average_baseline]=500
[extract_burst_overlaps]=500
[overlap_geo2rdr]=500
[overlap_resample]=500
[pairs_misreg]=500
[timeseries_misreg]=500
[fullBurst_geo2rdr]=500
[fullBurst_resample]=500
[extract_stack_valid_region]=500
[merge_reference_secondary_slc]=500
[generate_burst_igram]=500
[merge_burst_igram]=500
[filter_coherence]=500
[unwrap]=500
)
Hi @Ovec8hkin , it does not work yet, unfortunately. Same bug as before. It seems to happen when lots of jobs are submitted and when the 25 max is reached....
It may be more effective to start the jobs in your own area. If you have two completely finished runs, it is easy to start lots of jobs by re-running one step (e.g. submit_jobs.bash $PWD --start 9).
submit_jobs.bash /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48 --start 8
Jobfiles to run: /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_11.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_12.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_13.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_14.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_15.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_16.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_17.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_18.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_19.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_20.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_21.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_22.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_23.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_0.job: 22 additional tasks
generate_burst_igram: 22 total tasks (maximum 500)
Number of running/pending jobs: 3
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 68 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 68 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 0 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 44 total tasks (maximum )
Number of running/pending jobs: 0
I don't think you're using the most recent code. The logging output you're attaching doesn't match the updates I made yesterday.
Below is the code I am using, the latest from master. If I remove run_08_*10.job it works fine (see second case).
submit_jobs.bash $PWD --start 8
Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job: 45 additional tasks
generate_burst_igram: 45 total tasks (maximum 500)
Number of running/pending jobs: 0
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 45 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job: 45 additional tasks
: 90 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
mv run_08_*10.job others
submit_jobs.bash $PWD --start 8
Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job: 45 additional tasks
generate_burst_igram: 45 total tasks (maximum 500)
Number of running/pending jobs: 0
generate_burst_igram: number of running/pending tasks is 45 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job: 45 additional tasks
generate_burst_igram: 90 total tasks (maximum 500)
Number of running/pending jobs: 1
generate_burst_igram: number of running/pending tasks is 90 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job: 45 additional tasks
generate_burst_igram: 135 total tasks (maximum 500)
Number of running/pending jobs: 2
generate_burst_igram: number of running/pending tasks is 135 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job: 45 additional tasks
generate_burst_igram: 180 total tasks (maximum 500)
Number of running/pending jobs: 3
generate_burst_igram: number of running/pending tasks is 180 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job: 45 additional tasks
generate_burst_igram: 225 total tasks (maximum 500)
Number of running/pending jobs: 4
generate_burst_igram: number of running/pending tasks is 225 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job: 45 additional tasks
generate_burst_igram: 270 total tasks (maximum 500)
Number of running/pending jobs: 5
generate_burst_igram: number of running/pending tasks is 270 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job: 45 additional tasks
generate_burst_igram: 315 total tasks (maximum 500)
Number of running/pending jobs: 6
generate_burst_igram: number of running/pending tasks is 315 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job: 45 additional tasks
generate_burst_igram: 360 total tasks (maximum 500)
Number of running/pending jobs: 7
generate_burst_igram: number of running/pending tasks is 360 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job: 45 additional tasks
generate_burst_igram: 405 total tasks (maximum 500)
Number of running/pending jobs: 8
generate_burst_igram: number of running/pending tasks is 405 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job: 45 additional tasks
generate_burst_igram: 450 total tasks (maximum 500)
Number of running/pending jobs: 9
Jobs submitted: 7149864 7149865 7149867 7149868 7149869 7149870 7149871 7149872 7149873 7149874
cat /work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash
#!/bin/bash
#set -x
#trap read debug

function compute_tasks_for_step {
    file=$1
    stepname=$2

    IFS=$'\n'
    running_tasks=$(squeue -u $USER --format="%j %A" -rh)
    job_ids=($(echo $running_tasks | grep -oP "\d{4,}"))
    unset IFS

    tasks=0
    for j in "${job_ids[@]}"; do
        task_file=$(scontrol show jobid -dd $j | grep -oP "(?<=Command=)(.*)(?=.job)")
        if [[ "$task_file" == *"$stepname"* ]]; then
            num_tasks=$(cat $task_file | wc -l)
            ((tasks=tasks+$num_tasks))
        fi
    done

    echo $tasks
    return 0
}
function submit_job_conditional {
    declare -A step_max_task_list
    step_max_task_list=(
        [unpack_topo_reference]=500
        [unpack_secondary_slc]=500
        [average_baseline]=500
        [extract_burst_overlaps]=500
        [overlap_geo2rdr]=140
        [overlap_resample]=140
        [pairs_misreg]=50
        [timeseries_misreg]=100
        [fullBurst_geo2rdr]=500
        [fullBurst_resample]=500
        [extract_stack_valid_region]=500
        [merge_reference_secondary_slc]=500
        [generate_burst_igram]=500
        [merge_burst_igram]=500
        [filter_coherence]=500
        [unwrap]=500
    )

    file=$1
    step_name=$(echo $file | grep -oP "(?<=run_\d{2}_)(.*)(?=_\d{1}.job)")
    step_max_tasks="${step_max_task_list[$step_name]}"

    num_active_tasks=$(compute_tasks_for_step $file $step_name)
    num_tasks_job=$(cat ${file%.*} | wc -l)
    total_tasks=$(($num_active_tasks+$num_tasks_job))

    echo "$step_name: number of running/pending tasks is $num_active_tasks (maximum $step_max_tasks)" >&2
    echo "$file: $num_tasks_job additional tasks" >&2
    echo "$step_name: $total_tasks total tasks (maximum $step_max_tasks)" >&2

    num_active_jobs=$(squeue -u $USER -h -t running,pending -r | wc -l )
    echo "Number of running/pending jobs: $num_active_jobs" >&2

    if [[ $num_active_jobs -lt $MAX_JOBS_PER_QUEUE ]] && [[ $total_tasks -lt $step_max_tasks ]]; then
        job_submit_message=$(sbatch $file)
        exit_status="$?"
        if [[ $exit_status -ne 0 ]]; then
            echo "sbatch message: $job_submit_message" >&2
            echo "sbatch submit error: exit code $exit_status. Sleep 30 seconds and try again" >&2
            sleep 30
            job_submit_message=$(sbatch $file | grep "Submitted batch job")
            exit_status="$?"
            if [[ $exit_status -ne 0 ]]; then
                echo "sbatch error message: $job_submit_message" >&2
                echo "sbatch submit error: exit code $exit_status. Sleep 60 seconds and try again" >&2
                sleep 60
                job_submit_message=$(sbatch $file | grep "Submitted batch job")
                exit_status="$?"
                if [[ $exit_status -ne 0 ]]; then
                    echo "sbatch error message: $job_submit_message" >&2
                    echo "sbatch submit error again: exit code $exit_status. Exiting." >&2
                    exit 1
                fi
            fi
        fi

        jobnumber=$(grep -oE "[0-9]{7}" <<< $job_submit_message)
        echo $jobnumber
        return 0
    fi

    return 1
}
if [[ "$1" == "--help" || "$1" == "-h" ]]; then
helptext=" \n\
Job submission script
usage: submit_jobs.bash custom_template_file [--start] [--stop] [--dostep] [--help]\n\
\n\
Examples: \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --start 2 \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --dostep 4 \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --stop 8 \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --start timeseries \n\
submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --dostep insarmaps \n\
\n\
Processing steps (start/end/dostep): \n\
\n\
['1-16', 'timeseries', 'insarmaps' ] \n\
\n\
In order to use either --start or --dostep, it is necessary that a \n\
previous run was done using one of the steps options to process at least \n\
through the step immediately preceding the starting step of the current run. \n\
\n\
--start STEP start processing at the named step [default: load_data].\n\
--end STEP, --stop STEP \n\
end processing at the named step [default: upload] \n\
--dostep STEP run processing at the named step only \n
"
printf "$helptext"
exit 0;
else
PROJECT_NAME=$(basename "$1" | cut -d. -f1)
fi
WORKDIR=$SCRATCHDIR/$PROJECT_NAME
RUNFILES_DIR=$WORKDIR"/run_files"
cd $WORKDIR

startstep=1
stopstep="insarmaps"

start_datetime=$(date +"%Y%m%d:%H-%m")
echo "${start_datetime} * submit_jobs.bash ${WORKDIR} ${@:2}" >> "${WORKDIR}"/log

while [[ $# -gt 0 ]]
do
    key="$1"
    case $key in
        --start)
            startstep="$2"
            shift # past argument
            shift # past value
            ;;
        --stop)
            stopstep="$2"
            shift
            shift
            ;;
        --dostep)
            startstep="$2"
            stopstep="$2"
            shift
            shift
            ;;
        *)
            POSITIONAL+=("$1") # save it in an array for later
            shift # past argument
            ;;
    esac
done
set -- "${POSITIONAL[@]}" # restore positional parameters

#find the last job (11 for 'geometry' and 16 for 'NESD')
job_file_arr=(run_files/run_*_0.job)
last_job_file=${job_file_arr[-1]}
last_job_file_number=${last_job_file:14:2}

if [[ $startstep == "ifgrams" ]]; then
    startstep=1
elif [[ $startstep == "timeseries" ]]; then
    startstep=$((last_job_file_number+1))
elif [[ $startstep == "insarmaps" ]]; then
    startstep=$((last_job_file_number+2))
fi

if [[ $stopstep == "ifgrams" ]]; then
    stopstep=$last_job_file_number
elif [[ $stopstep == "timeseries" ]]; then
    stopstep=$((last_job_file_number+1))
elif [[ $stopstep == "insarmaps" ]]; then
    stopstep=$((last_job_file_number+2))
fi
for (( i=$startstep; i<=$stopstep; i++ )) do
    stepnum="$(printf "%02d" ${i})"
    if [[ $i -le $last_job_file_number ]]; then
        fname="$RUNFILES_DIR/run_${stepnum}_*.job"
    elif [[ $i -eq $((last_job_file_number+1)) ]]; then
        fname="$WORKDIR/smallbaseline_wrapper*.job"
    else
        fname="$WORKDIR/insarmaps*.job"
    fi
    globlist+=("$fname")
done

for g in "${globlist[@]}"; do
    files=($g)
    echo "Jobfiles to run: ${files[@]}"

    # Submit all of the jobs and record all of their job numbers
    jobnumbers=()
    for (( f=0; f < "${#files[@]}"; f++ )); do
        file=${files[$f]}
        jobnumber=$(submit_job_conditional $file)
        exit_status="$?"
        if [[ $exit_status -eq 0 ]]; then
            jobnumbers+=("$jobnumber")
        else
            echo "Couldnt submit job (${file}), because there are $MAX_JOBS_PER_QUEUE active jobs right now. Waiting 5 minutes to submit next job."
            f=$((f-1))
            sleep 300 # sleep for 5 minutes
        fi
    done

    echo "Jobs submitted: ${jobnumbers[@]}"
    sleep 5

    # Wait for each job to complete
    for (( j=0; j < "${#jobnumbers[@]}"; j++)); do
        jobnumber=${jobnumbers[$j]}

        # Parse out the state of the job from the sacct function.
        # Format state to be all uppercase (PENDING, RUNNING, or COMPLETED)
        # and remove leading whitespace characters.
        state=$(sacct --format="State" -j $jobnumber | grep "\w[[:upper:]]\w")
        state="$(echo -e "${state}" | sed -e 's/^[[:space:]]*//')"

        # Keep checking the state while it is not "COMPLETED"
        secs=0
        while true; do
            # Only print every so often, not every 30 seconds
            if [[ $(( $secs % 30)) -eq 0 ]]; then
                echo "$(basename $WORKDIR) $(basename "$file"), ${jobnumber} is not finished yet. Current state is '${state}'"
            fi

            state=$(sacct --format="State" -j $jobnumber | grep "\w[[:upper:]]\w")
            state="$(echo -e "${state}" | sed -e 's/^[[:space:]]*//')"

            # Check if "COMPLETED" is anywhere in the state string variables.
            # This gets rid of some strange special character issues.
            if [[ $state == *"TIMEOUT"* ]] && [[ $state != "" ]]; then
                jf=${files[$j]}
                init_walltime=$(grep -oP '(?<=#SBATCH -t )[0-9]+:[0-9]+:[0-9]+' $jf)

                echo "${jobnumber} timedout due to too low a walltime (${init_walltime})."

                # Compute a new walltime and update the job file
                update_walltime.py "$jf" &> /dev/null
                updated_walltime=$(grep -oP '(?<=#SBATCH -t )[0-9]+:[0-9]+:[0-9]+' $jf)

                datetime=$(date +"%Y-%m-%d:%H-%M")
                echo "${datetime}: re-running: ${jf}: ${init_walltime} --> ${updated_walltime}" >> "${RUNFILES_DIR}"/rerun.log
                echo "Resubmitting file (${jf}) with new walltime of ${updated_walltime}"

                # Resubmit as a new job number
                jobnumber=$(submit_job_conditional $jf)
                exit_status="$?"
                if [[ $exit_status -eq 0 ]]; then
                    jobnumbers+=("$jobnumber")
                    files+=("$jf")
                    echo "${jf} resubmitted as jobumber: ${jobnumber}"
                else
                    echo "sbatch re-submit error message: $jobnumber"
                    echo "sbatch re-submit error: exit code $exit_status. Exiting."
                    exit 1
                fi

                break;
            elif [[ $state == *"COMPLETED"* ]] && [[ $state != "" ]]; then
                state="COMPLETED"
                echo "${jobnumber} is complete"
                break;
            elif [[ ( $state == *"FAILED"* ) && $state != "" ]]; then
                echo "${jobnumber} FAILED. Exiting with status code 1."
                exit 1;
            elif [[ ( $state == *"CANCELLED"* ) && $state != "" ]]; then
                echo "${jobnumber} was CANCELLED. Exiting with status code 1."
                exit 1;
            fi

            # Wait for 30 second before chcking again
            sleep 30
            ((secs=secs+30))
        done

        echo Job"${jobnumber} is finished."
    done

    # Run check_job_output.py on all files
    cmd="check_job_outputs.py ${files[@]}"
    echo "$cmd"
    $cmd
    exit_status="$?"
    if [[ $exit_status -ne 0 ]]; then
        echo "Error in submit_jobs.bash: check_job_outputs.py exited with code ($exit_status)."
        exit 1
    fi
done
That's not the most updated code. If you view the file on GitHub, the most recent copy starts with a function named compute_total_tasks.
Hah, I did something stupid. My bad!
Yes, it works now. Great! This is big progress! Thank you !!!!
Two minor items:
For total_tasks it stays below 1500. This is what it should do. For step_tasks it goes above 500. Does it really do this, or does it display the number of tasks including the job it wants to submit? It would be preferred to stay below 500.
/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_2.job: 88 additional tasks
filter_coherence: 438 total tasks (maximum 500)
1361 tasks running across all jobs (maximum 1500)
Number of running/pending jobs: 20
1361 tasks running across all jobs (maximum 1500)
filter_coherence: 438 running/pending tasks (maximum 500)
/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_3.job: 88 additional tasks
filter_coherence: 526 total tasks (maximum 500)
1449 tasks running across all jobs (maximum 1500)
Number of running/pending jobs: 21
Couldnt submit job (/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_3.job), because there are too many active tasks for this step right now. Waiting 5 minutes to submit next job.
Also, what is the difference between generate_burst_igram: 485 running/pending tasks and generate_burst_igram: 573 total tasks?
submit_jobs.bash $PWD --start 13
Jobfiles to run: /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_10.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_11.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_1.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_2.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_3.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_4.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_5.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_6.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_7.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_8.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_9.job
485 tasks running across all jobs (maximum 1500)
generate_burst_igram: 485 running/pending tasks (maximum 500)
/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job: 88 additional tasks
generate_burst_igram: 573 total tasks (maximum 500)
573 tasks running across all jobs (maximum 1500)
Number of running/pending jobs: 11
Can we always use running/pending, as this is what it is? In the script the variable name should preferably contain active (you might have that already). For messaging, how about:
generate_burst_igram: 573 running/pending tasks (maximum 500)
all jobs: 1361 running/pending tasks (maximum 1500)
There is a subtle difference between the two lines of logging:
>>> generate_burst_igram: 485 running/pending tasks (maximum 500)
>>> /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job: 88 additional tasks
>>> generate_burst_igram: 573 total tasks (maximum 500)
Line 1 indicates the number of running/pending tasks for the specified step, NOT including the job that we are attempting to submit at that moment. Line 2 indicates the number of tasks associated with the job we are attempting to submit. Line 3 indicates the number of tasks that WOULD be running if the job is successfully submitted. If the number of tasks reported on line 3 exceeds the maximum for the step, the job does not submit.
That is what I thought. So line 3 is not needed, right?
Not technically, no. I leave it there for debugging purposes. You can comment it out if you would like.
Are you going to commit anything on this file? Otherwise I will do some minor modifications before I forget.
I have no pending changes to make here.
Hi @Ovec8hkin
There is one bug left. When you run --start timeseries (or --start insarmaps) it does not work, because those job files don't start with run_. This line https://github.com/geodesymiami/rsmas_insar/blob/d1c57b3219cd2a5a36cfaef04c794fc7685cc89c/minsar/submit_jobs.bash#L79 creates an error. A possible solution would be to first check whether the job name contains run, and otherwise skip the conditional completely for timeseries and insarmaps (a possible sketch follows the log below).
submit_jobs.bash /scratch/05861/tg851601/unittestGalapagosSenDT128 --start timeseries
Jobfiles to run: /scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 80: step_io_load_list: bad array subscript
(standard_in) 2: syntax error
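As a sketch of the suggested guard (illustrative only; where exactly it belongs inside submit_job_conditional is up to the implementation, and the fallback to total_max_tasks is just one option):
# Only apply the per-step lookup to run_* job files; smallbaseline_wrapper
# and insarmaps jobs have no step entry, so skip the per-step limit for them.
job_name=$(basename "$file")
if [[ "$job_name" == run_* ]]; then
    step_name=$(echo "$job_name" | grep -oP "(?<=run_\d{2}_)(.*)(?=_\d+.job)")
    step_max_tasks="${step_max_task_list[$step_name]}"
else
    step_max_tasks=$total_max_tasks   # assumed fallback: no per-step limit
fi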
I cleaned up the naming a bit and switched to IO_load (it is used to calculate step_max_tasks). I just pushed this.
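For the record, one way the IO_load idea can translate into a per-step limit (purely illustrative; the actual formula and io_load values in the repository may differ):
# Hypothetical: each step gets an IO load per task (1.0 = "normal"),
# and the per-step cap is derived from the budget for a step with io_load 1.0.
declare -A step_io_load_list=(
    [unpack_secondary_slc]=1.0
    [filter_coherence]=1.0
    [unwrap]=1.0
)
max_tasks_unit=500                       # assumed cap for a step with io_load 1.0
step_io_load=${step_io_load_list[$step_name]}
step_max_tasks=$(echo "$max_tasks_unit / $step_io_load" | bc)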
Hi @Ovec8hkin Can you please check whether the re-running is properly implemented when the maximum number of tasks is reached? In the case below it does not work. This looks like the previous issue with re-running not working when the 25-job limit was reached.
If you don't see the issue quickly, I will have to make some effort to recreate it.
On a related issue, looking at the code and as suggested by Sara, I think we should remove the submit_job_conditional function from submit_jobs.bash and make it its own script, sbatch_conditional.bash. That would make submit_jobs.bash much lighter. If we need to submit a job we could use either sbatch or sbatch_conditional.
The regular sbatch output is at the end of this comment. sbatch_conditional.bash would show something similar, but with the task checks and potentially the waiting messages. The regular sbatch fails if the job limit is reached; ours does not fail but waits and then tries again. What do you think? I like this idea a lot!
708 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 46 running/pending tasks (maximum 500)
run_08_generate_burst_igram_8.job: 46 additional tasks
Number of running/pending jobs: 17
754 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 92 running/pending tasks (maximum 500)
run_08_generate_burst_igram_9.job: 46 additional tasks
Jobs submitted: 7175671 7175672 7175673 7175674 7175675 7175676 7175677 7175678 7175679 7175701 7175718 7175719
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175671 is not finished yet. Current state is 'TIMEOUT '
7175671 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_0.job) with new walltime of 00:22:24
Number of running/pending jobs: 18
800 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 138 running/pending tasks (maximum 500)
run_08_generate_burst_igram_0.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_0.job resubmitted as jobumber: 7175720
Job7175720 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175672 is not finished yet. Current state is 'TIMEOUT '
7175672 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_10.job) with new walltime of 00:22:24
Number of running/pending jobs: 19
846 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 184 running/pending tasks (maximum 500)
run_08_generate_burst_igram_10.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_10.job resubmitted as jobumber: 7175722
Job7175722 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175673 is not finished yet. Current state is 'TIMEOUT '
7175673 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_11.job) with new walltime of 00:22:24
Number of running/pending jobs: 20
892 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 230 running/pending tasks (maximum 500)
run_08_generate_burst_igram_11.job: 40 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_11.job resubmitted as jobumber: 7175723
Job7175723 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175674 is not finished yet. Current state is 'TIMEOUT '
7175674 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_1.job) with new walltime of 00:22:24
Number of running/pending jobs: 21
932 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 270 running/pending tasks (maximum 500)
run_08_generate_burst_igram_1.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_1.job resubmitted as jobumber: 7175724
Job7175724 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175675 is not finished yet. Current state is 'TIMEOUT '
7175675 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_2.job) with new walltime of 00:22:24
Number of running/pending jobs: 22
978 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 316 running/pending tasks (maximum 500)
run_08_generate_burst_igram_2.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_2.job resubmitted as jobumber: 7175725
Job7175725 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175676 is not finished yet. Current state is 'TIMEOUT '
7175676 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_3.job) with new walltime of 00:22:24
Number of running/pending jobs: 23
1024 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 362 running/pending tasks (maximum 500)
run_08_generate_burst_igram_3.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_3.job resubmitted as jobumber: 7175736
Job7175736 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175677 is not finished yet. Current state is 'TIMEOUT '
7175677 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_4.job) with new walltime of 00:22:24
Number of running/pending jobs: 24
1070 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 408 running/pending tasks (maximum 500)
run_08_generate_burst_igram_4.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_4.job resubmitted as jobumber: 7175737
Job7175737 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175678 is not finished yet. Current state is 'TIMEOUT '
7175678 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_5.job) with new walltime of 00:22:24
Number of running/pending jobs: 25
1116 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 454 running/pending tasks (maximum 500)
run_08_generate_burst_igram_5.job: 46 additional tasks
sbatch re-submit error message:
sbatch re-submit error: exit code 1. Exiting.
Regular sbatch output
idevdev
-> Checking on the status of skx-dev queue. OK
-> Defaults file : ~/.idevrc
-> System : stampede2
-> Queue : skx-dev (cmd line: -p )
-> Nodes : 1 (cmd line: -N )
-> Total tasks : 48 (cmd line: -n )
-> Time (minutes) : 30 (idev default )
-> Project : TG-EAR200012 (~/.idevrc )
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work dir (/work/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-EAR200012)...OK
--> Verifying that quota for filesystem /home1/05861/tg851601 is at 82.76% allocated...OK
--> Verifying that quota for filesystem /work/05861/tg851601/stampede2 is at 34.19% allocated...OK
Submitted batch job 7176714
Hi @falkamelung and @Ovec8hkin
Yes, this would be really helpful if implemented. I need sbatch_conditional.bash to receive a batch file (with any naming convention) and submit all corresponding jobs (numbered like *_1.job). It should then do all the checking for the number of tasks and fail if the checks do not pass.
For example, I have a batch file named run_my_batchfile and have split the tasks into multiple jobs like run_my_batchfile_0.job, run_my_batchfile_1.job, ... Now I want to submit them using sbatch_conditional.bash: sbatch_conditional.bash run_my_batchfile
It should take care of finding the jobs, submitting them, and checking everything.
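In code, that interface could boil down to something like this (a sketch under Sara's naming convention; run_my_batchfile is her example, and a real version would wait and retry rather than exit when the limits are hit):
#!/bin/bash
# Proposed usage: sbatch_conditional.bash run_my_batchfile
# Finds run_my_batchfile_0.job, run_my_batchfile_1.job, ... and submits each
# one with the usual job- and task-count checks.
prefix=$1

shopt -s nullglob
job_files=(${prefix}_*.job)
if [[ ${#job_files[@]} -eq 0 ]]; then
    echo "No job files found for prefix $prefix" >&2
    exit 1
fi

for job_file in "${job_files[@]}"; do
    jobnumber=$(submit_job_conditional "$job_file") || exit 1
    echo "Submitted $job_file as job $jobnumber"
done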
@falkamelung What's the problem here? I don't see the task limits being exceeded. I'm using sbatch_conditional for resubmission, so all of the checks are performed if there is a TIMEOUT error.
It says sbatch re-submit error: exit code 1 and then it exits. It should not exit. Did you see that?
It throws that error message and exits after attempting to submit three times and failing. If you want to change the wait times, they are in the sbatch_conditional function.
This is not clear to me yet. The message above was produced here: https://github.com/geodesymiami/rsmas_insar/blob/caff6511106ff93306303ba154246c85d92718c4/minsar/submit_jobs.bash#L322-L332
Why did sbatch_conditional fail? When it could not submit the job because there were too many running/pending jobs, it should have waited 5 minutes.
I see that the 5-minute wait is not within the sbatch_conditional function:
https://github.com/geodesymiami/rsmas_insar/blob/caff6511106ff93306303ba154246c85d92718c4/minsar/submit_jobs.bash#L264-L277
Should it be?
This might be covered by Sara's sbatch_conditional.bash suggestion. The script would submit the job no matter what; it just waits until all conditions are satisfied.
We might want to have a call as I am not sure we are on the same page. You said you fixed the timeout-max_number_of_jobs issue but I lost track of what you did and whether it really solved the issue.
This is because you don't use issues properly and just continue on these huge threads about unrelated problems instead of keeping related thoughts in a single issue.
The results you're getting are as expected. Previously you requested that the script attempt to submit a job, and if the actual submission fails (i.e. sbatch job_file.job returns a non-zero exit code), to wait temporarily and try again. If it fails 3 times consecutively, then exit. That is what happened and caused the previous termination. Is this no longer desired functionality?
The five-minute wait is for when the job cannot be submitted at all due to too many running jobs or too many tasks. In that case, we wait five minutes and try again, hoping resources have been freed so the submission succeeds.
Yes, these huge threads are not ideal........
The desired behavior is that it waits 5 minutes if it can't submit because of too many tasks/jobs. This applies for first-submits as well as re-submits.
I am trying to think where the three-consecutive-failures rule comes from. Something like this applies to minsar_wrapper, and I do see it implemented (I did not get a chance to test that yet).
At some point I might have suggested to submit and analyze the returned error message, but the way we do it now is much better.
Hmmm... Do you remember how this came up? What would be the benefit of trying 3 times? If a job has not been submitted after, say, 72 hours, that would be a reason to raise an exception.
@falkamelung How long do you want me to attempt to submit a job file before exiting? Do you want to try a set number of times with increasingly long wait times (say 5 times total, with an additional 5-minute wait after each failure, so 5min, 10min, 15min, 20min, 25min), or do you want to try for a specific amount of time (say 12 hours)?
It should not exit. Put a 1-week limit on it. The reason it has to wait is that I submitted too many jobs at the same time. This is exactly what I want to do: submit several big jobs and wait a week until they are done, instead of submitting a new job every day.
Ok. I will let the script attempt to submit a job for 1 week (7 days). If, after a week, the constraints for job submission are still in place, I will let it exit with an error code.
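So the submit loop would look roughly like this (a sketch of the agreed behavior; the 7-day cutoff and 5-minute interval are the values from this discussion):
# Keep trying to submit for up to 7 days; only then give up with an error.
max_wait_secs=$((7*24*60*60))
waited=0

until jobnumber=$(submit_job_conditional "$file"); do
    if [[ $waited -ge $max_wait_secs ]]; then
        echo "Could not submit $file within 7 days. Exiting." >&2
        exit 1
    fi
    echo "Job/task limits reached for $file. Waiting 5 minutes before trying again." >&2
    sleep 300
    waited=$((waited+300))
done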
@mirzaees please create a new issue with information about how you want a generalized sbatch_conditional function to operate. Please include example inputs and outputs, requested command line options, and the file naming conventions that this is intended to work with.
done
hi @Ovec8hkin : Let's also address #450 (comment), which is to limit the total load on the system.
We should get the total number of running/pending tasks across all our jobs and limit it to e.g.
max_tasks_total=1500
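A sketch of how that total could be counted (it mirrors compute_tasks_for_step above but drops the step filter; it assumes, as elsewhere in this thread, that each job's command file lists one task per line):
# Sum the tasks of every running/pending job of this user, regardless of step.
total_num_tasks=0
for job_id in $(squeue -u $USER -h -t running,pending -r --format="%A"); do
    task_file=$(scontrol show jobid -dd $job_id | grep -oP "(?<=Command=)(.*)(?=.job)")
    [[ -f "$task_file" ]] && total_num_tasks=$((total_num_tasks + $(wc -l < "$task_file")))
done

max_tasks_total=1500
echo "$total_num_tasks running/pending tasks across all jobs (maximum $max_tasks_total)"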