geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

limit to total number of tasks in submit_bash.job #455

Closed falkamelung closed 3 years ago

falkamelung commented 3 years ago

hi @Ovec8hkin : Let's also address #450 (comment), which is to limit the total load on the system.

We should get

total_num_active_tasks = num_tasks_topo_reference + num_tasks_secondary_slc + num_tasks_average_baseline + num_tasks_extract_burst_overlaps + ...

and limit it to e.g. max_tasks_total=1500
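For illustration, a minimal bash sketch of that global check; num_active_tasks_for_step is a hypothetical helper (stubbed here) that in the real script would count running/pending tasks for one step via squeue/scontrol:

    max_tasks_total=1500
    steps=(topo_reference secondary_slc average_baseline extract_burst_overlaps)

    # hypothetical helper: in the real script this would query squeue/scontrol;
    # stubbed here so the sketch runs stand-alone
    num_active_tasks_for_step() { echo 0; }

    total_num_active_tasks=0
    for step in "${steps[@]}"; do
        step_tasks=$(num_active_tasks_for_step "$step")
        ((total_num_active_tasks += step_tasks))
    done

    if [[ $total_num_active_tasks -ge $max_tasks_total ]]; then
        echo "total active tasks ($total_num_active_tasks) at or above limit ($max_tasks_total); wait before submitting" >&2
    fi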

Ovec8hkin commented 3 years ago

This is simply akin to summing the total number of active tasks across all job steps and all projects correct?

falkamelung commented 3 years ago

Yes.

Ovec8hkin commented 3 years ago

Are you ever going to hit 1500 tasks if you also have the MAX_JOBS_PER_QUEUE check? I think that either the MAX_JOBS check or the stepwise max task check will catch an obscene number of tasks before a global maximum will.

In testing, the most tasks I can get to submit at once (with both the MAX_JOBS and MAX_TASKS_STEP checks) is <400.

Ovec8hkin commented 3 years ago

Also, I recommend you make a file somewhere that contains all of the step restrictions (number of tasks, IO load constants, etc.). It'll be much more sustainable to source them from a file than to have them hardcoded.

falkamelung commented 3 years ago

Yes, we will. One dataset may have 20,000 tasks simultaneously. On Stampede we never reach this because jobs are pending, but on Frontera it may happen that you have tens of jobs running simultaneously (after somebody else's big job).

The MAX_JOBS_PER_QUEUE is a system restriction. The standard on Frontera is 100 jobs. If each job runs 1000 tasks we end up with 100,000 tasks simultaneously....

During the TEXASCALE days you are allowed to run only if your code can run more than 2000 jobs simultaneously....

Ovec8hkin commented 3 years ago

Ok. Implemented and pushed.

falkamelung commented 3 years ago

Regarding specifying the task limits, did you see the comment https://github.com/geodesymiami/rsmas_insar/issues/450#issuecomment-756381093 ? This is how I think we should specify the load for each task. The MAX_TASKS_TOTAL will be specified in https://github.com/geodesymiami/rsmas_insar/blob/master/minsar/defaults/queues.cfg

But I first have to use it to be sure that it works as I envision it.
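For illustration only, an entry in queues.cfg could look roughly like this (the actual column layout of that file may differ; values below are the limits discussed in this thread):

    # hypothetical queues.cfg columns (illustrative only)
    # PLATFORM    QUEUE        MAX_JOBS_PER_QUEUE   MAX_TASKS_TOTAL
    frontera      normal       100                  1500
    stampede2     skx-normal   25                   1500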

Ovec8hkin commented 3 years ago

Yes, that is fine, though I think such a file should only contain data needed for limiting job submissions (so no walltime stuff). I would need you to write the file with the appropriate step names, task maximums, and IO loads.

falkamelung commented 3 years ago

Yes, will do. It probably will be the same file. There are different max_tasks for different queues as they run on different hardware. We should use the script that you created for platform_defaults.

Ovec8hkin commented 3 years ago

Can't do exactly the same thing, as we wouldn't want a unique environment variable for each step. I can probably use read_platform_defaults as a base though. I'll likely read the file into an associative array in bash and use it like that.
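A minimal sketch of that approach, assuming a hypothetical whitespace-delimited defaults file with one line per step (step name, max tasks, IO load); the real file format and path may differ:

    declare -A step_max_task_list
    declare -A step_io_load_list

    # read "<step> <max_tasks> <io_load>" lines, skipping blanks and comments
    while read -r step max_tasks io_load; do
        [[ -z "$step" || "$step" == \#* ]] && continue
        step_max_task_list[$step]=$max_tasks
        step_io_load_list[$step]=$io_load
    done < "$RSMASINSAR_HOME/minsar/defaults/step_limits.cfg"   # hypothetical path

    echo "${step_max_task_list[unwrap]}"   # e.g. 500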

falkamelung commented 3 years ago

Hi @Ovec8hkin, it is not fully correct yet. It currently checks only against the per-step maximum (500 in this case). I did not see total_max_tasks being set anywhere.

Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_0.job /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_1.job /scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_2.job
unpack_secondary_slc: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_0.job: 45 additional tasks
unpack_secondary_slc: 45 total tasks (maximum 500)
Number of running/pending jobs: 3
unpack_secondary_slc: number of running/pending tasks is 45 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_1.job: 45 additional tasks
unpack_secondary_slc: 90 total tasks (maximum 500)
Number of running/pending jobs: 4
unpack_secondary_slc: number of running/pending tasks is 90 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_02_unpack_secondary_slc_2.job: 43 additional tasks
unpack_secondary_slc: 133 total tasks (maximum 500)
Number of running/pending jobs: 5
Jobs submitted: 7147543 7147544 7147545
KokoxiliBigChunk30SenDT48 run_02_unpack_secondary_slc_2.job, 7147543 is not finished yet. Current state is 'PENDING '

However, it should check both step_max_tasks (the maximum number of tasks per step) and total_max_tasks (the total number of tasks). So we should have step_num_tasks <= step_max_tasks and total_num_tasks <= total_max_tasks.

So, if we have 400 tasks running each for step1, step2 and step3, we have total_num_tasks=1200 and can only submit 300 additional tasks (assuming total_max_tasks=1500).
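A minimal sketch of the two-level check (the counts are example values; in the script they would come from squeue/scontrol):

    step_max_tasks=500       # per-step limit
    total_max_tasks=1500     # global limit
    step_num_tasks=400       # running/pending tasks for this step
    total_num_tasks=1200     # running/pending tasks across all steps
    num_tasks_job=100        # tasks in the job file to be submitted

    if (( step_num_tasks  + num_tasks_job <= step_max_tasks )) && \
       (( total_num_tasks + num_tasks_job <= total_max_tasks )); then
        echo "submit"        # sbatch "$file" in the real script
    else
        echo "wait"          # sleep and retry later
    fi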

When you commit again can you set all step_max_task_list entries to 500? Then the modifiable parameters are:

total_max_task = 1500

 step_max_task_list=(
        [unpack_topo_reference]=500
        [unpack_secondary_slc]=500
        [average_baseline]=500
        [extract_burst_overlaps]=500
        [overlap_geo2rdr]=500
        [overlap_resample]=500
        [pairs_misreg]=500
        [timeseries_misreg]=500
        [fullBurst_geo2rdr]=500
        [fullBurst_resample]=500
        [extract_stack_valid_region]=500
        [merge_reference_secondary_slc]=500
        [generate_burst_igram]=500
        [merge_burst_igram]=500
        [filter_coherence]=500
        [unwrap]=500
    )
falkamelung commented 3 years ago

Hi @Ovec8hkin, it does not work yet, unfortunately. Same bug as before. It seems to happen when lots of jobs are submitted and when the maximum of 25 jobs is reached....

It may be more effective to start the jobs in your area. If you have two completely finished datasets it is easy to start lots of jobs by re-running one step (e.g. job_submit.bash $PWD --start 9).

submit_jobs.bash /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48 --start 8
Jobfiles to run: /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_11.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_12.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_13.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_14.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_15.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_16.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_17.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_18.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_19.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_20.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_21.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_22.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_23.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/wd2KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_0.job: 22 additional tasks
generate_burst_igram: 22 total tasks (maximum 500)
Number of running/pending jobs: 3
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 68 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 24 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 68 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 0 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk30SenDT48/run_files/run_08_generate_burst_igram_10.job: 44 additional tasks
: 44 total tasks (maximum )
Number of running/pending jobs: 0
Ovec8hkin commented 3 years ago

I don't think you're using the most recent code. The logging output you're attaching doesn't match the updates I made yesterday.

falkamelung commented 3 years ago

Below is the code I am using. The latest from master.

If I remove run_08_*10.job it works fine (see second case)

submit_jobs.bash $PWD --start 8
Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job: 45 additional tasks
generate_burst_igram: 45 total tasks (maximum 500)
Number of running/pending jobs: 0
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 53: step_max_task_list: bad array subscript
: number of running/pending tasks is 45 (maximum )
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job: 45 additional tasks
: 90 total tasks (maximum )
Number of running/pending jobs: 1
Couldnt submit job (/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_10.job), because there are 25 active jobs right now. Waiting 5 minutes to submit next job.
mv run_08_*10.job others

submit_jobs.bash $PWD --start 8
Jobfiles to run: /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job /scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job
generate_burst_igram: number of running/pending tasks is 0 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_0.job: 45 additional tasks
generate_burst_igram: 45 total tasks (maximum 500)
Number of running/pending jobs: 0
generate_burst_igram: number of running/pending tasks is 45 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_1.job: 45 additional tasks
generate_burst_igram: 90 total tasks (maximum 500)
Number of running/pending jobs: 1
generate_burst_igram: number of running/pending tasks is 90 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_2.job: 45 additional tasks
generate_burst_igram: 135 total tasks (maximum 500)
Number of running/pending jobs: 2
generate_burst_igram: number of running/pending tasks is 135 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_3.job: 45 additional tasks
generate_burst_igram: 180 total tasks (maximum 500)
Number of running/pending jobs: 3
generate_burst_igram: number of running/pending tasks is 180 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_4.job: 45 additional tasks
generate_burst_igram: 225 total tasks (maximum 500)
Number of running/pending jobs: 4
generate_burst_igram: number of running/pending tasks is 225 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_5.job: 45 additional tasks
generate_burst_igram: 270 total tasks (maximum 500)
Number of running/pending jobs: 5
generate_burst_igram: number of running/pending tasks is 270 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_6.job: 45 additional tasks
generate_burst_igram: 315 total tasks (maximum 500)
Number of running/pending jobs: 6
generate_burst_igram: number of running/pending tasks is 315 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_7.job: 45 additional tasks
generate_burst_igram: 360 total tasks (maximum 500)
Number of running/pending jobs: 7
generate_burst_igram: number of running/pending tasks is 360 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_8.job: 45 additional tasks
generate_burst_igram: 405 total tasks (maximum 500)
Number of running/pending jobs: 8
generate_burst_igram: number of running/pending tasks is 405 (maximum 500)
/scratch/05861/tg851601/KokoxiliBigChunk31SenDT48/run_files/run_08_generate_burst_igram_9.job: 45 additional tasks
generate_burst_igram: 450 total tasks (maximum 500)
Number of running/pending jobs: 9
Jobs submitted: 7149864 7149865 7149867 7149868 7149869 7149870 7149871 7149872 7149873 7149874
cat /work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash
##! /bin/bash
#set -x
#trap read debug

function compute_tasks_for_step {
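    # Count running/pending tasks for the given step: list this user's active jobs
    # with squeue, look up each job's launcher file via scontrol, and sum the line
    # counts of the launcher files whose path contains the step name.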
    file=$1
    stepname=$2

    IFS=$'\n'
    running_tasks=$(squeue -u $USER --format="%j %A" -rh)
    job_ids=($(echo $running_tasks | grep -oP "\d{4,}"))

    unset IFS

    tasks=0
    for j in "${job_ids[@]}"; do
        task_file=$(scontrol show jobid -dd $j | grep -oP "(?<=Command=)(.*)(?=.job)")
        if [[ "$task_file" == *"$stepname"* ]]; then
            num_tasks=$(cat $task_file | wc -l)
            ((tasks=tasks+$num_tasks))
        fi
    done

    echo $tasks
    return 0
}

function submit_job_conditional {
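    # Submit the given job file only if both the queue job limit (MAX_JOBS_PER_QUEUE)
    # and the per-step task limit allow it; on success print the new job number and
    # return 0, otherwise return 1 so the caller can wait and retry.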

    declare -A step_max_task_list
    step_max_task_list=(
    [unpack_topo_reference]=500
    [unpack_secondary_slc]=500 
    [average_baseline]=500 
    [extract_burst_overlaps]=500 
    [overlap_geo2rdr]=140 
    [overlap_resample]=140 
    [pairs_misreg]=50 
    [timeseries_misreg]=100 
    [fullBurst_geo2rdr]=500 
    [fullBurst_resample]=500 
    [extract_stack_valid_region]=500 
    [merge_reference_secondary_slc]=500 
    [generate_burst_igram]=500 
    [merge_burst_igram]=500 
    [filter_coherence]=500 
    [unwrap]=500
    )

    file=$1

    step_name=$(echo $file | grep -oP "(?<=run_\d{2}_)(.*)(?=_\d{1}.job)")
    step_max_tasks="${step_max_task_list[$step_name]}"

    num_active_tasks=$(compute_tasks_for_step $file $step_name)
    num_tasks_job=$(cat ${file%.*} | wc -l)
    total_tasks=$(($num_active_tasks+$num_tasks_job))

    echo "$step_name: number of running/pending tasks is $num_active_tasks (maximum $step_max_tasks)" >&2
    echo "$file: $num_tasks_job additional tasks" >&2
    echo "$step_name: $total_tasks total tasks (maximum $step_max_tasks)" >&2

    num_active_jobs=$(squeue -u $USER -h -t running,pending -r | wc -l )
    echo "Number of running/pending jobs: $num_active_jobs" >&2 

    if [[ $num_active_jobs -lt $MAX_JOBS_PER_QUEUE ]] && [[ $total_tasks -lt $step_max_tasks ]]; then
        job_submit_message=$(sbatch $file)
        exit_status="$?"
        if [[ $exit_status -ne 0 ]]; then
            echo "sbatch message: $job_submit_message" >&2
            echo "sbatch submit error: exit code $exit_status. Sleep 30 seconds and try again" >&2
            sleep 30
            job_submit_message=$(sbatch $file | grep "Submitted batch job")
            exit_status="$?"
            if [[ $exit_status -ne 0 ]]; then
                echo "sbatch error message: $job_submit_message" >&2
                echo "sbatch submit error: exit code $exit_status. Sleep 60 seconds and try again" >&2
                sleep 60
                job_submit_message=$(sbatch $file | grep "Submitted batch job")
                exit_status="$?"
                if [[ $exit_status -ne 0 ]]; then
                    echo "sbatch error message: $job_submit_message" >&2
                    echo "sbatch submit error again: exit code $exit_status. Exiting." >&2
                    exit 1
                fi
            fi
        fi

        jobnumber=$(grep -oE "[0-9]{7}" <<< $job_submit_message)

        echo $jobnumber
        return 0

    fi

    return 1

}

if [[ "$1" == "--help" || "$1" == "-h" ]]; then
helptext="                                                                         \n\
Job submission script
usage: submit_jobs.bash custom_template_file [--start] [--stop] [--dostep] [--help]\n\
                                                                                   \n\
  Examples:                                                                        \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template              \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --start 2    \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --dostep 4   \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --stop 8     \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --start timeseries \n\
      submit_jobs.bash \$SAMPLESDIR/unittestGalapagosSenDT128.template --dostep insarmaps \n\
                                                                                   \n\
 Processing steps (start/end/dostep): \n\
                                                                                 \n\
   ['1-16', 'timeseries', 'insarmaps' ]                                          \n\
                                                                                 \n\
   In order to use either --start or --dostep, it is necessary that a            \n\
   previous run was done using one of the steps options to process at least      \n\
   through the step immediately preceding the starting step of the current run.  \n\
                                                                                 \n\
   --start STEP          start processing at the named step [default: load_data].\n\
   --end STEP, --stop STEP                                                       \n\
                         end processing at the named step [default: upload]      \n\
   --dostep STEP         run processing at the named step only                   \n 
     "
    printf "$helptext"
    exit 0;
else
    PROJECT_NAME=$(basename "$1" | cut -d. -f1)
fi
WORKDIR=$SCRATCHDIR/$PROJECT_NAME
RUNFILES_DIR=$WORKDIR"/run_files"

cd $WORKDIR

startstep=1
stopstep="insarmaps"

start_datetime=$(date +"%Y%m%d:%H-%m")
echo "${start_datetime} * submit_jobs.bash ${WORKDIR} ${@:2}" >> "${WORKDIR}"/log

while [[ $# -gt 0 ]]
do
    key="$1"

    case $key in
        --start)
            startstep="$2"
            shift # past argument
            shift # past value
            ;;
    --stop)
            stopstep="$2"
            shift
            shift
            ;;
    --dostep)
            startstep="$2"
            stopstep="$2"
            shift
            shift
            ;;
        *)
            POSITIONAL+=("$1") # save it in an array for later
            shift # past argument
            ;;
esac
done
set -- "${POSITIONAL[@]}" # restore positional parameters

#find the last job (11 for 'geometry' and 16 for 'NESD')
job_file_arr=(run_files/run_*_0.job)
last_job_file=${job_file_arr[-1]}
last_job_file_number=${last_job_file:14:2}

if [[ $startstep == "ifgrams" ]]; then
    startstep=1
elif [[ $startstep == "timeseries" ]]; then
    startstep=$((last_job_file_number+1))
elif [[ $startstep == "insarmaps" ]]; then
    startstep=$((last_job_file_number+2))
fi

if [[ $stopstep == "ifgrams" ]]; then
    stopstep=$last_job_file_number
elif [[ $stopstep == "timeseries" ]]; then
    stopstep=$((last_job_file_number+1))
elif [[ $stopstep == "insarmaps" ]]; then
    stopstep=$((last_job_file_number+2))
fi

for (( i=$startstep; i<=$stopstep; i++ )) do
    stepnum="$(printf "%02d" ${i})"
    if [[ $i -le $last_job_file_number ]]; then
    fname="$RUNFILES_DIR/run_${stepnum}_*.job"
    elif [[ $i -eq $((last_job_file_number+1)) ]]; then
    fname="$WORKDIR/smallbaseline_wrapper*.job"
    else
    fname="$WORKDIR/insarmaps*.job"
    fi
    globlist+=("$fname")
done

for g in "${globlist[@]}"; do
    files=($g)
    echo "Jobfiles to run: ${files[@]}"

    # Submit all of the jobs and record all of their job numbers
    jobnumbers=()
    for (( f=0; f < "${#files[@]}"; f++ )); do
    file=${files[$f]}

        jobnumber=$(submit_job_conditional $file)
        exit_status="$?"
        if [[ $exit_status -eq 0 ]]; then
            jobnumbers+=("$jobnumber")
        else
            echo "Couldnt submit job (${file}), because there are $MAX_JOBS_PER_QUEUE active jobs right now. Waiting 5 minutes to submit next job."
            f=$((f-1))
            sleep 300 # sleep for 5 minutes
        fi

    done

    echo "Jobs submitted: ${jobnumbers[@]}"
    sleep 5
    # Wait for each job to complete
    for (( j=0; j < "${#jobnumbers[@]}"; j++)); do
        jobnumber=${jobnumbers[$j]}

        # Parse out the state of the job from the sacct function.
        # Format state to be all uppercase (PENDING, RUNNING, or COMPLETED)
        # and remove leading whitespace characters.
        state=$(sacct --format="State" -j $jobnumber | grep "\w[[:upper:]]\w")
        state="$(echo -e "${state}" | sed -e 's/^[[:space:]]*//')"

        # Keep checking the state while it is not "COMPLETED"
        secs=0
        while true; do

            # Only print every so often, not every 30 seconds
            if [[ $(( $secs % 30)) -eq 0 ]]; then
                echo "$(basename $WORKDIR) $(basename "$file"), ${jobnumber} is not finished yet. Current state is '${state}'"
            fi

            state=$(sacct --format="State" -j $jobnumber | grep "\w[[:upper:]]\w")
            state="$(echo -e "${state}" | sed -e 's/^[[:space:]]*//')"

                # Check if "COMPLETED" is anywhere in the state string variables.
                # This gets rid of some strange special character issues.
            if [[ $state == *"TIMEOUT"* ]] && [[ $state != "" ]]; then
                jf=${files[$j]}
                init_walltime=$(grep -oP '(?<=#SBATCH -t )[0-9]+:[0-9]+:[0-9]+' $jf)

                echo "${jobnumber} timedout due to too low a walltime (${init_walltime})."

                # Compute a new walltime and update the job file
                update_walltime.py "$jf" &> /dev/null

                updated_walltime=$(grep -oP '(?<=#SBATCH -t )[0-9]+:[0-9]+:[0-9]+' $jf)

                datetime=$(date +"%Y-%m-%d:%H-%M")
                echo "${datetime}: re-running: ${jf}: ${init_walltime} --> ${updated_walltime}" >> "${RUNFILES_DIR}"/rerun.log

                echo "Resubmitting file (${jf}) with new walltime of ${updated_walltime}"

                # Resubmit as a new job number
                jobnumber=$(submit_job_conditional $jf)
                exit_status="$?"
                if [[ $exit_status -eq 0 ]]; then
                    jobnumbers+=("$jobnumber")
                    files+=("$jf")
                    echo "${jf} resubmitted as jobumber: ${jobnumber}"
                else
                    echo "sbatch re-submit error message: $jobnumber"
                    echo "sbatch re-submit error: exit code $exit_status. Exiting."
                    exit 1
                fi

                break;

        elif [[ $state == *"COMPLETED"* ]] && [[ $state != "" ]]; then
                state="COMPLETED"
                echo "${jobnumber} is complete"
                break;

            elif [[ ( $state == *"FAILED"* ) &&  $state != "" ]]; then
                echo "${jobnumber} FAILED. Exiting with status code 1."
                exit 1; 
            elif [[ ( $state ==  *"CANCELLED"* ) &&  $state != "" ]]; then
                echo "${jobnumber} was CANCELLED. Exiting with status code 1."
                exit 1; 
            fi

            # Wait for 30 seconds before checking again
            sleep 30
            ((secs=secs+30))

            done

        echo Job"${jobnumber} is finished."

    done

    # Run check_job_output.py on all files
    cmd="check_job_outputs.py  ${files[@]}"
    echo "$cmd"
    $cmd
       exit_status="$?"
       if [[ $exit_status -ne 0 ]]; then
            echo "Error in submit_jobs.bash: check_job_outputs.py exited with code ($exit_status)."
            exit 1
       fi
done
Ovec8hkin commented 3 years ago

That's not the most updated code. If you view the file on GitHub, the most recent copy starts with a function named compute_total_tasks.

falkamelung commented 3 years ago

Hah, I did something stupid. My bad!

Yes, it works now. Great! This is big progress! Thank you !!!!

Two minor items:

  1. For total_tasks it stays below 1500. This is what it should do. For step_tasks it goes above 500. Does it really do this or does it display the number of tasks including the job it wants to submit? It would be preferred to stay below 500.
/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_2.job: 88 additional tasks
filter_coherence: 438 total tasks (maximum 500)
1361 tasks running across all jobs (maximum 1500)
Number of running/pending jobs: 20
1361 tasks running across all jobs (maximum 1500)
filter_coherence: 438 running/pending tasks (maximum 500)
/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_3.job: 88 additional tasks
filter_coherence: 526 total tasks (maximum 500)
1449 tasks running across all jobs (maximum 1500)
Number of running/pending jobs: 21
Couldnt submit job (/scratch/05861/tg851601/KilaueaSenAT124/run_files/run_15_filter_coherence_3.job), because there are too many active tasks for this step right now. Waiting 5 minutes to submit next job.
  2. The messaging is still not fully consistent. Is there a difference between generate_burst_igram: 485 running/pending tasks and generate_burst_igram: 573 total tasks?
    submit_jobs.bash $PWD --start 13
    Jobfiles to run: /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_10.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_11.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_1.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_2.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_3.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_4.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_5.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_6.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_7.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_8.job /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_9.job
    485 tasks running across all jobs (maximum 1500)
    generate_burst_igram: 485 running/pending tasks (maximum 500)
    /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job: 88 additional tasks
    generate_burst_igram: 573 total tasks (maximum 500)
    573 tasks running across all jobs (maximum 1500)
    Number of running/pending jobs: 11

Can we always use running/pending, as this is what it is? In the script the variable name should preferably contain active (you might have that already). For messaging, how about:

generate_burst_igram: 573 running/pending tasks (maximum 500)
all jobs: 1361  running/pending tasks (maximum 1500)
Ovec8hkin commented 3 years ago

There is a subtle difference between the two lines of logging:

>>> generate_burst_igram: 485 running/pending tasks (maximum 500)
>>> /scratch/05861/tg851601/KilaueaSenAT124/run_files/run_13_generate_burst_igram_0.job: 88 additional tasks
>>> generate_burst_igram: 573 total tasks (maximum 500)

Line 1 indicates the number of running/pending tasks for the specified step, NOT including the job that we are attempting to submit at that moment. Line 2 indicates the number of tasks associated with the job we are attempting to submit. Line 3 indicates the number of tasks that WOULD be running if the job is successfully submitted. If the number of tasks reported on line 3 exceeds the maximum number for the step, the job does not submit.

falkamelung commented 3 years ago

That is what I thought. So line 3 is not needed, right?

Ovec8hkin commented 3 years ago

Not technically, no. I leave it there for debugging purposes. You can comment it out if you would like.

falkamelung commented 3 years ago

Are you going to commit anything on this file? Otherwise I will do some minor modifications before I forget.

Ovec8hkin commented 3 years ago

I have no pending changes to make here.

falkamelung commented 3 years ago

Hi @Ovec8hkin

There is one bug left. When you run --start timeseries (or --start insarmaps) it does not work because the job files don't start with run_. This line https://github.com/geodesymiami/rsmas_insar/blob/d1c57b3219cd2a5a36cfaef04c794fc7685cc89c/minsar/submit_jobs.bash#L79 creates an error. A possible solution would be to first check whether the jobname contains run_, and otherwise skip the conditional completely for timeseries and insarmaps.
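A minimal sketch of that guard (illustrative only, reusing the step-name pattern from the current script; not the committed fix):

    job_basename=$(basename "$file")
    if [[ "$job_basename" != run_* ]]; then
        # wrapper jobs (smallbaseline_wrapper.job, insarmaps.job):
        # skip the per-step task check entirely and submit directly
        sbatch "$file"
    else
        step_name=$(echo "$job_basename" | grep -oP "(?<=run_\d{2}_)(.*)(?=_\d+\.job)")
        step_max_tasks="${step_max_task_list[$step_name]}"
        # ... continue with the existing task/job checks before calling sbatch ...
    fi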

submit_jobs.bash /scratch/05861/tg851601/unittestGalapagosSenDT128 --start timeseries
Jobfiles to run: /scratch/05861/tg851601/unittestGalapagosSenDT128/smallbaseline_wrapper.job
/work/05861/tg851601/stampede2/test/codedev1/rsmas_insar/minsar/submit_jobs.bash: line 80: step_io_load_list: bad array subscript
(standard_in) 2: syntax error

I cleaned the naming up a bit and switched to IO_load (it is used to calculate step_max_tasks).

Ovec8hkin commented 3 years ago

I just pushed this.

falkamelung commented 3 years ago

Hi @Ovec8hkin Can you please check whether the re-running is properly implemented when the maximum number of tasks is reached? In the case below it does not. This looks like the previous issue with re-running not working when 25 jobs were reached.

If you don't see the issue quickly I will have to make some effort to recreate it.

On a related issue, looking at the code and as suggested by Sara, I think we should remove the submit_job_conditional function from submit_jobs.bash and make it its own script, sbatch_conditional.bash. That will make submit_jobs.bash much lighter. If we need to submit a job we could use either sbatch or sbatch_conditional. That was

The regular sbatch output is at the end of this comment. sbatch_conditional.bash would show something similar, but with the task checks and potentially showing the waiting messages. The regular sbatch fails if the job limit is reached. Ours does not fail but waits and then tries again. What do you think? I like this idea a lot!
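A rough sketch of what a standalone sbatch_conditional.bash could look like (assumes the existing submit_job_conditional function is sourced; names are illustrative, not the committed design):

    #!/bin/bash
    # sbatch_conditional.bash <job_file>  --  sketch only
    # Behaves like sbatch, but instead of failing when job/task limits are hit,
    # it waits and retries until the job can be submitted.
    job_file=$1

    while true; do
        jobnumber=$(submit_job_conditional "$job_file")
        if [[ $? -eq 0 ]]; then
            echo "Submitted batch job $jobnumber"
            exit 0
        fi
        echo "Limits reached for $job_file; waiting 5 minutes before retrying." >&2
        sleep 300
    done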

708 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 46 running/pending tasks (maximum 500)
run_08_generate_burst_igram_8.job: 46 additional tasks
Number of running/pending jobs: 17
754 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 92 running/pending tasks (maximum 500)
run_08_generate_burst_igram_9.job: 46 additional tasks
Jobs submitted: 7175671 7175672 7175673 7175674 7175675 7175676 7175677 7175678 7175679 7175701 7175718 7175719
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175671 is not finished yet. Current state is 'TIMEOUT '
7175671 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_0.job) with new walltime of 00:22:24
Number of running/pending jobs: 18
800 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 138 running/pending tasks (maximum 500)
run_08_generate_burst_igram_0.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_0.job resubmitted as jobumber: 7175720
Job7175720 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175672 is not finished yet. Current state is 'TIMEOUT '
7175672 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_10.job) with new walltime of 00:22:24
Number of running/pending jobs: 19
846 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 184 running/pending tasks (maximum 500)
run_08_generate_burst_igram_10.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_10.job resubmitted as jobumber: 7175722
Job7175722 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175673 is not finished yet. Current state is 'TIMEOUT '
7175673 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_11.job) with new walltime of 00:22:24
Number of running/pending jobs: 20
892 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 230 running/pending tasks (maximum 500)
run_08_generate_burst_igram_11.job: 40 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_11.job resubmitted as jobumber: 7175723
Job7175723 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175674 is not finished yet. Current state is 'TIMEOUT '
7175674 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_1.job) with new walltime of 00:22:24
Number of running/pending jobs: 21
932 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 270 running/pending tasks (maximum 500)
run_08_generate_burst_igram_1.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_1.job resubmitted as jobumber: 7175724
Job7175724 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175675 is not finished yet. Current state is 'TIMEOUT '
7175675 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_2.job) with new walltime of 00:22:24
Number of running/pending jobs: 22
978 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 316 running/pending tasks (maximum 500)
run_08_generate_burst_igram_2.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_2.job resubmitted as jobumber: 7175725
Job7175725 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175676 is not finished yet. Current state is 'TIMEOUT '
7175676 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_3.job) with new walltime of 00:22:24
Number of running/pending jobs: 23
1024 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 362 running/pending tasks (maximum 500)
run_08_generate_burst_igram_3.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_3.job resubmitted as jobumber: 7175736
Job7175736 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175677 is not finished yet. Current state is 'TIMEOUT '
7175677 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_4.job) with new walltime of 00:22:24
Number of running/pending jobs: 24
1070 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 408 running/pending tasks (maximum 500)
run_08_generate_burst_igram_4.job: 46 additional tasks
/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_4.job resubmitted as jobumber: 7175737
Job7175737 is finished.
KokoxiliBigChunk40SenDT48 run_08_generate_burst_igram_9.job, 7175678 is not finished yet. Current state is 'TIMEOUT '
7175678 timedout due to too low a walltime (0:18:40).
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk40SenDT48/run_files/run_08_generate_burst_igram_5.job) with new walltime of 00:22:24
Number of running/pending jobs: 25
1116 running/pending tasks across all jobs (maximum 2000)
step generate_burst_igram: 454 running/pending tasks (maximum 500)
run_08_generate_burst_igram_5.job: 46 additional tasks
sbatch re-submit error message: 
sbatch re-submit error: exit code 1. Exiting.

Regular sbatch output

idevdev

 -> Checking on the status of skx-dev queue. OK

 -> Defaults file    : ~/.idevrc    
 -> System           : stampede2    
 -> Queue            : skx-dev        (cmd line: -p        )
 -> Nodes            : 1              (cmd line: -N        )
 -> Total tasks      : 48             (cmd line: -n        )
 -> Time (minutes)   : 30             (idev  default       )
 -> Project          : TG-EAR200012   (~/.idevrc           )

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer                 
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work dir (/work/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-EAR200012)...OK
--> Verifying that quota for filesystem /home1/05861/tg851601 is at 82.76% allocated...OK
--> Verifying that quota for filesystem /work/05861/tg851601/stampede2 is at 34.19% allocated...OK
Submitted batch job 7176714
mirzaees commented 3 years ago

Hi @falkamelung and @Ovec8hkin

Yes, this would be really helpful if implemented. I need sbatch_conditional.bash to receive a batch file (with any naming convention) and submit all corresponding jobs (numbered like *_1.job). And then it does all the checking for number of tasks and fails.

For example, I have a batch file named run_my_batchfile and I have split the tasks into multiple jobs like run_my_batchfile_0.job, run_my_batchfile_1.job, ... Now I want to submit them using sbatch_conditional.bash: sbatch_conditional.bash run_my_batchfile

It should take care of finding the jobs, submitting them, and checking everything.
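Something along these lines could work as a sketch (assumes the job files sit next to the batch file and that a conditional-submit helper such as submit_job_conditional is available):

    # sbatch_conditional.bash run_my_batchfile  --  sketch of the requested interface
    prefix=$1
    job_files=("${prefix}"_*.job)                   # run_my_batchfile_0.job, _1.job, ...

    for jf in "${job_files[@]}"; do
        jobnumber=$(submit_job_conditional "$jf")   # assumed helper doing the task/job checks
        echo "submitted $jf as job $jobnumber"
    done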

Ovec8hkin commented 3 years ago

@falkamelung What's the problem here? I don't see the task limits being exceeded. I'm using the sbatch_conditional for resubmission, so all of the checks are performed if there is a TIMEOUT error.

falkamelung commented 3 years ago

It says sbatch re-submit error: exit code 1 and then it exits. It should not exit. Did you see that?

Ovec8hkin commented 3 years ago

It throws that error message and exits after attempting to submit three times and failing. If you want to change the wait times, they are in the sbatch_conditional function.

falkamelung commented 3 years ago

This is not clear to me yet. The message above was produced here: https://github.com/geodesymiami/rsmas_insar/blob/caff6511106ff93306303ba154246c85d92718c4/minsar/submit_jobs.bash#L322-L332

Why did sbatch_conditional fail? When it could not submit the job because there were too many running/pending jobs, it should have waited 5 minutes.

I see that the 5-minute wait is not within the sbatch_conditional function https://github.com/geodesymiami/rsmas_insar/blob/caff6511106ff93306303ba154246c85d92718c4/minsar/submit_jobs.bash#L264-L277 Should it be?

This might be covered by Sara's sbatch_conditional.bash suggestion. The script would always submit the job; it just waits until all conditions are satisfied.

We might want to have a call as I am not sure we are on the same page. You said you fixed the timeout-max_number_of_jobs issue but I lost track of what you did and whether it really solved the issue.

Ovec8hkin commented 3 years ago
>>> We might want to have a call as I am not sure we are on the same page. You said you fixed the timeout-max_number_of_jobs issue but I lost track of what you did and whether it really solved the issue.

This is because you don't use issues properly and just continue on these huge threads about unrelated problems instead of keeping related thoughts in a single issue.

The results you're getting are as expected. Previously you requested that the script attempt to submit a job, and if the actual submission fails (i.e. sbatch job_file.job returns a non-zero exit code) to wait temporarily and try again. If it fails 3 times consecutively, then exit. That is what happened and caused the previous termination. Is this no longer the desired functionality?

The five-minute wait is for when the job cannot be submitted at all due to too many running jobs or too many tasks. In that case, we wait five minutes and try again, hoping resources have been freed so the submission succeeds.

falkamelung commented 3 years ago

Yes, these huge threads are not ideal........

The desired behavior is that it waits 5 minutes if it can't submit because of too many tasks/jobs. This applies for first-submits as well as re-submits.

I am trying to think where the 3-times-consecutive-failure behavior comes from. Something like this applies to minsar_wrapper and I do see it implemented (I did not get a chance to test that yet).
At some point I might have suggested submitting and analyzing the returned error message, but the way we do it now is much better.

Hmmm... Do you remember how this came up? What would be the benefits of trying 3 times? If a job has not been submitted after, say, 72 hours, that would be a reason to raise an exception.

Ovec8hkin commented 3 years ago

@falkamelung How long do you want me to attempt to submit a job file for before exiting? Do you want to try a set number of times with increasingly long wait times each time (say 5 times total, with an additional 5-minute wait after each failure, so 5 min, 10 min, 15 min, 20 min, 25 min), or do you want to try for a specific amount of time (say 12 hours)?

falkamelung commented 3 years ago

It should not exit. Put a 1-week limit. The reason why it has to wait is because I submitted too many jobs at the same time. This is exactly what I want to do: submit several big jobs and wait a week until they are done instead of submitting a new job every day.

Ovec8hkin commented 3 years ago

Ok. I will let the script attempt to submit a job for 1 week (7 days). If, after a week, the constraints on job submission are still in place, I will let it exit with an error code.
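For reference, the agreed behaviour could look roughly like this (sketch only; the 5-minute wait and 1-week limit are the values discussed above, and submit_job_conditional is the existing helper):

    wait_time=300                    # 5 minutes between attempts
    max_wait=$((7*24*60*60))         # give up after one week
    waited=0

    while ! jobnumber=$(submit_job_conditional "$file"); do
        if (( waited >= max_wait )); then
            echo "Could not submit $file within one week. Exiting." >&2
            exit 1
        fi
        echo "Limits reached; waiting 5 minutes before trying again." >&2
        sleep $wait_time
        ((waited += wait_time))
    done
    echo "Submitted $file as job $jobnumber"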

Ovec8hkin commented 3 years ago

@mirzaees please create a new issue with information about how you want a generalized sbatch_conditional function to operate. Please include example inputs and outputs, requested command line options, and file naming conventions that this is intended to work with.

falkamelung commented 3 years ago

done