As advised by Si Li@TACC, I added `module load python_cacher` to the job files. Unfortunately, the jobs either fail (error message `Fail to open new file`) or take very long (I have not investigated which of the two occurs in which cases). Three cases are documented here: (1) job files as suggested by Si, (2) the original job file, which works fine, and (3) a previous iteration suggested by TACC (`module load python_cacher` plus an `export LD_PRELOAD` command).
1. On an interactive node I ran `module load python_cacher` and got the following error message:
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files_err07_module_python_cacher[1007] SentinelWrapper.py -c /scratch/05861/tg851601/IsraelBig40SenDT21/configs/config_misreg_20160118_20160130
pDir_Info[idx_Rec].Offset > MAX_ENTRY_BUFF_LEN
You need to increase MAX_ENTRY_BUFF_LEN
Quit
2. Without python_cacher it works fine. Below are the job file and run times. The first run_07 job (run_07_pairs_misreg_0) takes twice as long as the others (e.g. run_07_pairs_misreg_1), which indicates a significant problem, but one unrelated to python_cacher.
3. Previously I had `module load python_cacher` and `export LD_PRELOAD=/home1/apps/tacc-patches/python_cacher/myopen.so` (as advised by TACC), but launcher throws a Bus error. My guess is that this is caused by larger memory usage (Si said 400 MB per job, corresponding to ~20 GB for 48 jobs running on one node). I did not test systematically whether giving the jobs more memory would resolve the problem.
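The memory estimate in case 3 is simple arithmetic and can be checked with a short sketch. The function name `node_memory_mb` is hypothetical; the 400 MB per job and 48 jobs per node are the figures quoted above, not measured values.

```shell
# Estimate the total memory used on one node, in MB.
# Arguments: memory per job in MB, number of jobs on the node.
node_memory_mb() {
    echo $(( $1 * $2 ))
}

# 400 MB per job (figure quoted from Si) times 48 jobs on one node.
node_memory_mb 400 48   # prints 19200, i.e. roughly 20 GB
```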
Job file without python_cacher (case 2):
module load launcher
export OMP_NUM_THREADS=2
export PATH=/scratch/05861/tg851601/code2/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/IsraelBig40SenDT21
export LAUNCHER_PPN=44
export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0
$LAUNCHER_DIR/paramrun

Run times (number of bursts: 24):

| Job | NNodes | Timelimit | Reserved | Elapsed | Time_per_burst |
|---|---|---|---|---|---|
| run_07_pairs_misreg_0 | 2 | 00:26:00 | 02:46:08 | 00:19:59 | 00:00:49 |
| run_07_pairs_misreg_1 | 2 | 00:26:00 | 05:19:14 | 00:09:38 | 00:00:24 |
| run_07_pairs_misreg_2 | 2 | 00:26:00 | 05:19:11 | 00:09:35 | 00:00:23 |
| run_07_pairs_misreg_3 | 2 | 00:26:00 | 05:19:09 | 00:09:24 | 00:00:23 |
| run_07_pairs_misreg_4 | 2 | 00:26:00 | 05:19:06 | 00:09:39 | 00:00:24 |
| run_07_pairs_misreg_5 | 2 | 00:26:00 | 05:19:06 | 00:09:16 | 00:00:23 |
| run_07_pairs_misreg_6 | 2 | 00:26:00 | 05:19:01 | 00:09:43 | 00:00:24 |
| run_07_pairs_misreg_7 | 2 | 00:26:00 | 05:31:02 | 00:08:34 | 00:00:21 |
| run_07_pairs_misreg_8 | 2 | 00:38:00 | 04:48:32 | 00:00:48 | 00:00:02 |
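The Time_per_burst values are consistent with Elapsed divided by the number of bursts (24), truncated to whole seconds. Below is a minimal sketch of that arithmetic, assuming this is how the column was derived; the function names `hms_to_seconds` and `seconds_per_burst` are hypothetical.

```shell
# Convert HH:MM:SS to seconds. The body runs in a subshell so the
# helper variables do not leak into the calling shell.
hms_to_seconds() (
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so that values such as "09" are not parsed as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

# Elapsed time divided by the number of bursts, in whole seconds.
# Arguments: elapsed time as HH:MM:SS, number of bursts.
seconds_per_burst() {
    echo $(( $(hms_to_seconds "$1") / $2 ))
}

seconds_per_burst 00:19:59 24   # run_07_pairs_misreg_0, prints 49
seconds_per_burst 00:09:38 24   # run_07_pairs_misreg_1, prints 24
```

The factor of about two between the first job and the others shows up directly in these per-burst values.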
Job file with python_cacher and LD_PRELOAD (case 3):
//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1046] cat run_files_err07_bus/run_07_pairs_misreg_0.job
#!/bin/bash
#SBATCH -J run_07_pairs_misreg_0
#SBATCH -A TG-EAR200012
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 2
#SBATCH -n 96
#SBATCH -o /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.o
#SBATCH -e /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.e
#SBATCH -p skx-normal
#SBATCH -t 0:26:00
module load launcher
export OMP_NUM_THREADS=2
export PATH=/scratch/05861/tg851601/code2/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/IsraelBig40SenDT21
export LAUNCHER_PPN=44
export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0
module load python_cacher
export LD_PRELOAD=/home1/apps/tacc-patches/python_cacher/myopen.so
$LAUNCHER_DIR/paramrun
//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1046] lsd | grep run
run_files_err07_bus/
run_files/
//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1047] cat run_files_err07_bus/out_run_07_pairs_misreg.e
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_6867976.e
#########################
using /tmp/launcher.6867976.hostlist.7K3t6G1u to get hosts
starting job on c489-031
starting job on c490-134
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_1_6867977.e
#########################
using /tmp/launcher.6867977.hostlist.56JwfLi4 to get hosts
starting job on c491-071
starting job on c491-072
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_2_6867978.e
#########################
using /tmp/launcher.6867978.hostlist.vi7ZSr3j to get hosts
starting job on c491-082
starting job on c491-093
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_3_6867979.e
#########################
using /tmp/launcher.6867979.hostlist.x21luGCy to get hosts
starting job on c477-044
starting job on c477-051
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_4_6867980.e
#########################
using /tmp/launcher.6867980.hostlist.8bfEnA9e to get hosts
starting job on c496-094
starting job on c499-103
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_5_6867981.e
#########################
using /tmp/launcher.6867981.hostlist.recY4lVQ to get hosts
starting job on c502-034
starting job on c502-041
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_6_6867982.e
#########################
using /tmp/launcher.6867982.hostlist.FAK4Dqw1 to get hosts
starting job on c500-132
starting job on c502-014
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_6867983.e
#########################
using /tmp/launcher.6867983.hostlist.OpcNm9pn to get hosts
starting job on c504-111
starting job on c504-112
/opt/apps/launcher/launcher-3.7/launcher: line 93: 218863 Bus error SentinelWrapper.py -c /scratch/05861/tg851601/IsraelBig40SenDT21/configs/config_misreg_20191210_20191222 > /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_2019121020191222$LAUNCHER_JID.o 2> /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_2019121020191222$LAUNCHER_JID.e
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_8_6868003.e
#########################
using /tmp/launcher.6868003.hostlist.NC1XIGf5 to get hosts
starting job on c504-113
starting job on c504-114
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_error_matches.e
#########################
Error: "Bus" found in /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_6867983.e
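
The last entry in the log reports which stderr file contains the string "Bus". How that report was generated is not shown here. A similar check can be sketched with `grep -l`, which lists the files that contain a pattern. The function name `find_error_matches` is hypothetical and is not part of launcher or of the workflow above.

```shell
# List the *.e files in a directory that contain a given pattern.
# Arguments: directory, pattern (a basic regular expression).
# The body runs in a subshell so the helper variables stay local.
find_error_matches() (
    dir=$1
    pattern=$2
    # -l prints only the names of matching files; errors from
    # missing files are suppressed.
    grep -l -- "$pattern" "$dir"/*.e 2>/dev/null
)

# Example (path taken from the log above):
#   find_error_matches /scratch/05861/tg851601/IsraelBig40SenDT21/run_files Bus
```

The function returns a non-zero exit status when no file matches, so it can also be used directly in an `if` statement.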