geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

Stampede2: python_cacher suggested by TACC does not work #434

Closed falkamelung closed 3 years ago

falkamelung commented 3 years ago

On the advice of Si Li at TACC, I added `module load python_cacher` to the job files. Unfortunately, the jobs either fail (error message Fail to open new file) or take very long (I have not investigated which of the two occurs in which cases). Three cases are documented here: (1) the job file as suggested by Si, (2) the original job file, which works fine, and (3) a previous iteration suggested by TACC (`module load python_cacher` plus an `export LD_PRELOAD` command):

  1. Job file with python cacher:
cat run_files/run_07_pairs_misreg_0.job 
#! /bin/bash
#SBATCH -J run_07_pairs_misreg_0
#SBATCH -A TG-EAR200012
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 2
#SBATCH -n 96
#SBATCH -o /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.o
#SBATCH -e /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.e
#SBATCH -p skx-normal
#SBATCH -t 00:43:12

module load launcher
export OMP_NUM_THREADS=2
export PATH=/scratch/05861/tg851601/code2/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/IsraelBig40SenDT21
export LAUNCHER_PPN=44

export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0
module load python_cacher 

$LAUNCHER_DIR/paramrun

On an interactive node I ran module load python_cacher and got the following error message when running SentinelWrapper.py:

/scratch/05861/tg851601/IsraelBig40SenDT21/run_files_err07_module_python_cacher[1007] SentinelWrapper.py -c /scratch/05861/tg851601/IsraelBig40SenDT21/configs/config_misreg_20160118_20160130 
pDir_Info[idx_Rec].Offset > MAX_ENTRY_BUFF_LEN
You need to increase MAX_ENTRY_BUFF_LEN
Quit
  2. Without python_cacher it works fine. Below are the job file and the run times. The first run_07 job (run_07_pairs_misreg_0) takes about twice as long as the others (e.g. run_07_pairs_misreg_1), which indicates a significant problem, but one unrelated to python_cacher.
    
    cat run_07_pairs_misreg_0.job 
    #! /bin/bash
    #SBATCH -J run_07_pairs_misreg_0
    #SBATCH -A TG-EAR200012
    #SBATCH --mail-user=famelung@rsmas.miami.edu
    #SBATCH --mail-type=fail
    #SBATCH -N 2
    #SBATCH -n 96
    #SBATCH -o /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.o
    #SBATCH -e /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.e
    #SBATCH -p skx-normal
    #SBATCH -t 0:26:00

module load launcher
export OMP_NUM_THREADS=2
export PATH=/scratch/05861/tg851601/code2/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/IsraelBig40SenDT21
export LAUNCHER_PPN=44

export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0

$LAUNCHER_DIR/paramrun

Number of bursts: 24

                       NNodes  Timelimit  Reserved  Elapsed   Time_per_burst
run_07_pairs_misreg_0  2       00:26:00   02:46:08  00:19:59  00:00:49
run_07_pairs_misreg_1  2       00:26:00   05:19:14  00:09:38  00:00:24
run_07_pairs_misreg_2  2       00:26:00   05:19:11  00:09:35  00:00:23
run_07_pairs_misreg_3  2       00:26:00   05:19:09  00:09:24  00:00:23
run_07_pairs_misreg_4  2       00:26:00   05:19:06  00:09:39  00:00:24
run_07_pairs_misreg_5  2       00:26:00   05:19:06  00:09:16  00:00:23
run_07_pairs_misreg_6  2       00:26:00   05:19:01  00:09:43  00:00:24
run_07_pairs_misreg_7  2       00:26:00   05:31:02  00:08:34  00:00:21
run_07_pairs_misreg_8  2       00:38:00   04:48:32  00:00:48  00:00:02
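The Time_per_burst column appears to be the elapsed time divided by the number of bursts (24), truncated to whole seconds. A minimal sketch of that arithmetic follows; the function names `hms_to_seconds` and `seconds_per_burst` are hypothetical and not part of rsmas_insar.

```shell
#!/bin/bash
# Convert HH:MM:SS to seconds. The 10# prefix forces base 10 so that
# fields such as "08" or "09" are not rejected as invalid octal numbers.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Elapsed time divided by the burst count, truncated to whole seconds.
seconds_per_burst() {
    local elapsed=$1 bursts=$2
    echo $(( $(hms_to_seconds "$elapsed") / bursts ))
}

seconds_per_burst 00:19:59 24   # run_07_pairs_misreg_0 -> 49
seconds_per_burst 00:09:38 24   # run_07_pairs_misreg_1 -> 24
```

With these numbers the first job needs roughly twice as many seconds per burst as the second, which matches the table.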


3. Previously I had `module load python_cacher` and `export LD_PRELOAD=/home1/apps/tacc-patches/python_cacher/myopen.so` (as advised by TACC), but launcher throws a Bus error. My guess is that this is caused by higher memory usage (Si said 400 MB per job, corresponding to ~20 GB for 48 jobs running on one node). I did not systematically test whether giving the jobs more memory would resolve the problem.
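The memory estimate can be checked with a few lines of shell. This is only a sketch under the figures quoted above (400 MB per job, 48 jobs per node); the function names are hypothetical, and reading `MemAvailable` from `/proc/meminfo` assumes a Linux compute node.

```shell
#!/bin/bash
# Total memory in MB if every task on a node uses the same amount.
node_memory_mb() {
    local per_task_mb=$1 tasks_per_node=$2
    echo $(( per_task_mb * tasks_per_node ))
}

# Memory currently available on this node in MB (Linux only).
available_memory_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

needed=$(node_memory_mb 400 48)   # figures quoted in the comment above
echo "estimated python_cacher overhead: ${needed} MB"
if [ -r /proc/meminfo ]; then
    echo "available on this node: $(available_memory_mb) MB"
fi
```

For 48 tasks this gives 19200 MB; with `LAUNCHER_PPN=44`, as set in the job files, the same estimate would be 17600 MB. Comparing that number with the available memory on a compute node would be one way to test the memory hypothesis.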

//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1046] cat run_files_err07_bus/run_07_pairs_misreg_0.job

#! /bin/bash
#SBATCH -J run_07_pairs_misreg_0
#SBATCH -A TG-EAR200012
#SBATCH --mail-user=famelung@rsmas.miami.edu
#SBATCH --mail-type=fail
#SBATCH -N 2
#SBATCH -n 96
#SBATCH -o /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.o
#SBATCH -e /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_%J.e
#SBATCH -p skx-normal
#SBATCH -t 0:26:00

module load launcher
export OMP_NUM_THREADS=2
export PATH=/scratch/05861/tg851601/code2/rsmas_insar/sources/isce2/contrib/stack/topsStack:$PATH
export LAUNCHER_WORKDIR=/scratch/05861/tg851601/IsraelBig40SenDT21
export LAUNCHER_PPN=44

export LAUNCHER_JOB_FILE=/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0
module load python_cacher
export LD_PRELOAD=/home1/apps/tacc-patches/python_cacher/myopen.so

$LAUNCHER_DIR/paramrun

//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1046] lsd | grep run
run_files_err07_bus/
run_files/
//login3/scratch/05861/tg851601/IsraelBig40SenDT21[1047] cat run_files_err07_bus/out_run_07_pairs_misreg.e
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_0_6867976.e
#########################
using /tmp/launcher.6867976.hostlist.7K3t6G1u to get hosts
starting job on c489-031
starting job on c490-134
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_1_6867977.e
#########################
using /tmp/launcher.6867977.hostlist.56JwfLi4 to get hosts
starting job on c491-071
starting job on c491-072
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_2_6867978.e
#########################
using /tmp/launcher.6867978.hostlist.vi7ZSr3j to get hosts
starting job on c491-082
starting job on c491-093
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_3_6867979.e
#########################
using /tmp/launcher.6867979.hostlist.x21luGCy to get hosts
starting job on c477-044
starting job on c477-051
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_4_6867980.e
#########################
using /tmp/launcher.6867980.hostlist.8bfEnA9e to get hosts
starting job on c496-094
starting job on c499-103
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_5_6867981.e
#########################
using /tmp/launcher.6867981.hostlist.recY4lVQ to get hosts
starting job on c502-034
starting job on c502-041
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_6_6867982.e
#########################
using /tmp/launcher.6867982.hostlist.FAK4Dqw1 to get hosts
starting job on c500-132
starting job on c502-014
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_6867983.e
#########################
using /tmp/launcher.6867983.hostlist.OpcNm9pn to get hosts
starting job on c504-111
starting job on c504-112
/opt/apps/launcher/launcher-3.7/launcher: line 93: 218863 Bus error SentinelWrapper.py -c /scratch/05861/tg851601/IsraelBig40SenDT21/configs/config_misreg_20191210_20191222 > /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_2019121020191222$LAUNCHER_JID.o 2> /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_2019121020191222$LAUNCHER_JID.e
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_8_6868003.e
#########################
using /tmp/launcher.6868003.hostlist.NC1XIGf5 to get hosts
starting job on c504-113
starting job on c504-114
#########################
/scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_error_matches.e
#########################
Error: "Bus" found in /scratch/05861/tg851601/IsraelBig40SenDT21/run_files/run_07_pairs_misreg_7_6867983.e
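
The last entry reports which stderr file contains the string "Bus". A check of this kind can be reproduced by hand with `grep -l`; the sketch below assumes a plain string search over the `*.e` files and is not the actual rsmas_insar implementation. The function name `list_error_files` is hypothetical.

```shell
#!/bin/bash
# Print the name of every *.e file in a directory that contains the pattern.
# grep -l lists matching file names only; -- protects patterns starting with "-".
list_error_files() {
    local dir=$1 pattern=$2
    grep -l -- "$pattern" "$dir"/*.e 2>/dev/null
}

# Example (path taken from the log above):
#   list_error_files /scratch/05861/tg851601/IsraelBig40SenDT21/run_files "Bus error"
```

Searching for "Bus error" rather than "Bus" would reduce the chance of matching unrelated text in the logs.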

falkamelung commented 3 years ago

Received feedback from TACC, which has been implemented and works. We can now install and run the code from the $WORK directory.