30.07.2023 - investigating SLURM crash on Rum nodes
python call for BL-BEATS-WS01
python /home/beats/PycharmProjects/BEATS_recon/scripts/rum/BEATS_recon.py /mnt/PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /home/beats/Data/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /home/beats/Data/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
* this call is successful with both `--phase_pad` and `--no-phase_pad`; executed in < 1 min
python calls for Rum
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
Standard reconstruction:
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_test/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36
* this call is successful on cpunode; executed in < 2 sec
Phase reconstruction:
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
* with `--phase_pad` this call is successful on cpunode; executed in < 2 sec
* @Salman-matalgah with `--no-phase_pad` **this call is stuck indefinitely on cpunode. Even killing the python processes one by one does not help** (see the minimal sketch below)
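For reference, a minimal, data-free sketch of the call that appears to hang. It assumes `--no-phase_pad` maps to `pad=False` in `tomopy.retrieve_phase` (as in the interactive reproduction further down); the array shape and values are purely illustrative:

```python
import numpy as np
import tomopy

# Synthetic projection stack (angles, rows, cols); the contents are arbitrary.
projs = np.random.rand(400, 200, 1440).astype(np.float32)

# Parameter values mirror the failing call; pad=False corresponds to --no-phase_pad.
projs = tomopy.retrieve_phase(projs, pixel_size=0.1 * 0.00065, dist=0.1 * 20,
                              energy=20, alpha=0.0002, pad=False, ncore=36)
```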
I was able to reproduce the issue on gpunode2; see the error below, which I got when running the code natively on gpunode2. I will check the installed modules on all CPU/GPU nodes, then run another test.
(tomopy) [root@gpunode2 ~]# python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
spefile module not found
EdfFile module not found
Process ForkPoolWorker-37:
Process ForkPoolWorker-31:
Process ForkPoolWorker-32:
Process ForkPoolWorker-35:
Process ForkPoolWorker-34:
Process ForkPoolWorker-33:
Process ForkPoolWorker-36:
Traceback (most recent call last):
  File "/PETRA/cluster_software/install/anaconda/envs/tomopy/lib/python3.10/site-packages/tomopy/util/mproc.py", line 312, in distribute_jobs
Any python call on gpunode1 returns
Bus error (core dumped)
I will need to run one of these python calls interactively (natively) to check the logs; this could be caused by the code, memory, or libraries. I would prefer to do this together once you are back.
@Salman-matalgah I could finally reproduce on gpunode2 within python!
ssh gpunode2
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
python
import dxchange
import tomopy
import numpy as np
h5file = "/PETRA/SED/BEATS/IH/fiber_drying_dynamic-20230726T110245/fiber_drying_dynamic-20230726T110245.h5"
recon_dir = "/PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_drying_dynamic-20230726T110245/phase/recon_01/"
ncore = 48
projs, flats, darks, _ = dxchange.read_aps_32id(h5file, exchange_rank=0, proj=(1,401,1), sino=(1000, 1361, 1))
theta = np.radians(dxchange.read_hdf5(h5file, 'exchange/theta', slc=((1,401,1),)))
projs = tomopy.normalize(projs, flats, darks, ncore=ncore)  # pass ncore by keyword: the fourth positional argument of tomopy.normalize is cutoff, not ncore
projs = tomopy.retrieve_phase(projs, pixel_size=0.1*0.00065, dist=0.1*20, energy=20, alpha=0.0002, pad=False, ncore=48, nchunk=None)  # this is the call that never returns
Ctrl+C
I get:
^CProcess ForkPoolWorker-95:
^CTraceback (most recent call last):
  File "/PETRA/cluster_software/install/anaconda/envs/tomopy/lib/python3.10/site-packages/tomopy/util/mproc.py", line 312, in distribute_jobs
@ifoudeh, note: the only host that gets stuck is gpunode2; cpunode and gpunode1 have no issues at all with the same code and runs. This is to be reported to Lenovo support, since gpunode1 and gpunode2 are supposed to be identical, both technically and on the application/software side.
New error on gpunode1:
(tomopy) [root@gpunode1 ~]# python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
Bus error (core dumped)
(tomopy) [root@gpunode1 ~]# df -h
df: /PETRA: Stale file handle
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                  284G     0  284G   0% /dev
tmpfs                     284G   18M  284G   1% /dev/shm
tmpfs                     284G   59M  284G   1% /run
tmpfs                     284G     0  284G   0% /sys/fs/cgroup
/dev/mapper/xcatvg-root   218G  9.1G  209G   5% /
/dev/sda2                1014M  242M  773M  24% /boot
/dev/sda1                 256M  5.8M  250M   3% /boot/efi
10.1.32.20:/opt/ohpc/pub  212G  107G  106G  51% /opt/ohpc/pub
tmpfs                      57G     0   57G   0% /run/user/1002
tmpfs                      57G     0   57G   0% /
This is to be reported to the Lenovo team.
This error also happens while testing DAQ:
Traceback (most recent call last):
  File "/home/control/BEATSH5Writer/runAsServer.py", line 75, in runWriterServer
    await self.writeNewFile()
  File "/home/control/BEATSH5Writer/runAsServer.py", line 97, in writeNewFile
    os.makedirs(filePath)
  File "/usr/local/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 5] Input/output error: '/PETRA/SED/BEATS/IH/testRMRM-20231022T140431'
./start_writer: line 14: 1601149 Quit (core dumped) python runAsServer.py --detector pco --motionStage micos --scanMode continuous
The cause of the problem is excessive resource allocation when TomoPy and SLURM (alone or in combination) assign CPU resources to the reconstruction job. On the workstation this does not happen. Please see a full report below and in: /PETRA/SED/BEATS/IH/scratch/AlHandawi/test/rum_bug_tests
As you will see, the workstation remains ~2 times faster than Rum, but I think we can improve this by fine-tuning the SLURM header and the Python call on Rum (without running into the same problem again!).
Thank you all for the good support.
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1200 1240 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.00002_b/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/ --cor 718.5 --ncore 8 --phase --sdd 20 --pixelsize 0.00065 --energy 16 --alpha 0.00002 --algorithm fbp_cuda_astra
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1200 1240 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.00002_b_WS/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/test/ --cor 718.5 --ncore 36 --phase --sdd 20 --pixelsize 0.00065 --energy 16 --alpha 0.00002 --algorithm fbp_cuda_astra
#!/bin/bash
#SBATCH --job-name=BEATS_rec_%j
#SBATCH --output=BEATS_rec_%j.out
#SBATCH --error=BEATS_rec_%j.err
#SBATCH --ntasks=11
#SBATCH --cpus-per-task=8
#SBATCH --time=00:20:00
#SBATCH --partition=gpu
#SBATCH --nodelist=gpunode1
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=2G
# Modules section:
ml load anaconda/tomopy
# Variables section:
export NUMEXPR_MAX_THREADS=96
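On the Python side, a hedged sketch of how ncore could be derived from the actual SLURM allocation instead of being hard-coded, to avoid the oversubscription described above; the helper name and fallback values are illustrative and not part of BEATS_recon.py:

```python
import multiprocessing
import os

def pick_ncore(default=8):
    """Illustrative helper: prefer the SLURM allocation, then the CPUs this
    process may actually use, then a conservative default."""
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus:
        return int(slurm_cpus)
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is not available outside Linux
        return min(default, multiprocessing.cpu_count())

ncore = pick_ncore()
# e.g. tomopy.retrieve_phase(projs, ..., ncore=ncore)
```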