30.07.2023 - investigating SLURM crash on Rum nodes
python call for BL-BEATS-WS01
python /home/beats/PycharmProjects/BEATS_recon/scripts/rum/BEATS_recon.py /mnt/PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /home/beats/Data/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /home/beats/Data/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
* this call is successful with both `--phase_pad` and `--no-phase_pad`; executed in < 1 min
python calls for Rum
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
Standard reconstruction:
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_test/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36
* this call is successful on cpunode; executed in < 2 sec
Phase reconstruction:
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
* with `--phase_pad` this call is successful on cpunode; executed in < 2 sec
* @Salman-matalgah with `--no-phase_pad` **this call is stuck indefinitely on cpunode. Even killing the python processes one by one does not help** (see the minimal sketch below)
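For reference, a minimal, data-free sketch of the call that appears to hang. It assumes `--no-phase_pad` maps to `pad=False` in `tomopy.retrieve_phase` (as in the interactive reproduction further down); the array shape and values are purely illustrative:

```python
import numpy as np
import tomopy

# Synthetic projection stack (angles, rows, cols); the contents are arbitrary.
projs = np.random.rand(400, 200, 1440).astype(np.float32)

# Parameter values mirror the failing call; pad=False corresponds to --no-phase_pad.
projs = tomopy.retrieve_phase(projs, pixel_size=0.1 * 0.00065, dist=0.1 * 20,
                              energy=20, alpha=0.0002, pad=False, ncore=36)
```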
I was able to reproduce the issue on gpunode2; see the error below, which I got when running the code natively on gpunode2. I will check the installed modules on all CPU/GPU nodes, then run another test.
(tomopy) [root@gpunode2 ~]# python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
spefile module not found
EdfFile module not found
Process ForkPoolWorker-37:
Process ForkPoolWorker-31:
Process ForkPoolWorker-32:
Process ForkPoolWorker-35:
Process ForkPoolWorker-34:
Process ForkPoolWorker-33:
Process ForkPoolWorker-36:
Traceback (most recent call last):
  File "/PETRA/cluster_software/install/anaconda/envs/tomopy/lib/python3.10/site-packages/tomopy/util/mproc.py", line 312, in distribute_jobs
Any python call on gpunode1 returns
Bus error (core dumped)
I will need to run one of these python calls interactively (natively) to check the logs; this could be caused by the code, memory, or libraries. I would prefer to do this together once you are back.
@Salman-matalgah I could finally reproduce on gpunode2 within python!
ssh gpunode2
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
python
import dxchange
import tomopy
import numpy as np
h5file = "/PETRA/SED/BEATS/IH/fiber_drying_dynamic-20230726T110245/fiber_drying_dynamic-20230726T110245.h5"
recon_dir = "/PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_drying_dynamic-20230726T110245/phase/recon_01/"
ncore = 48
projs, flats, darks, _ = dxchange.read_aps_32id(h5file, exchange_rank=0, proj=(1,401,1), sino=(1000, 1361, 1))
theta = np.radians(dxchange.read_hdf5(h5file, 'exchange/theta', slc=((1,401,1),)))
projs = tomopy.normalize(projs, flats, darks, ncore=ncore)  # pass ncore by keyword: the fourth positional argument of tomopy.normalize is cutoff, not ncore
projs = tomopy.retrieve_phase(projs, pixel_size=0.1*0.00065, dist=0.1*20, energy=20, alpha=0.0002, pad=False, ncore=48, nchunk=None)  # this is the call that never returns
Ctrl+C
I get:
^CProcess ForkPoolWorker-95:
^CTraceback (most recent call last):
  File "/PETRA/cluster_software/install/anaconda/envs/tomopy/lib/python3.10/site-packages/tomopy/util/mproc.py", line 312, in distribute_jobs
@ifoudeh, note: the only host that gets stuck is gpunode2; cpunode and gpunode1 have no issues at all with the same code and runs. This is to be reported to Lenovo support, since gpunode1 and gpunode2 are supposed to be identical, both technically and on the application/software side.
New error on gpunode1:
(tomopy) [root@gpunode1 ~]# python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1000 1200 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.0002/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/ --cor 718.5 --ncore 36 --phase --no-phase_pad --alpha 0.0002 --pixelsize 0.00065 --sdd 20
Bus error (core dumped)
(tomopy) [root@gpunode1 ~]# df -h
df: /PETRA: Stale file handle
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                  284G     0  284G   0% /dev
tmpfs                     284G   18M  284G   1% /dev/shm
tmpfs                     284G   59M  284G   1% /run
tmpfs                     284G     0  284G   0% /sys/fs/cgroup
/dev/mapper/xcatvg-root   218G  9.1G  209G   5% /
/dev/sda2                1014M  242M  773M  24% /boot
/dev/sda1                 256M  5.8M  250M   3% /boot/efi
10.1.32.20:/opt/ohpc/pub  212G  107G  106G  51% /opt/ohpc/pub
tmpfs                      57G     0   57G   0% /run/user/1002
tmpfs                      57G     0   57G   0% /
This is to be reported to the Lenovo team.
This error also happens while testing DAQ:
Traceback (most recent call last):
  File "/home/control/BEATSH5Writer/runAsServer.py", line 75, in runWriterServer
    await self.writeNewFile()
  File "/home/control/BEATSH5Writer/runAsServer.py", line 97, in writeNewFile
    os.makedirs(filePath)
  File "/usr/local/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 5] Input/output error: '/PETRA/SED/BEATS/IH/testRMRM-20231022T140431'
./start_writer: line 14: 1601149 Quit (core dumped) python runAsServer.py --detector pco --motionStage micos --scanMode continuous
The cause of the problem is excessive resource allocation when TomoPy and SLURM (alone or in combination) assign CPU resources to the reconstruction job. On the workstation this does not happen. Please see a full report below and in: /PETRA/SED/BEATS/IH/scratch/AlHandawi/test/rum_bug_tests
As you will see, the workstation remains ~2 times faster than Rum, but I think we can improve this by fine-tuning the SLURM header and the Python call on Rum (without running into the same problem again!).
Thank you all for the good support.
ml load anaconda/tomopy
export NUMEXPR_MAX_THREADS=96
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1200 1240 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.00002_b/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/ --cor 718.5 --ncore 8 --phase --sdd 20 --pixelsize 0.00065 --energy 16 --alpha 0.00002 --algorithm fbp_cuda_astra
python /PETRA/SED/BEATS/IH/scratch/scripts/BEATS_recon.py /PETRA/SED/BEATS/IH/fiber_wet_below_kink_HR-20230726T171211/fiber_wet_below_kink_HR-20230726T171211.h5 -s 1200 1240 --recon_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/fiber_wet_below_kink_HR-20230726T171211/recon_phase_alpha0.00002_b_WS/ --work_dir /PETRA/SED/BEATS/IH/scratch/AlHandawi/test/ --cor 718.5 --ncore 36 --phase --sdd 20 --pixelsize 0.00065 --energy 16 --alpha 0.00002 --algorithm fbp_cuda_astra
#!/bin/bash
#SBATCH --job-name=BEATS_rec_%j
#SBATCH --output=BEATS_rec_%j.out
#SBATCH --error=BEATS_rec_%j.err
#SBATCH --ntasks=11
#SBATCH --cpus-per-task=8
#SBATCH --time=00:20:00
#SBATCH --partition=gpu
#SBATCH --nodelist=gpunode1
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=2G
# Modules section:
ml load anaconda/tomopy
# Variables section:
export NUMEXPR_MAX_THREADS=96
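On the Python side, a hedged sketch of how ncore could be derived from the actual SLURM allocation instead of being hard-coded, to avoid the oversubscription described above; the helper name and fallback values are illustrative and not part of BEATS_recon.py:

```python
import multiprocessing
import os

def pick_ncore(default=8):
    """Illustrative helper: prefer the SLURM allocation, then the CPUs this
    process may actually use, then a conservative default."""
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus:
        return int(slurm_cpus)
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is not available outside Linux
        return min(default, multiprocessing.cpu_count())

ncore = pick_ncore()
# e.g. tomopy.retrieve_phase(projs, ..., ncore=ncore)
```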