Open bommaritom opened 1 year ago
I came across another instance of this error, where the segmentation fault occurs in the second (penultimate) resolution rather than the final resolution. The environment was the same as in the first instance I reported. I think this is relevant new information, so I am including the terminal output here:
Image parameters read:
x_0 = -3.200000e+01
y_0 = -3.200000e+01
z_0 = -3.200000e+01
N_x = 64
N_y = 64
N_z = 64
Delta_xy = 1.000000e+00
Delta_z = 1.000000e+00
j_xstart_roi = -1
j_ystart_roi = -1
j_zstart_roi = -1
j_xstop_roi = 0
j_ystop_roi = 0
j_zstop_roi = 0
Sinogram parameters read:
N_dv = 128,
N_dw = 128,
Delta_dv = 1.000000e+00,
Delta_dw = 1.000000e+00,
N_beta = 64,
u_s = -1.920000e+02,
u_r = 0.000000e+00,
v_r = 0.000000e+00,
u_d0 = 1.920000e+02,
v_d0 = -6.400000e+01,
w_d0 = -6.400000e+01,
(potentially uninitialized:)
weightScaler_value = -1.000000e+00,
Reconstruction parameters read:
proximal map mode = 0
q = 2.000000e+00
p = 1.200000e+00
T = 1.000000e-01
sigmaX = 7.176227e-03
bFace = 1.666667e-01
bEdge = -1.000000e+00
bVertex = -1.000000e+00
sigma_lambda = 1.000000e+00
is_positivity_constraint = 1
stopThresholdChange_pct = 2.000000e-02
stopThesholdRWFE_pct = 0.000000e+00
stopThesholdRUFE_pct = 0.000000e+00
MaxIterations = 100
relativeChangeMode = meanImage
relativeChangeScaler = 1.000000e-01
relativeChangePercentile = 9.990000e+01
N_G = 2
zipLineMode = 2
numVoxelsPerZiplineMax = 200
numVoxelsPerZipline = 64
numZiplines = 1
weightScaler_estimateMode = None
weightScaler_domain = spatiallyInvariant
weightScaler_value = 1.170822e-03
NHICD_Mode = off
NHICD_ThresholdAllVoxels_ErrorPercent = 8.000000e+01
NHICD_percentage = 1.500000e+01
NHICD_random = 2.000000e+01
verbosity = 1
isComputeCost = 1
************************** Iteration 0 (max. 100) **************************
* Cost = 8.0219145000e+06
* Rel. Update = 0.0000000000e+00 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 7.1882143021e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 7.1882143021e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.7826896161e-02 = 1/56.0950164795
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = 0.0000000000e+00
* time icd update = 5.2969558716e+01 s
* ratioUpdated = 0.0000000000e+00 %
* totalEquits = 0.0000000000e+00
******************************************************************************
************************** Iteration 1 (max. 100) **************************
* Cost = 1.6931286250e+06
* Rel. Update = 1.9823394775e+01 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 3.2219331264e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 3.2219331264e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 3.5815148149e-03 = 1/279.2114562988
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.4054439545e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 1.0000000000e+00
******************************************************************************
************************** Iteration 2 (max. 100) **************************
* Cost = 8.8656956250e+05
* Rel. Update = 8.8022689819e+00 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 2.2821221352e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 2.2821221352e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.7968486063e-03 = 1/556.5299072266
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.5237743378e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 2.0000000000e+00
******************************************************************************
************************** Iteration 3 (max. 100) **************************
* Cost = 7.2291850000e+05
* Rel. Update = 4.4582271576e+00 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 2.0467526913e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 2.0467526913e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.4453212498e-03 = 1/691.8876953125
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.6289184570e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 3.0000000000e+00
******************************************************************************
************************** Iteration 4 (max. 100) **************************
* Cost = 6.8048525000e+05
* Rel. Update = 2.3605141640e+00 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 1.9835426807e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 1.9835426807e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.3574281475e-03 = 1/736.6872558594
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.7295703888e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 4.0000000000e+00
******************************************************************************
************************** Iteration 5 (max. 100) **************************
* Cost = 6.6586556250e+05
* Rel. Update = 1.4191565514e+00 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 1.9621305466e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 1.9621305466e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.3282794971e-03 = 1/752.8535766602
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.8391109467e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 5.0000000000e+00
******************************************************************************
************************** Iteration 6 (max. 100) **************************
* Cost = 6.6022262500e+05
* Rel. Update = 8.8864797354e-01 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 1.9540244341e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 1.9540244341e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.3173272600e-03 = 1/759.1127929688
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 5.9427299500e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 6.0000000000e+00
******************************************************************************
************************** Iteration 7 (max. 100) **************************
* Cost = 6.5757468750e+05
* Rel. Update = 6.0869717598e-01 % (threshold = 1.9999999553e-02 %)
* RWFE = ||e||_W/||y||_W = 1.9502706528e+00 % (threshold = 0.0000000000e+00 %)
* RUFE = ||e|| / ||y|| = 1.9502706528e+00 % (threshold = 0.0000000000e+00 %)
* ----------------------------------------------------------------------------
* 1/M ||e||^2_W = 1.3122708770e-03 = 1/762.0377807617
* weightScaler_value = 1.1708220700e-03 = 1/854.1007690430
* ----------------------------------------------------------------------------
* voxelsPerSecond = inf
* time icd update = 6.0494937897e+01 s
* ratioUpdated = 1.0000000000e+02 %
* totalEquits = 7.0000000000e+00
******************************************************************************
Segmentation fault: 11
Thanks for spotting this, Marco. Could you try one thing: change lines 445 and 617 of cone3D.py to
num_threads = max(1, cpu_count(logical=False)-1)
and see whether that fixes the issue?
Previously I observed that the memory assignment of the last thread occasionally failed on clusters where I have to share memory/CPU usage with other users, but I never observed this on a PC.
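The idea behind the one-line change can be sketched as a small helper. This is only an illustration, not the code in cone3D.py: it assumes that `cpu_count(logical=False)` comes from the third-party `psutil` package, and the name `pick_num_threads` and the `reserve` parameter are hypothetical.

```python
import os


def pick_num_threads(reserve=1):
    """Return a thread count that leaves `reserve` cores free, never below 1."""
    try:
        # Assumption: psutil is the source of cpu_count(logical=False).
        import psutil
        cores = psutil.cpu_count(logical=False)
    except ImportError:
        cores = None
    if cores is None:
        # psutil is missing or could not determine the physical core count,
        # so fall back to the logical core count reported by the stdlib.
        cores = os.cpu_count() or 1
    return max(1, cores - reserve)


num_threads = pick_num_threads()
```

Leaving one core unused means the last thread is never requested on a machine whose cores are all busy, and the `max(1, ...)` guard keeps the count valid on a single-core machine.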
I'm at a conference this week. Maybe we could meet up and work on a permanent fix after I come back next week.
Sure. I will let you know if the error still comes up. However, due to the random nature of the error it may be hard to tell if it is actually fixed.
Perhaps just run it multiple times with and without the change and observe the behavior for the time being.
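Because the crash is intermittent, counting crashes over repeated runs gives a rough comparison between the two versions. The following is a minimal sketch for POSIX systems (macOS, Linux); the helper name `count_segfaults` is hypothetical, and the demo path in the comment is taken from this thread.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs=10):
    """Run `cmd` `runs` times and return how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a return code of -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes


# Example call (run once with the change and once without it):
# count_segfaults([sys.executable, "demo/demo_3D_shepp_logan.py"], runs=10)
```

A small number of clean runs does not prove the fix, since the failure is random, but a clear difference in crash counts between the two versions is still useful evidence.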
You can actually use gdb and trace the C code to see where the memory error occurs.
For example, in the case where the last thread fails to be assigned, you can use gdb to print the value of "parallelAux->partialTheta" at line 821 of icd.c, and you will see something like this:
(In this case I'm using a cluster node with 24 cores, and you can see that the last core is not assigned correctly since parallelAux->partialTheta[23] cannot be accessed).
Anyway, we can discuss the details next week. For now, maybe just test the code without getting into the C code.
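One way to get some information without going into the C code is Python's standard `faulthandler` module, which dumps the Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is an assumption about how it could be used here, not something from the repository; the log file name is hypothetical. It shows which Python call was active at the crash, not the failing C line.

```python
import faulthandler

# Keep the file object open for the lifetime of the process:
# faulthandler writes to its file descriptor at crash time.
crash_log = open("segfault_traceback.log", "w")

# Dump the Python traceback of all threads to the log on
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable(file=crash_log, all_threads=True)

# ... import and run the reconstruction as usual after this point ...
```

The same handler can be switched on without editing the script by starting the interpreter with `python -X faulthandler`, in which case the traceback goes to stderr.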
Hey Diyu,
I implemented your suggested changes and ran demo_3D_shepp_logan.py 10 times in a row on master without the segmentation fault.
I'm trying to learn how to use gdb. I run the command
> sudo gdb --args python demo/demo_3D_shepp_logan.py
The code hits a breakpoint at (what looks like) the entry point to demo/demo_3D_shepp_logan.py, but I'm having trouble stepping through because gdb seems to treat the running of demo/demo_3D_shepp_logan.py as a single action.
In general, I'm still a bit confused about gdb, and I was wondering if you would tell me how you have it set up. How do you compile your C code to use gdb, and how do you start it to debug the C code called from Python?
Some information:
> uname -a
Darwin MacBook-Air-6 18.7.0 Darwin Kernel Version 18.7.0: Tue Jun 22 19:37:08 PDT 2021; root:xnu-4903.278.70~1/RELEASE_X86_64 x86_64
> python --version
Python 3.8.13
> gdb --version
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Hi Marco, let's take this offline. I'll send you a separate email.
Hi all,
I ran demo_3D_shepp_logan.py on the master branch on my 2018 MacBook Air, running Mojave 10.14.6, with 8GB of RAM. The code ran the first two resolutions, and then threw a Segmentation Fault after 13 iterations on the full resolution. I was able to run the code successfully both before and after triggering this error, without having to change anything.
Here is the output for that final resolution: