Get struck when change parallel to cone in single-GPU

GreameLee commented 5 months ago

I can run iterative algorithms such as ossart 2D parallel beam CT reconstruction smooth on ubuntu. But when it comes for cone-beam( for 2d it is fanbeam) The running just gets stuck and prints nothing, even the estimation time. This is the code:

import tigre
import numpy as np
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from scipy.io import loadmat
import matplotlib.pyplot as plt
import torch

geo = tigre.geometry()
# VARIABLE                                   DESCRIPTION                    UNITS
# -------------------------------------------------------------------------------------
# Distances
geo.DSD = 1536  # Distance Source Detector      (mm)
geo.DSO = 1000 # Distance Source Origin        (mm)
# Image parameters
geo.nVoxel = np.array([1, 256, 256])  # number of voxels              (vx)
geo.sVoxel = np.array([1, 256, 256])  # total size of the image       (mm)
geo.dVoxel = geo.sVoxel / geo.nVoxel  # size of each voxel            (mm)
print(geo.dVoxel)
# Detector parameters
geo.nDetector = np.array([1, 1512])  # number of pixels              (px)
geo.dDetector = np.array([geo.dVoxel[0], 0.8])  # size of each pixel            (mm)
# geo.dDetector = np.array([1, 0.1])
geo.sDetector = geo.nDetector * geo.dDetector  # total size of the detector    (mm)
# Offsets
geo.offOrigin = np.array([0, 0, 0])  # Offset of image from origin   (mm)
geo.offDetector = np.array([0, 0])  # Offset of Detector            (mm)
# MAKE SURE THAT THE DETECTOR PIXELS SIZE IN V IS THE SAME AS THE IMAGE!

geo.mode = "cone"

#%% Define angles of projection and load phatom image

angles = np.linspace(0, 2 * np.pi, 180)

img = sample_loader.load_head_phantom(geo.nVoxel)
limited_angle = np.linspace(0, 2*np.pi, 90)
limited_projection = tigre.Ax(img, geo, limited_angle)
tigre.plotSinogram(limited_projection, 0)

niter=500
print("start ossart_tv")
rec_img_ossart_tv = algs.ossart_tv(limited_projection, geo, limited_angle, niter)
plt.imshow(rec_img_ossart_tv.squeeze(), cmap='gray')
plt.savefig('ossart_tv_90.png', bbox_inches='tight', pad_inches=0)
plt.axis('off')

python version: 3.9
OS:Ubuntu 18.04.6 LTS
CUDA version:12.3

AnderBiguri commented 5 months ago

That is strange! if you do it for 3D, does it work well? Can you try to figure out where does it get stuck?

Probably related to #552

GreameLee commented 5 months ago

I try to make a breakpoint and it got stuck in this column:

194        self.set_w()

of iterative_recon_alg.py

AnderBiguri commented 5 months ago

@GreameLee and inside there, do you know where?

GreameLee commented 5 months ago

@AnderBiguri yes, I step inside and it is in

222        W = Ax(
223            np.ones(geox.nVoxel, dtype=np.float32), geox, self.angles, "Siddon", gpuids=self.gpuids
224        )

AnderBiguri commented 5 months ago

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

GreameLee commented 5 months ago

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

Do you mean that replace all geox with geo in iterative_recon_alg.py? Cause there is an assignment about geox before this line:

216     geo = copy.deepcopy(self.geo)

I think geox is geo already

AnderBiguri commented 5 months ago

Yes, in particular in the set_w function.

AnderBiguri commented 5 months ago

I meant:

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

GreameLee commented 5 months ago

Yes, I tried as

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

it but still got struck

AnderBiguri commented 5 months ago

Hum, confusing. I don't really know why it happens then.... I will investigate, but its hard, as I can not make it happen in any of my 5 computers.

GreameLee commented 5 months ago

Are they based on Linux Ubuntu? @AnderBiguri I run it in Ubuntu and it gets struck again in the same line. This issue won't happen in Windows

GreameLee commented 5 months ago

I go deep inside that step and go into the Ax function in utilities, and find it got struck in gpu.py with this line:

31   def __len__(self):
32           return len(self.devices)

AnderBiguri commented 5 months ago

Hum that is strange! What if you change it to return your_number_of_gpus , however many those are?

GreameLee commented 5 months ago

yes, I edit it as :

def __len__(self):
        return 1

But it still get struck in following line of Ax.py:

35  return _Ax_ext(img, geox, geox.angles, projection_type, geox.mode, gpuids=gpuids)

can I ask what is the "_Ax_ext" function? I see that

from _Ax import _Ax_ext

But I did not see any function named _Ax_ext The strange thing is that before ossarttv, there is tigre.Ax using Ax function and that line goes smooth.

AnderBiguri commented 5 months ago

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

GreameLee commented 5 months ago

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

I think this may happen on supercomputers based on Ubuntu. I tried it on another supercomputer but got struck, too. It is kind of serious issue

AnderBiguri commented 5 months ago

I have used ~15 Ubuntu based machines and I can't reproduce :( It is indeed a serious issue.

I suggest the following: replace set_w() such that it returns geo.sVoxel(1). Its not perfect, but it should make the algorithms work.

GreameLee commented 5 months ago

I set:

def set_w(self):
    self.W = self.geo.sVoxel[1]
    return self.W

But this bug happen when it comes to "self.W[ang_index]" of this part in iterative_recon_alg.py:

 ang_index = self.angle_index[iteration].astype(np.int32)

        self.res += (
            self.lmbda
            * 1.0
            / self.V[iteration]
            * Atb(
                self.W[ang_index]

The error shows like:

IndexError: invalid index to scalar variable.

AnderBiguri commented 5 months ago

Ah yes, sorry I made a mistake. its more or less that but not exactly what I said. I don't have time now to fix it.

If you want to fix it yourself:

Understand what W is in size.
Replace it by something of the same size but of value 1/geo.svoxel[1] I'll try to come back and fix this at some point. Apologies, end of the year gets a bit busy .

GreameLee commented 4 months ago

hello, Ander, @AnderBiguri I try to use Windows to retry this, and interesting, it runs smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happen:

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

In the same codes, for parallel-beam CT, it does not have this error. I guess when using Ax for cone-beam CT. The _Ax_ext makes the data have wrong leakage so the data can not transfer from CPU to GPU

AnderBiguri commented 4 months ago

Are you using this with pytorch?

On Wed, 26 Jun 2024, 03:08 Haodong, @.***> wrote:

hello, Ander, @AnderBiguri https://github.com/AnderBiguri I try to use windows to retry this, and interesting, when it run smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happened:

RuntimeError: CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

— Reply to this email directly, view it on GitHub https://github.com/CERN/TIGRE/issues/560#issuecomment-2190395934, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC2OENF3YVUXJBBNAR7EFOTZJIPC5AVCNFSM6AAAAABJUJYU3WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJQGM4TKOJTGQ . You are receiving this because you were mentioned.Message ID: @.***>

GreameLee commented 4 months ago

Yes, I use tensor_gpu = tensor_cpu.to(device='cuda') to copy tensor from results from tigre

AnderBiguri commented 4 months ago

TIGRE variables you have access too are always in the CPU.

There have been problems in the past relating TIGRE and pytorch (e.g. #509) which we will look into solving

GreameLee commented 4 months ago

I can run cone-beam ossart-tv by running the example code in UBUNTU now.
But when I insert this process in my diffusion model based on Pytorch and tensor in GPU. This reconstruction of ossart-tv will get struck and even I want to use "ctrl+C" to stop it can not be killed. I think I get struck due to the same reason as WINDOWS. And interestingly, all stuff goes smoothly in parallel-beam CT. And I guess this error is same as #509. The _Ax_ext works wrong for cone-beam CT reconstruction. This issue may be because there is some problem when transferring data

AnderBiguri commented 4 months ago

@GreameLee That is actually quite useful information yes.

Just to clarify, TIGRE+pytorch is not suppored yet, as I have not investigated in detail how to make it work. Likely Ubuntu/windows has nothing to do with this problem.

But if it works for parallel and not cone, this is something I can dig into.

GreameLee commented 4 months ago

Look forward to your exploring

AnderBiguri commented 4 months ago

Related: #563

AnderBiguri commented 1 month ago

Hi @GreameLee

I know its been a while, but if you are still playing with this, could you try commenting out this line: https://github.com/CERN/TIGRE/blob/7e8ec5454af09fada9e52fac6d81d91a19a5bbe2/Common/CUDA/Siddon_projection.cu#L584C4-L584C23 , recompiling TIGRE, and then seing if it works now?

AnderBiguri commented 1 month ago

@GreameLee pytorch is not compatible with TIGRE, see demo 25!

CERN / TIGRE

Get struck when change parallel to cone in single-GPU #560