Closed GreameLee closed 1 month ago
That is strange! if you do it for 3D, does it work well? Can you try to figure out where does it get stuck?
Probably related to #552
I try to make a breakpoint and it got stuck in this column:
194 self.set_w()
of iterative_recon_alg.py
@GreameLee and inside there, do you know where?
@AnderBiguri yes, I step inside and it is in
222 W = Ax(
223 np.ones(geox.nVoxel, dtype=np.float32), geox, self.angles, "Siddon", gpuids=self.gpuids
224 )
I see! I will change this function soon. For now, if you give this function geo
instead of geox
, I believe it should work.
I see! I will change this function soon. For now, if you give this function
geo
instead ofgeox
, I believe it should work.
Do you mean that replace all geox with geo in iterative_recon_alg.py? Cause there is an assignment about geox before this line:
216 geo = copy.deepcopy(self.geo)
I think geox is geo already
Yes, in particular in the set_w
function.
I meant:
def set_w(self):
"""
Calculates value of W if this is not given.
:return: None
"""
geo=self.geo
W = Ax(
np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
)
W[W <= min(self.geo.dVoxel / 2)] = np.inf
W = 1.0 / W
setattr(self, "W", W)
Yes, I tried as
def set_w(self):
"""
Calculates value of W if this is not given.
:return: None
"""
geo=self.geo
W = Ax(
np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
)
W[W <= min(self.geo.dVoxel / 2)] = np.inf
W = 1.0 / W
setattr(self, "W", W)
it but still got struck
Hum, confusing. I don't really know why it happens then.... I will investigate, but its hard, as I can not make it happen in any of my 5 computers.
Are they based on Linux Ubuntu? @AnderBiguri I run it in Ubuntu and it gets struck again in the same line. This issue won't happen in Windows
I go deep inside that step and go into the Ax function in utilities, and find it got struck in gpu.py with this line:
31 def __len__(self):
32 return len(self.devices)
Hum that is strange! What if you change it to return your_number_of_gpus
, however many those are?
yes, I edit it as :
def __len__(self):
return 1
But it still get struck in following line of Ax.py:
35 return _Ax_ext(img, geox, geox.angles, projection_type, geox.mode, gpuids=gpuids)
can I ask what is the "_Ax_ext" function? I see that
from _Ax import _Ax_ext
But I did not see any function named _Ax_ext The strange thing is that before ossarttv, there is tigre.Ax using Ax function and that line goes smooth.
Its _Ax_ext
that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.
Its
_Ax_ext
that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.
I think this may happen on supercomputers based on Ubuntu. I tried it on another supercomputer but got struck, too. It is kind of serious issue
I have used ~15 Ubuntu based machines and I can't reproduce :( It is indeed a serious issue.
I suggest the following: replace set_w()
such that it returns geo.sVoxel(1)
. Its not perfect, but it should make the algorithms work.
I set:
def set_w(self):
self.W = self.geo.sVoxel[1]
return self.W
But this bug happen when it comes to "self.W[ang_index]" of this part in iterative_recon_alg.py:
ang_index = self.angle_index[iteration].astype(np.int32)
self.res += (
self.lmbda
* 1.0
/ self.V[iteration]
* Atb(
self.W[ang_index]
The error shows like:
IndexError: invalid index to scalar variable.
Ah yes, sorry I made a mistake. its more or less that but not exactly what I said. I don't have time now to fix it.
If you want to fix it yourself:
1/geo.svoxel[1]
I'll try to come back and fix this at some point. Apologies, end of the year gets a bit busy .hello, Ander, @AnderBiguri I try to use Windows to retry this, and interesting, it runs smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happen:
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
In the same codes, for parallel-beam CT, it does not have this error.
I guess when using Ax for cone-beam CT. The _Ax_ext
makes the data have wrong leakage so the data can not transfer from CPU to GPU
Are you using this with pytorch?
On Wed, 26 Jun 2024, 03:08 Haodong, @.***> wrote:
hello, Ander, @AnderBiguri https://github.com/AnderBiguri I try to use windows to retry this, and interesting, when it run smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happened:
RuntimeError: CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with
TORCH_USE_CUDA_DSA
to enable device-side assertions.— Reply to this email directly, view it on GitHub https://github.com/CERN/TIGRE/issues/560#issuecomment-2190395934, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC2OENF3YVUXJBBNAR7EFOTZJIPC5AVCNFSM6AAAAABJUJYU3WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJQGM4TKOJTGQ . You are receiving this because you were mentioned.Message ID: @.***>
Yes, I use tensor_gpu = tensor_cpu.to(device='cuda') to copy tensor from results from tigre
TIGRE variables you have access too are always in the CPU.
There have been problems in the past relating TIGRE and pytorch (e.g. #509) which we will look into solving
I can run cone-beam ossart-tv by running the example code in UBUNTU now.
But when I insert this process in my diffusion model based on Pytorch and tensor in GPU. This reconstruction of ossart-tv will get struck and even I want to use "ctrl+C" to stop it can not be killed. I think I get struck due to the same reason as WINDOWS.
And interestingly, all stuff goes smoothly in parallel-beam CT. And I guess this error is same as #509.
The _Ax_ext
works wrong for cone-beam CT reconstruction. This issue may be because there is some problem when transferring data
@GreameLee That is actually quite useful information yes.
Just to clarify, TIGRE+pytorch is not suppored yet, as I have not investigated in detail how to make it work. Likely Ubuntu/windows has nothing to do with this problem.
But if it works for parallel and not cone, this is something I can dig into.
Look forward to your exploring
Related: #563
Hi @GreameLee
I know its been a while, but if you are still playing with this, could you try commenting out this line: https://github.com/CERN/TIGRE/blob/7e8ec5454af09fada9e52fac6d81d91a19a5bbe2/Common/CUDA/Siddon_projection.cu#L584C4-L584C23 , recompiling TIGRE, and then seing if it works now?
@GreameLee pytorch is not compatible with TIGRE, see demo 25!
I can run iterative algorithms such as ossart 2D parallel beam CT reconstruction smooth on ubuntu. But when it comes for cone-beam( for 2d it is fanbeam) The running just gets stuck and prints nothing, even the estimation time. This is the code: