CERN / TIGRE

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox

FBP filtering on GPU #423

Open tsadakane opened 1 year ago

tsadakane commented 1 year ago

Summary

Implemented FBP filtering on GPU using cuFFT.

See #312

Test

Condition

Python:

Software           Version
Windows 10         64bit
Python (Anaconda)  3.10.4
CUDA               11.8
Visual Studio      2022
GPU                GTX 1060 6GB

MATLAB:

Software           Version
Windows 10         64bit
MATLAB             2022a
CUDA               11.8
Visual Studio      2022
GPU                GTX 1060 6GB

Python

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    fdkout = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=False)
toc = timer()
print("CPU: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    fdkout_gpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=True)
toc = timer()
print("GPU: {}".format(toc-tic))

# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(fdkout, head, ["nRMSE"]))
print("RMSE fdk GPU: ")
print(Measure_Quality(fdkout_gpu, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 3)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(fdkout[geo.nVoxel[0] // 2])
axes[1, 0].imshow(fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(fdkout[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(fdkout_gpu[geo.nVoxel[0] // 2])
axes[1, 1].imshow(fdkout_gpu[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(fdkout_gpu[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("GPU-CPU")
axes[0, 2].imshow(fdkout_gpu[geo.nVoxel[0] // 2]       - fdkout[geo.nVoxel[0] // 2])
axes[1, 2].imshow(fdkout_gpu[:, geo.nVoxel[1] // 2, :] - fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(fdkout_gpu[:, :, geo.nVoxel[2] // 2] - fdkout[:, :, geo.nVoxel[2] // 2])
plt.show()
# tigre.plotProj(proj)
# tigre.plotImg(fdkout)

MATLAB

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256]);                     

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated');
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',false);
end
toc
disp("GPU")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU=FDK(projections,geo,angles,'usegpufft',true);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU(:,:,50), imgFDK_CPU(:,:,50)- imgFDK_GPU(:,:,50)]);

Discussion

This code takes more time than the CPU version even when the projections are as large as 2048x2048.

  1. I found that https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L109-L113

and

https://github.com/CERN/TIGRE/blob/98fa77f163bde8f24421b2de901e4ccc2a813d99/Common/CUDA/FbpFiltration.cu#L158-L162

consume time, while the FFT and multiplying by the filter in Fourier space are fast enough. These copies are necessary as long as we use a C2C (complex-to-complex) FFT. R2C and C2R may work; see the sketch after this list.

  2. For now, the CUDA code is shared between MATLAB and Python because filtering.m transposes the matrix before calling fft.

  3. Only one GPU is used.

  4. Only one stream is used.
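
As a minimal Python sketch of point 1 (an illustration only, not TIGRE code; the shapes and filter values are stand-ins), real-to-complex transforms let the rows be filtered directly as float32 data, with no packing into a complex buffer:

import numpy as np
from scipy.fft import rfft, irfft

rows, filt_len = 256, 2048                 # hypothetical detector rows / padded row length
proj = np.random.rand(rows, filt_len).astype(np.float32)
filt = np.random.rand(filt_len // 2 + 1).astype(np.float32)  # stand-in ramp filter (rfft length)

fproj = rfft(proj, axis=1)                 # R2C: real input, no complex packing needed
fproj *= filt                              # apply the filter in Fourier space
proj = irfft(fproj, n=filt_len, axis=1)    # C2R: straight back to real rows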

tsadakane commented 1 year ago

R2C and C2R may work.

Worked. (a39e6b2)

Results of the Python test code

Conditions

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10

Before

CPU: 24.03577409999707 s
GPU: 28.770307099999627 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072768954007

After

CPU: 24.188547600002494 s
GPU: 12.808391700003995 s
RMSE fdk CPU: 526.5072514306678
RMSE fdk GPU: 526.5072259659336

tsadakane commented 1 year ago

@AnderBiguri Do you think we should keep the option to use or not use the GPU for FBP filtering in the filtering and FDK functions?

AnderBiguri commented 1 year ago

@tsadakane Fantastic work, honestly!

I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?

I will test this on my side on a couple of machines (old/new) and report speeds (sometime next week). I think if we can more or less verify that the filtering is always faster on GPU, we can just remove the CPU option from FBP/FDK. We can always leave the MATLAB/Python code there, deprecated, so people can read what the CUDA part does in an easier form. What do you think?

tsadakane commented 1 year ago

@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.

I am actually surprised that the GPU is not much faster, but the speedup achieved is still quite significant. Perhaps you are running on a CPU with many cores?

The CPU is an i7-11700 (8 cores, 16 threads). I found that only about 2 s of the 12.8 s GPU time went to creating and destroying the FFT "plan"s plus the FFT, filter multiplication, and IFFT; another 7 s is consumed before and after calling the host function apply_filtration in FbpFiltration.cu.

AnderBiguri commented 1 year ago

@tsadakane fantastic, I think it's not a bad idea; it will live in git's history anyway, so you are right. I will test on a few different machines to ensure we can reliably say the GPU is faster all the time.

In your timing comments, are you saying that most of the time (7 s) is on the FFT maths, and the rest is memory and context management?

AnderBiguri commented 1 year ago

On my i7-7700HQ/GTX 1070 laptop, the GPU is often slower:

MATLAB:

geo.nDetector 256^2, nangles=100, niter=10:

CPU: Elapsed time is 6.132782 seconds.
GPU: Elapsed time is 8.440495 seconds.

geo.nDetector 512^2, nangles=100, niter=10:

CPU: Elapsed time is 16.697335 seconds.
GPU: Elapsed time is 18.622791 seconds.

geo.nDetector 1024^2, nangles=100, niter=10:

CPU: Elapsed time is 54.229799 seconds.
GPU: Elapsed time is 51.912400 seconds.

I am still quite surprised by this performance; I really did think it would be much faster. I'll try to dig a bit further. I am not sure whether there is something extra we can optimize in the code that we are not aware of, or whether my assumption that it would be much faster is seriously flawed.

tsadakane commented 1 year ago

In your timing comments, are you saying that most of the time (7 s) is on the FFT maths, and the rest is memory and context management?

Sorry for the unclear English. I meant:

(*) In apply_filtration, commenting out the lines that create and destroy the "plan" (which NVIDIA calls the FFT object) and the lines in the curly brackets, the elapsed time dropped by 2 s, from 12 s to 10 s. Of those 2 s, 0.7 s is for the FFT, filter multiplication, and IFFT, and 1.4 s is for creating/destroying the plan.

(**) Returning at the beginning of apply_filtration, the time dropped from 12 s to 7 s.
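
As an illustration of the plan cost (a sketch using CuPy, an assumption for exposition only; TIGRE's apply_filtration calls cuFFT directly from CUDA), building the FFT plan once and reusing it across batches avoids the per-call create/destroy overhead:

import cupy as cp
from cupyx.scipy.fft import get_fft_plan

x = cp.random.rand(256, 2048, dtype=cp.float32)    # hypothetical batch of padded rows
plan = get_fft_plan(x, axes=-1, value_type='R2C')  # create the cuFFT plan once

for _ in range(100):                               # reuse the same plan for every batch
    with plan:
        y = cp.fft.rfft(x, axis=-1)                # no plan creation/destruction here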

AnderBiguri commented 1 year ago

@tsadakane I see, thanks. So if we were to try to optimize this, we'd need to look at the Python<->CUDA interoperability. I need to sit down and read the code more carefully, as this overhead seems surprisingly high. What do you think?

tsadakane commented 1 year ago

I have no idea so far.

tsadakane commented 1 year ago

I wrote a multi-GPU version and a cudaMemcpyAsync version, but as expected, neither had any effect.

tsadakane commented 1 year ago

Just removing the padding from the filtering function in filtering.py saved 3 s, so I moved the padding into the CUDA function. This reduces the amount of memory to transfer. For efficient strided data copies I used cudaMemcpy2D; the GPU path is now three times faster than the CPU path (a sketch of the idea follows the results below).

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
CPU: 23.939920199998596
GPU: 7.977493500002311
RMSE fdk CPU: 
526.5072514306678
RMSE fdk GPU: 
526.5072514306678
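
A minimal CuPy sketch of the padding-on-device idea (hypothetical shapes; the actual change is in the CUDA code, where cudaMemcpy2D performs the strided copy in a single call):

import numpy as np
import cupy as cp

rows, width, filt_len = 1024, 1024, 2048       # hypothetical rows / row width / padded length
proj_host = np.random.rand(rows, width).astype(np.float32)

# Transfer only the unpadded rows and zero-pad on the device, instead of
# zero-padding on the host and transferring filt_len-wide rows.
padded = cp.zeros((rows, filt_len), dtype=cp.float32)
pad = (filt_len - width) // 2
padded[:, pad:pad + width] = cp.asarray(proj_host)  # H2D copy of the small array only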

filtering.py will become:

def filtering(proj, geo, angles, parker, verbose=False, use_gpu=True, gpuids=None):
    if parker:
        proj=parkerweight(proj.transpose(0,2,1),geo,angles,parker).transpose(0,2,1)

    filt_len=max(64,2**nextpow2(2*max(geo.nDetector)))
    ramp_kernel=ramp_flat(filt_len)

    d=1
    filt=filter(geo.filter,ramp_kernel[0],filt_len,d,verbose=verbose)

    padding = int((filt_len-geo.nDetector[1])//2 )
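    # As written, scale_factor combines the geometry factor DSD/DSO, the
    # angular integration step 2*pi/len(angles), and the ramp-filter
    # normalization 1/(4*dDetector[0]).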
    scale_factor = (geo.DSD[0]/geo.DSO[0]) * (2 * np.pi/ len(angles)) / ( 4 * geo.dDetector[0] ) 
    if use_gpu:
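        # Filter the projections in fixed-size bundles so each call into the
        # CUDA routine handles a bounded amount of data.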
        bundle_size = 32 #len(gpuids)
        n_bundles = (angles.shape[0]+bundle_size-1)//bundle_size
        for idx_bundle in range(0,n_bundles):
            bundle_size_actual = bundle_size if idx_bundle != n_bundles-1 else angles.shape[0]-idx_bundle*bundle_size
            idx_begin = idx_bundle*bundle_size
            idx_end = idx_bundle*bundle_size+bundle_size_actual
            proj[idx_begin:idx_end] = fbpfiltration.apply2(proj, idx_begin, idx_end, filt, scale_factor, gpuids) 

    else:
        filt=np.kron(np.ones((np.int64(geo.nDetector[0]),1),dtype=np.float32),filt)

        # filter 2 projections at a time by packing them into a complex container
        fproj=np.empty((geo.nDetector[0],filt_len),dtype=np.complex64)
        for i in range(0,angles.shape[0]-1,2):
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[i]
            fproj.imag[:,padding:padding+geo.nDetector[1]]=proj[i+1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=ifft(fproj,axis=1)

            proj[i]=fproj.real[:,padding:padding+geo.nDetector[1]] * scale_factor
            proj[i+1]=fproj.imag[:,padding:padding+geo.nDetector[1]] * scale_factor

        #if odd number of projections filter last solo
        if angles.shape[0] % 2:
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[angles.shape[0]-1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=np.real(ifft(fproj,axis=1))     
            proj[angles.shape[0]-1]=fproj[:,padding:padding+geo.nDetector[1]] * scale_factor

    return proj

I'm worried that by moving the padding into CUDA, the "if GPU" branch is no longer structured in parallel with the CPU branch. What do you think?

I have not yet run this modification on MATLAB and have not yet committed it.

AnderBiguri commented 1 year ago

@tsadakane interesting!

Sorry, I am a bit busy this week, so I don't have time to look at it with the detail it deserves; perhaps next week. I intend to have a serious look at it as soon as I can.

tsadakane commented 1 year ago

About aa51ec6

"CPU" uses CPU for convolution. In "GPU 1", the convolution is processed in a GPU, even if gpuids includes multiple GPUs. (=a39e6b2) In "GPU 2", the padding and convolution are processed in GPUs.

Tests

Python

Test code

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter

tic = timer()
for _ in range(niter):
    out_cpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=0)
toc = timer()
print("CPU  : {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu1 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=1)
toc = timer()
print("GPU 1: {}".format(toc-tic))

tic = timer()
for _ in range(niter):
    out_gpu2 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=2)
toc = timer()
print("GPU 2: {}".format(toc-tic))

# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(out_cpu, head, ["nRMSE"]))
print("RMSE fdk GPU 1: ")
print(Measure_Quality(out_gpu1, head, ["nRMSE"]))
print("RMSE fdk GPU 2: ")
print(Measure_Quality(out_gpu2, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 5)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(out_cpu[geo.nVoxel[0] // 2])
axes[1, 0].imshow(out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(out_gpu1[geo.nVoxel[0] // 2])
axes[1, 1].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("FDK FFT on GPU 2")
axes[0, 2].imshow(out_gpu2[geo.nVoxel[0] // 2])
axes[1, 2].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2])
axes[0, 3].set_title("GPU1-CPU")
axes[0, 3].imshow(out_gpu1[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 3].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 3].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 4].set_title("GPU2-CPU")
axes[0, 4].imshow(out_gpu2[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 4].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 4].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
plt.show()

Result

Software        Version
Windows 10      64bit
Python (Intel)  3.9.15
CUDA            11.8
Visual Studio   2022
GPU             2x NVIDIA GeForce RTX 2080 Ti
0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce GTX 1070
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1]}
proj size = (100, 1024, 1024)
niter = 10
[path to tigre working dir.]\Python\tigre\utilities\filtering.py:80: RuntimeWarning: divide by zero encountered in true_divide
  h[odd] = -1 / (np.pi * nn[odd]) ** 2
CPU  : 35.7683762
GPU 1: 19.560339300000003
GPU 2: 12.197235299999996
RMSE fdk CPU: 
526.5050614590306
RMSE fdk GPU 1: 
526.504934134818
RMSE fdk GPU 2: 
526.504934134818

MATLAB

Software        Version
Windows 10      64bit
MATLAB          2022a
CUDA            11.8
Visual Studio   2022
GPU             2x NVIDIA GeForce RTX 2080 Ti

Test code

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256], "nDetector", [1024;1024]);                     

%% GPUs
gpuids = GpuIds('NVIDIA GeForce RTX 2080 Ti');
disp(gpuids);
disp(gpuids.devices);

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load head phantom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated', 'gpuids', gpuids);
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',0, 'gpuids', gpuids);
end
toc
disp("GPU 1")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU1=FDK(projections,geo,angles,'usegpufft',1, 'gpuids', gpuids);
end
toc
disp("GPU 2")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU2=FDK(projections,geo,angles,'usegpufft',2, 'gpuids', gpuids);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU1(:,:,50), imgFDK_GPU2(:,:,50), abs(imgFDK_GPU1(:,:,50) - imgFDK_CPU(:,:,50)), abs(imgFDK_GPU2(:,:,50)- imgFDK_CPU(:,:,50))]);
%imshow([imgFDK_GPU(:,:,2)]);

Result

  GpuIds with properties:

       name: 'NVIDIA GeForce RTX 2080 Ti'
    devices: [0 1]
   0   1
ans = 
    "size of projections = [1024 1024 100]"
ans = 
    "niter = 10"
CPU
Elapsed time is 35.826591 seconds.
GPU 1
Elapsed time is 36.107104 seconds.
GPU 2
Elapsed time is 29.146955 seconds.
tsadakane commented 1 year ago

The table below summarizes the results above. The Python result is acceptable to me, but the MATLAB result is hard to understand. At first I suspected that only one GPU was being used due to a bug, but when I checked, deviceCount in the apply_filtration2 function was 2.

                               Python [s]  MATLAB [s]
CPU                                  35.8        35.8
GPU 1 (Filtering only)               19.6        36.1
GPU 2 (Padding and Filtering)        12.2        29.1

Here, two GPUs are used.

tsadakane commented 1 year ago

I ran the Python test program above after adding gpuids.devices.pop(-1) to use only one GPU.

Python                         2x GPU [s]  1x GPU [s]
CPU                                  35.9        33.8
GPU 1 (Filtering only)               19.2        16.8
GPU 2 (Padding and Filtering)        12.1        10.4

It seems that the overhead of the copying is larger than the benefit of doubling the GPU resources.

AnderBiguri commented 1 year ago

@tsadakane that makes sense. Still surprised about the times, but I think it's my own false assumption of how fast it should be. Still really good :)

AnderBiguri commented 1 year ago

I am also a bit baffled by the MATLAB slowness; maybe something to do with extra copies? Let me check.

tsadakane commented 1 year ago

I think the MATLAB code (copies and) transposes the matrix; the fastest-varying axis in memory is v in MATLAB while in Python it is u, but the CUDA code makes no distinction between MATLAB and Python.
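
For illustration (a sketch with made-up shapes, not TIGRE code): NumPy arrays are C-ordered (last axis varies fastest) while MATLAB arrays are Fortran-ordered (first axis varies fastest), so the same projection stack has a different fastest-varying detector axis in the two wrappers.

import numpy as np

proj_c = np.zeros((100, 1024, 1024), dtype=np.float32)  # C order, as in the Python wrapper
proj_f = np.asfortranarray(proj_c)                      # Fortran order, as MATLAB lays it out

# Strides show which axis is contiguous in memory.
print(proj_c.strides)  # (4194304, 4096, 4): last axis contiguous
print(proj_f.strides)  # (4, 400, 409600): first axis contiguous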

AnderBiguri commented 1 year ago

@tsadakane ah! Then we may be able to do the same as we do with the other CUDA code, the #if IS_FOR_MATLAB_TIGRE kind of thing? I am not entirely sure where this would be applied in the new CUDA code; happy to have a look if you want, but if you know where it happens in CUDA, feel free to do it yourself.

Thanks again, your work is really appreciated!

tsadakane commented 1 year ago

I can do it, but I am not sure it will help, because CUDA is not good at transposing.

It was not necessary to transpose the projections in Ax and Atb, because they are copied to a 3D texture, where there is no difference in memory-access efficiency between the u and v axes. On the other hand, the convolution must be applied along the u direction, so a transpose in the memory layout is necessary.

That said, I think that from the point of view of symmetry, the permute line should be moved into the filtering function if possible, i.e. https://github.com/CERN/TIGRE/blob/f9a38edfa2eecabb724f1d9759eefdb041bbaf8d/MATLAB/Algorithms/FDK.m#L61-L77 =>

%% Weight
% proj=permute(proj,[2 1 3]);  %%%%%%%%%% Delete
for ii=1:size(angles,2)
    us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
    vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
    [uu,vv] = meshgrid(us,vs); % detector

    % Create weight according to each detector element
    w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);

    % Multiply the weights with projection data
    proj(:,:,ii) = w.*proj(:,:,ii);   %%%%%%%%%% Transpose
end
%% Fourier transform based filtering
%proj=permute(proj,[2 1 3]);   %%%%%%%%%% Add this line at the beginning of the filtering function.
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here

This modification makes the code a little more beautiful, but it changes the specification of the filtering function (the first argument, proj, is no longer transposed).

What do you think?

AnderBiguri commented 1 year ago

@tsadakane I see what you mean with the filtering. My convolution/FFT maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so that it is equivalent to transposing the projections? That would be the ideal solution, if possible, right?

About moving the transpose inside filtering: 100% agree, it is much more elegant.

AnderBiguri commented 1 year ago

Hi @tsadakane, sorry, I have been very busy with life. How is this going? Do you need any help? Is it done?

tsadakane commented 1 year ago

Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose the matrices, but I am not satisfied with the result and am trying several granularities of processing to pass to mex/CUDA.

AnderBiguri commented 1 year ago

Apparently, giving the fft and ifft calls in Python the workers argument can accelerate them: fproj = fft(fproj, axis=1, workers=cores), with cores = os.cpu_count(). I haven't tried it myself.
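
A minimal sketch of that suggestion (assumed shapes; note that the workers argument belongs to scipy.fft, not numpy.fft):

import os
import numpy as np
from scipy.fft import fft, ifft

cores = os.cpu_count()
fproj = np.random.rand(1024, 2048).astype(np.complex64)  # stand-in packed projections
filt = np.random.rand(2048).astype(np.float32)           # stand-in ramp filter

fproj = fft(fproj, axis=1, workers=cores)   # threaded forward FFT
fproj *= filt                               # filter in Fourier space
fproj = ifft(fproj, axis=1, workers=cores)  # threaded inverse FFT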

tsadakane commented 1 year ago

Under the condition of

proj size = (100, 1024, 1024)
nVoxel = [256, 512, 512]

On an 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz (8 physical cores + hyper-threading => cores=16) + GTX 1060

0, CPU: Padding & FFT       : 26.59433160000026 s @ 10 times
1, CPU: Multi-Workers       : 15.725976400000036 s @ 10 times

Not bad.

AnderBiguri commented 1 year ago

I think the GPU implementation can be valuable too. I think the fastest FDK would be one that interleaves filtering and backprojection. I read a paper about it at some point.

AnderBiguri commented 4 months ago

Hi @tsadakane, what do you think about all this? Worth merging it for the cases when it's faster?

tsadakane commented 4 months ago

Hi, @AnderBiguri ,

It's difficult to answer, but I'd like to say no.

AnderBiguri commented 4 months ago

@tsadakane makes sense!

Flexibility is a bit less important because we can add a flag with CPU as the default, but I think the rest of the points make total sense. Indeed, this would make sense if there were an FFT+backproject in the same CUDA call, but then the multi-GPU code gets complex.

It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code?