CERN / TIGRE

TIGRE: Tomographic Iterative GPU-based Reconstruction Toolbox
BSD 3-Clause "New" or "Revised" License

Projection computational time mismatch #296

Closed AnderBiguri closed 3 years ago

AnderBiguri commented 3 years ago

One would expect that the MATLAB and Python versions perform very similarly when called directly. This is indeed true with Ax().

For the MATLAB geometry

geo=defaultGeometry('nvoxel',[1024 1024 1024],'nDetector',[1024 1024])

I can see that Ax() takes ~9 s in both MATLAB and Python, yet Atb() takes around 6 s in MATLAB but closer to 9 s in Python. This is quite a large difference, even though both use the same CUDA source.
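
A minimal Python sketch of that timing check (the angle count is not stated above, so the 100 angles here are only illustrative; the geometry setup follows the scripts posted later in this thread):

import time

import numpy as np
import tigre

# Geometry roughly matching the MATLAB call above
geo = tigre.geometry(mode="cone", nVoxel=np.array([1024, 1024, 1024]), default=True)
geo.nDetector = np.array([1024, 1024])
geo.dDetector = geo.sDetector / geo.nDetector  # pixel size derived from the total detector size

angles = np.linspace(0, 2 * np.pi, 100, endpoint=False, dtype=np.float32)  # illustrative angle count
head = np.ones(geo.nVoxel, dtype=np.float32)

t0 = time.time()
proj = tigre.Ax(head, geo, angles)   # forward projection
print("Ax  [s]:", time.time() - t0)

t0 = time.time()
vol = tigre.Atb(proj, geo, angles)   # backprojection
print("Atb [s]:", time.time() - t0)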

AnderBiguri commented 3 years ago

I have not started checking this issue, but 2 causes that would explain it:

The first one is more likely, I think; the second one is the easy fix.

tsadakane commented 3 years ago

I modified _Atb.pyx as

    time_start = time.time()   #  ADDED
    if cone_beam:
        if krylov_proj:
            cuda_raise_errors(voxel_backprojection2(...
        else:
            cuda_raise_errors(voxel_backprojection(...
    else:
        cuda_raise_errors(voxel_backprojection_parallel(...
    time_end = time.time()-time_start  #  ADDED
    print("BP: {}".format(time_end))  #  ADDED

to measure the time of the BP function itself. In test.py, the Atb call is wrapped in a similar tic-toc. With nSize = 1024 (nVoxel, nDetector) and nangles = 32, the result was:

BP: 4.323969125747681   <<== cuda voxel_backprojection
timeAtb [s] = 4.436642646789551  <<=== Python Atb function

So the Python Atb wrapper does not add extra time on top of the CUDA backprojection. Your first thought seems to be more reasonable.

tsadakane commented 3 years ago

Another experiment is a comparison with MATLAB:

Source

MATLAB


%%
clear;
close all;
%%
gpuids=GpuIds();
gpuids.devices=[0];
disp(gpuids)
%%
nSize = 1024;
geo=defaultGeometry('nvoxel',[nSize; nSize; nSize],'mode', 'cone');
geo.nDetector=[nSize; nSize];   
geo.dDetector=geo.sDetector./geo.nDetector; % size of each detector pixel  (mm)

nangles = 32;
angles=linspace(0, 2*pi,nangles+1);
angles(length(angles))=[];

fprintf('Start! nSize=%d. nangles=%d\n', nSize, nangles);

timeResults = [];
nTests = 3;
for iter = 1:nTests
    %%
    disp('Ones');
    tic
    head = ones(geo.nVoxel.', 'single');
    timeOnes = toc
    %%
    disp('Ax');
    tic
    proj = Ax(head, geo, angles,'gpuids',gpuids);
    timeAx = toc
    %%
    disp('Atb');
    tic
    backproj = Atb(proj, geo, angles,'gpuids',gpuids);
    timeAtb= toc
    timeResults(iter, 1) = timeOnes;
    timeResults(iter, 2) = timeAx;
    timeResults(iter, 3) = timeAtb;
    clear head;
    clear proj;
    clear backproj;
end
fprintf('nSize=%d. nangles=%d\n', nSize, nangles);
timeResults

Python

#%%
import numpy as np
import time
import tigre
from tigre.utilities import sample_loader
import tigre.utilities.gpu as gpu
import pandas as pd

#%%
gpuids = gpu.getGpuIds(gpu.getGpuNames()[0])
gpuids.devices=[0]

nSize = 1024
geo = tigre.geometry(mode="cone",
                     nVoxel=np.array([nSize, nSize, nSize]),
                     default=True)
geo.nDetector = np.array([nSize, nSize])
geo.dDetector = geo.sDetector / geo.nDetector

nangles = 32
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

df_time = pd.DataFrame()
n_tests = 3
for iter in range(n_tests):

    print("Start! nSize={}, nangles = {}".format(nSize, nangles))
    timeStart = time.time()
    head = np.ones(geo.nVoxel, dtype=np.float32)
    timeHead = time.time()-timeStart
    print("timeHead  [s] = {}".format(timeHead))

    timeStart = time.time()
    proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
    timeAx = time.time()-timeStart
    print("timeAx  [s] = {}".format(timeAx))

    timeStart = time.time()
    backproj = tigre.Atb(proj, geo, angles, gpuids=gpuids)
    timeAtb = time.time()-timeStart

    print("timeAtb [s] = {}".format(timeAtb))

    df_time = df_time.append({"0.head": timeHead, "1.ax": timeAx, "2.atb": timeAtb}, ignore_index=True)
    del head
    del proj
    del backproj

print("nSize={}, nangles = {}".format(nSize, nangles))
print(df_time)

Result

Matlab

nSize=1024. nangles=32

timeResults =

    0.7054    4.8119    4.0001
    0.7478    4.6607    3.2516
    0.7185    4.6761    3.2684

Python

nSize=1024, nangles = 32
     0.head      1.ax     2.atb
0  0.684587  5.481678  4.430153
1  0.713924  5.349882  4.247828
2  0.747751  5.788095  4.253292


tsadakane commented 3 years ago

On another machine, with the same test program and nSize=1024, nangles=512, the result was different...

Result 2

Matlab

nSize=1024. nangles=512

timeResults =

    0.5846   10.3175    6.1847
    0.7438   10.2549    6.2533
    0.6993   10.2854    6.3018

Python

nSize=1024, nangles = 512
     0.head      1.ax      2.atb
0  0.600384  6.325863   9.892848
1  0.651254  6.058912   9.941819
2  0.695131  6.094223  10.063829


AnderBiguri commented 3 years ago

Thanks a lot for the detailed tests!

This is very confusing. Can you try the same size on both machines? The fact that the times differ between machines makes some sense, as the multi-GPU code kicks in when there is not enough GPU memory to compute it all at once, but it is very strange that the MATLAB and Python versions change so much relative to each other. I will also try to run it myself, on a machine similar to your second one, to see if I can report more times.

tsadakane commented 3 years ago

Parameters

On Python and MATLAB    1                          2
Env                     1x GeForce GTX 1060 6GB    1x GeForce RTX 2080 Ti
Param                   nSize=1024, nangles=32     nSize=1024, nangles=512

Eight conditions in total, as below. The test program was run twice for each condition and the results below are from the second run.

Env 1

MATLAB

Param 1

nSize=1024. nangles=32

timeResults =

    0.6781    4.8520    3.7240
    0.6929    4.6302    3.3594
    0.6849    4.5824    4.0082

Param 2

nSize=1024. nangles=512

timeResults =

    0.6609   34.2021   25.8747
    0.7417   34.0776   26.1431
    0.7357   34.2087   24.9950

Python

Param 1

nSize=1024, nangles = 32
     0.head      1.ax     2.atb
0  0.658428  6.576520  4.431104
1  0.735676  5.504911  4.359733
2  0.660403  5.393731  4.389779

Param 2

nSize=1024, nangles = 512
     0.head       1.ax      2.atb
0  0.675225  18.972264  42.050955
1  0.678241  19.052635  41.914965
2  0.674133  18.769194  42.207841

Env 2

Matlab

Param 1

nSize=1024. nangles=32

timeResults =

    0.5475    2.0845    1.6587
    0.6289    2.0272    1.6679
    0.6326    2.0519    1.6569

Param 2

nSize=1024. nangles=512

timeResults =

    0.5540   10.1706    6.1627
    0.6297   10.2238    6.2279
    0.5347   10.2656    6.2983

Python

Param 1

nSize=1024, nangles = 32
     0.head      1.ax     2.atb
0  0.548758  3.067622  1.822450
1  0.642040  2.763940  1.843832
2  0.593922  2.718240  1.859237

Param 2

nSize=1024, nangles = 512
     0.head      1.ax      2.atb
0  0.562892  6.453038   9.952962
1  0.671913  6.093373  10.015541
2  0.665507  6.140727  10.063664

tsadakane commented 3 years ago

I think the nangles=32 case is too small to discuss this issue. The data below are the same as in the comment above, just with the Param 1 data removed:

Parameters

On Python and MATLAB    1                          2
Env                     1x GeForce GTX 1060 6GB    1x GeForce RTX 2080 Ti
Param                   nSize=1024, nangles=32     nSize=1024, nangles=512

Eight conditions in total, as below.

The test program was run twice for each condition and the results below are from the second run.

Env 1

MATLAB

Param 2

nSize=1024. nangles=512

timeResults =

    0.6609   34.2021   25.8747
    0.7417   34.0776   26.1431
    0.7357   34.2087   24.9950

Python

Param 2

nSize=1024, nangles = 512
     0.head       1.ax      2.atb
0  0.675225  18.972264  42.050955
1  0.678241  19.052635  41.914965
2  0.674133  18.769194  42.207841

Env 2

Matlab

Param 2

nSize=1024. nangles=512

timeResults =

    0.5540   10.1706    6.1627
    0.6297   10.2238    6.2279
    0.5347   10.2656    6.2983

Python

Param 2

nSize=1024, nangles = 512
     0.head      1.ax      2.atb
0  0.562892  6.453038   9.952962
1  0.671913  6.093373  10.015541
2  0.665507  6.140727  10.063664

We can observe that, in both environments, Ax is faster in Python than in MATLAB, while Atb is slower in Python than in MATLAB.

At first I thought this was due to a bug in test.py/test.m around recording the time, but I could not find one.

If you share your parameters and environment, it might be possible to check in my environments, if it is not too big 😄 I have another 2080 Ti on the second machine, which I didn't use in these tests.

AnderBiguri commented 3 years ago

Note that your experiments are not completely correct: the default geometry in Python and MATLAB is not the same (and I should fix that).
You need to give Python the following:

geo.DSD = 1536
geo.DSO = 1000

But most importantly, the default Atb() is not the same in Python and MATLAB (which is wrong too, and may be the main issue #209):

tigre.Atb(proj, geo, angles, gpuids=gpuids,krylov="FDK")
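
Put together, a sketch of the adjusted Python calls, continuing from the test script above (the values are the ones quoted in the two points above; nothing else is changed):

# Match the MATLAB default geometry (values quoted above)
geo.DSD = 1536  # distance source-detector (mm)
geo.DSO = 1000  # distance source-origin (mm)

# Use the same backprojector that MATLAB's Atb uses by default
backproj = tigre.Atb(proj, geo, angles, gpuids=gpuids, krylov="FDK")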

My own tests:

        1                          2
Param   nSize=1024, nangles=32     nSize=1024, nangles=512

On an AMD 16-core machine, Ubuntu 18.04, RTX 2080 Super

Param 1

Python:

0  0.423278  2.288260  2.239347
1  0.463425  2.147956  2.143175
2  0.423851  2.164195  2.083993

MATLAB:

    0.9253    2.2317    1.9437
    0.9201    2.1742    1.9317
    0.9144    2.3324    2.0364

Param 2

Python:

     0.head      1.ax      2.atb
0  0.443028  6.879069  8.400405
1  0.419899  7.143674  8.527130
2  0.427212  7.120157  8.526652

MATLAB:

    0.9392   13.2615    8.2951
    0.9288   14.0137    8.7923
    0.9408   14.0314    8.8088

Observations

  1. I was wrongly calling Atb(); the times are the same for the FDK backprojector.
  2. Ax with Siddon is indeed much faster in Python than in MATLAB. I can see this happening at the kernel level (using the NVIDIA Visual Profiler), so something is having a big impact here. Surprisingly, MATLAB uses 32 registers per thread while Python uses 34. Maybe some compiler optimization options?
  3. I have run similar tests in the Visual Profiler for the other 2 projection types. There I can see that the matched backprojection is also within about 10% for both, but the interpolated forward projection is faster in MATLAB than in Python (10 s vs 21 s).
  4. My GPU is taking around 30~40 s to unregister the memory on my machine, adding an enormous time to the matched backprojector. No idea why.

In summary: Atb() is the same for both; Ax() is not, but one projector is faster in Python (Siddon) and the other in MATLAB (interpolated).

tsadakane commented 3 years ago

@AnderBiguri

You need to give python the following: geo.DSD = 1536 geo.DSO = 1000

geometry_default.py and defaultGeometry.m both have those values, but the geo.accuracy values were different: Python was 0.5 and MATLAB was 1. Does this affect Ax?

AnderBiguri commented 3 years ago

@tsadakane yes it does! Not the Siddon version, but the interpolated one. The above was for Siddon, so it should not matter.
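
For anyone comparing the interpolated projector too, a minimal alignment step would be the following, continuing from the Python test script above (a sketch; which accuracy value to standardise on is not decided here):

# Align the ray-sampling step of the interpolated projector with the MATLAB default.
# The Siddon projector ignores this field, as noted above.
geo.accuracy = 1.0  # Python default is 0.5, MATLAB default is 1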

tsadakane commented 3 years ago

The difference was in sDetector. I think both your geo and mine have dDetector = 0.4 in MATLAB, while it is 0.8 in Python. Adding the following lines to test.m,

geo.nDetector=[nSize; nSize];
geo.dDetector=[0.8;0.8];
geo.sDetector=geo.dDetector.*geo.nDetector; % total size of the detector    (mm)

I had the following result with param 2:

1024, 512, 0.557315, Siddon, 5.025336, FDK, 5.833685
1024, 512, 0.533614, Siddon, 4.897123, FDK, 5.869296
1024, 512, 0.543004, Siddon, 4.933322, FDK, 5.896435
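
As a quick sanity check on the Python side, printing the detector fields should now show values matching the adjusted MATLAB geometry (a sketch; the expected values follow from the numbers quoted above):

import numpy as np
import tigre

# Rebuild the Python test geometry and inspect the detector fields
nSize = 1024
geo = tigre.geometry(mode="cone", nVoxel=np.array([nSize, nSize, nSize]), default=True)
geo.nDetector = np.array([nSize, nSize])
geo.dDetector = geo.sDetector / geo.nDetector

print("nDetector:", geo.nDetector)  # expected [1024 1024]
print("dDetector:", geo.dDetector)  # expected [0.8 0.8]
print("sDetector:", geo.sDetector)  # expected [819.2 819.2]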

AnderBiguri commented 3 years ago

Hi @tsadakane

I've seen you edited the post. Does it mean that we were doing the tests wrong and both are the same?

My memory-unregistering problem seems to happen only on this machine, so it is some particular problem of mine, not the code. That is a relief.

tsadakane commented 3 years ago

@AnderBiguri Sorry for the confusion.

Does it mean that we were doing the tests wrong and both are the same?

Yes, I believe so.

I think this stems from the difference between the defaultGeometry functions in MATLAB and Python, as you mentioned. If nVoxel is given, Python updates nDetector using that value, while MATLAB leaves it at 512.
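
A quick way to see this difference from the Python side (a sketch; the MATLAB behaviour is the one described above):

import numpy as np
import tigre

# Passing nVoxel makes the Python default geometry derive nDetector from it
geo = tigre.geometry(mode="cone", nVoxel=np.array([1024, 1024, 1024]), default=True)
print(geo.nDetector)  # [1024 1024] in Python
# In MATLAB, defaultGeometry('nvoxel',[1024;1024;1024]) leaves nDetector at [512;512],
# so nDetector/dDetector/sDetector must be set explicitly for a fair comparison.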

AnderBiguri commented 3 years ago

@tsadakane yeah that is correct. I should probably fix that.

In any case, this is good news! I will close the issue then.