Closed nikitinvv closed 4 years ago
Hello @nikitinvv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Currently, the adjoint operator is very slow because it involves computing USFFT from unequally-spaced to equally-spaced grid, which is based on computing the 'scattering' operation...
One of the issues that you raise is that atomic operations during the adjoint operation are expensive. However, every scattering operation can be expressed as a gathering operation. i.e. instead of scattering point to grid-spaces, gather grid-spaces from points. In the case where the number of grid-space and points are the same, I think this should have the same computational complexity without atomic operations.
Use pyinstrument to benchmark the conjugate gradient algorithm. ...
_ ._ __/__ _ _ _ _ _/_ Recorded: 14:53:53 Samples: 97110
/_//_/// /_\ / //_// / //_'/ // Duration: 100.870 CPU time: 100.826
/ _/ v3.1.3
Program: profile_lamino.py
100.870 template_algorithm profile_lamino.py:36
└─ 100.870 reconstruct tike/lamino/lamino.py:88
└─ 100.276 lamcg tike/lamino/solvers/lamcg.py:8
└─ 100.276 update_obj tike/lamino/solvers/lamcg.py:21
└─ 100.276 conjugate_gradient tike/opt.py:79
├─ 63.251 grad tike/lamino/solvers/lamcg.py:26
│ └─ 63.251 grad tike/operators/numpy/lamino.py:62
│ ├─ 44.352 adj tike/operators/numpy/lamino.py:49
│ │ └─ 44.300 us2eq tike/operators/numpy/usfft.py:75
│ │ ├─ 16.849 nonzero cupy/sorting/search.py:120
│ │ │ [11 frames hidden] cupy, <__array_function__ internals>,...
│ │ ├─ 11.385 [self]
│ │ ├─ 10.067 bincount cupy/statistics/histogram.py:100
│ │ │ [18 frames hidden] cupy, <__array_function__ internals>,...
│ │ ├─ 3.165 can_cast <__array_function__ internals>:2
│ │ │ [4 frames hidden] <__array_function__ internals>, numpy
│ │ └─ 2.489 <lambda> tike/operators/numpy/usfft.py:103
│ │ └─ 1.534 [self]
│ └─ 18.894 fwd tike/operators/numpy/lamino.py:38
│ └─ 18.832 eq2us tike/operators/numpy/usfft.py:10
│ ├─ 9.336 [self]
│ ├─ 5.797 <lambda> tike/operators/numpy/usfft.py:42
│ │ ├─ 3.829 [self]
│ │ └─ 1.959 can_cast <__array_function__ internals>:2
│ │ [4 frames hidden] <__array_function__ internals>, numpy
│ └─ 3.129 can_cast <__array_function__ internals>:2
│ [4 frames hidden] <__array_function__ internals>, numpy
└─ 37.000 line_search tike/opt.py:16
└─ 36.989 cost_function tike/lamino/solvers/lamcg.py:23
└─ 36.989 cost tike/operators/numpy/lamino.py:58
└─ 36.975 fwd tike/operators/numpy/lamino.py:38
└─ 36.849 eq2us tike/operators/numpy/usfft.py:10
├─ 18.213 [self]
├─ 11.718 <lambda> tike/operators/numpy/usfft.py:42
│ ├─ 7.818 [self]
│ └─ 3.899 can_cast <__array_function__ internals>:2
│ [4 frames hidden] <__array_function__ internals>, numpy
└─ 6.394 can_cast <__array_function__ internals>:2
[4 frames hidden] <__array_function__ internals>, numpy
ok
----------------------------------------------------------------------
Ran 1 test in 106.832s
It looks like the nonzero
operator is actually taking longer than bincount
. Haven't tried removing nonzero
while keeping bincount
, but I tried some alternate approaches without (bincount
and nonzero
). None were faster. I think this requires a RawKernel because alternate approaches require more Python loops whose overhead is probably overcoming any advantage from avoiding atomic operations.
@nikitinvv, Please check the docstrings then merge this PR if everything makes sense. I was able to speed up the USFFT with RawKernel, but I'm going to open a new PR for that.
Thanks, @nikitinvv! :fireworks:
Purpose
In this PR, I am adding laminography operators, so as the corresponding adjoint test, cg solver, and tests for consistency in a similar way it was done with the ptycho solver.
Approach
The forward and adjoint laminography operators are implemented by using 3D unequally-spaced fast Fourier transforms (USFFT) where each operator assumes numpy or cupy backend given with by the env variable TIKE_BACKEND. No cuda raw kernels are used so far.
General
Computing laminography operators involves 2 main operations: 3D USFFT and 2D IFFT, e.g. the forward operator is computed as follows:
Issue
Currently, the adjoint operator is very slow because it involves computing USFFT from unequally-spaced to equally-spaced grid, which is based on computing the 'scattering' operation: values on the unequally-spaced grid with different weights are added to neighboring points of the equally-spaced grid. This is not directly parallel process and needs atomic operations. In the current code, this operation is done by using numpy
bincount
function (seesrc/tike/operators/lamino/usfft.py
function us2eq):All computing time is spent for numpy
bincount
function. So it should be accelerated in numpy/cupy regime. Another way is to write corresponding cuda raw kernel for the scattering operation, that will involve AtomicAdd operation from Cuda C.Pre-Merge Checklists
Submitter
yapf
to format python code.Reviewer