AdvancedPhotonSource / tike

Repository for ptychography software
http://tike.readthedocs.io

NEW: Add Laminography operators #62

Closed nikitinvv closed 4 years ago

nikitinvv commented 4 years ago

Purpose

In this PR, I am adding laminography operators, along with the corresponding adjoint test, CG solver, and consistency tests, in a similar way to what was done for the ptycho solver.

Approach

The forward and adjoint laminography operators are implemented using 3D unequally-spaced fast Fourier transforms (USFFT), where each operator assumes a numpy or cupy backend selected by the environment variable TIKE_BACKEND. No CUDA raw kernels are used so far.
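
For reference, a minimal sketch of how such an environment-variable switch between backends could look (this is an illustration only, not tike's actual module layout or import logic):

import os

# Pick the array module from the TIKE_BACKEND environment variable.
# The default value and the exact mechanism here are assumptions.
if os.environ.get('TIKE_BACKEND', 'numpy').lower() == 'cupy':
    import cupy as xp
else:
    import numpy as xp

# Operators can then call xp.fft, xp.exp, etc. without caring which
# backend is active, since numpy and cupy share most of their API.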

General

Computing the laminography operators involves two main operations: a 3D USFFT and a 2D IFFT. For example, the forward operator is computed as follows:

def fwd(self, u, **kwargs):
    """Perform the forward Laminography transform."""
    # USFFT from the equally-spaced grid to the unequally-spaced grid
    F = eq2us(u, self.xi, self.n, self.eps, self.xp).reshape(
        [self.ntheta, self.n, self.n])

    # Inverse 2D FFT over the last two axes of each projection
    data = self.xp.fft.fftshift(self.xp.fft.ifft2(self.xp.fft.fftshift(
        F, axes=(1, 2)), axes=(1, 2), norm="ortho"), axes=(1, 2))
    return data

Issue

Currently, the adjoint operator is very slow because it involves computing the USFFT from the unequally-spaced grid to the equally-spaced grid, which relies on a 'scattering' operation: values on the unequally-spaced grid are added, with different weights, to the neighboring points of the equally-spaced grid. This process is not directly parallel and needs atomic operations. In the current code, the operation is done with the numpy bincount function (see the function us2eq in src/tike/operators/lamino/usfft.py):

for i0 in range(2*m):
    delta0 = (ell0-m+i0).astype('float32')/(2*n)-x[:, 0]
    for i1 in range(2*m):
        delta1 = (ell1-m+i1).astype('float32')/(2*n)-x[:, 1]
        for i2 in range(2*m):
            delta2 = (ell2-m+i2).astype('float32')/(2*n)-x[:, 2]

            # Gaussian interpolation weights and flat indices into the grid
            kera = cons[0]*xp.exp(-cons[1]*(delta0**2+delta1**2+delta2**2))
            ids = n+ell2+i2+(2*n+2*m)*(n+ell1+i1) + \
                (2*n+2*m)*(2*n+2*m)*(n+ell0+i0)
            vals = f*kera
            # accumulate by indexes (with possible index collisions)
            # TODO: accelerate bincount!
            vals = xp.bincount(ids, weights=vals.real) + \
                1j*xp.bincount(ids, weights=vals.imag)
            ids = xp.nonzero(vals)[0]
            G[ids] += vals[ids]

All of the computing time is spent in the numpy bincount function, so it should be accelerated in the numpy/cupy setting. Another option is to write a corresponding CUDA raw kernel for the scattering operation, which would use the atomicAdd operation from CUDA C.
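
As a rough sketch of what such a raw kernel could look like with cupy (the index and weight computations of us2eq are omitted; the names, launch parameters, and interleaved complex layout below are assumptions, not code from this PR):

import cupy as cp

# Scatter-add kernel: index collisions between threads are resolved with
# atomicAdd on the real and imaginary parts of the complex64 grid.
_scatter_add = cp.RawKernel(r'''
extern "C" __global__
void scatter_add(float *G, const float *vals, const long long *ids, int n)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;
    long long j = ids[tid];
    // complex64 is stored as interleaved (real, imag) float pairs
    atomicAdd(&G[2 * j    ], vals[2 * tid    ]);
    atomicAdd(&G[2 * j + 1], vals[2 * tid + 1]);
}
''', 'scatter_add')

def scatter_add(G, vals, ids):
    """Accumulate complex64 vals into complex64 G at flat indices ids."""
    n = ids.size
    threads = 256
    blocks = (n + threads - 1) // threads
    _scatter_add((blocks,), (threads,),
                 (G.view(cp.float32), vals.view(cp.float32),
                  ids.astype(cp.int64), cp.int32(n)))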

Pre-Merge Checklists

Submitter

Reviewer

pep8speaks commented 4 years ago

Hello @nikitinvv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-05-16 03:34:24 UTC
carterbox commented 4 years ago

Currently, the adjoint operator is very slow because it involves computing USFFT from unequally-spaced to equally-spaced grid, which is based on computing the 'scattering' operation...

One of the issues that you raise is that atomic operations during the adjoint operation are expensive. However, every scattering operation can be expressed as a gathering operation, i.e. instead of scattering points to grid spaces, gather grid spaces from points. In the case where the numbers of grid spaces and points are the same, I think this should have the same computational complexity without atomic operations.
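
A toy 1D example of the scatter/gather duality (illustration only, unrelated to the USFFT code itself):

import numpy as np

def scatter(values, cells, ngrid):
    """Loop over points; each adds its value to the grid cell it maps to."""
    grid = np.zeros(ngrid, dtype=values.dtype)
    for v, c in zip(values, cells):
        grid[c] += v  # concurrent writes to the same cell need atomics
    return grid

def gather(values, cells, ngrid):
    """Loop over grid cells; each sums the values of the points mapping to it."""
    grid = np.zeros(ngrid, dtype=values.dtype)
    for c in range(ngrid):
        grid[c] = values[cells == c].sum()  # each cell is written exactly once
    return grid

# Both give the same result; gather needs no atomic operations because
# every output cell is owned by exactly one iteration (or GPU thread).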

carterbox commented 4 years ago
Use pyinstrument to benchmark the conjugate gradient algorithm. ... 
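
A minimal pyinstrument session that could produce a report like the one below (the actual profile_lamino.py script is not part of this excerpt, so the reconstruction call is only a placeholder):

from pyinstrument import Profiler

profiler = Profiler()
profiler.start()

# ... run the reconstruction to be profiled here, e.g. the conjugate
# gradient solver from tike.lamino (placeholder; arguments omitted) ...

profiler.stop()
print(profiler.output_text(unicode=True, color=True))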

  _     ._   __/__   _ _  _  _ _/_   Recorded: 14:53:53  Samples:  97110
 /_//_/// /_\ / //_// / //_'/ //     Duration: 100.870   CPU time: 100.826
/   _/                      v3.1.3

Program: profile_lamino.py

100.870 template_algorithm  profile_lamino.py:36
└─ 100.870 reconstruct  tike/lamino/lamino.py:88
   └─ 100.276 lamcg  tike/lamino/solvers/lamcg.py:8
      └─ 100.276 update_obj  tike/lamino/solvers/lamcg.py:21
         └─ 100.276 conjugate_gradient  tike/opt.py:79
            ├─ 63.251 grad  tike/lamino/solvers/lamcg.py:26
            │  └─ 63.251 grad  tike/operators/numpy/lamino.py:62
            │     ├─ 44.352 adj  tike/operators/numpy/lamino.py:49
            │     │  └─ 44.300 us2eq  tike/operators/numpy/usfft.py:75
            │     │     ├─ 16.849 nonzero  cupy/sorting/search.py:120
            │     │     │     [11 frames hidden]  cupy, <__array_function__ internals>,...
            │     │     ├─ 11.385 [self]  
            │     │     ├─ 10.067 bincount  cupy/statistics/histogram.py:100
            │     │     │     [18 frames hidden]  cupy, <__array_function__ internals>,...
            │     │     ├─ 3.165 can_cast  <__array_function__ internals>:2
            │     │     │     [4 frames hidden]  <__array_function__ internals>, numpy
            │     │     └─ 2.489 <lambda>  tike/operators/numpy/usfft.py:103
            │     │        └─ 1.534 [self]  
            │     └─ 18.894 fwd  tike/operators/numpy/lamino.py:38
            │        └─ 18.832 eq2us  tike/operators/numpy/usfft.py:10
            │           ├─ 9.336 [self]  
            │           ├─ 5.797 <lambda>  tike/operators/numpy/usfft.py:42
            │           │  ├─ 3.829 [self]  
            │           │  └─ 1.959 can_cast  <__array_function__ internals>:2
            │           │        [4 frames hidden]  <__array_function__ internals>, numpy
            │           └─ 3.129 can_cast  <__array_function__ internals>:2
            │                 [4 frames hidden]  <__array_function__ internals>, numpy
            └─ 37.000 line_search  tike/opt.py:16
               └─ 36.989 cost_function  tike/lamino/solvers/lamcg.py:23
                  └─ 36.989 cost  tike/operators/numpy/lamino.py:58
                     └─ 36.975 fwd  tike/operators/numpy/lamino.py:38
                        └─ 36.849 eq2us  tike/operators/numpy/usfft.py:10
                           ├─ 18.213 [self]  
                           ├─ 11.718 <lambda>  tike/operators/numpy/usfft.py:42
                           │  ├─ 7.818 [self]  
                           │  └─ 3.899 can_cast  <__array_function__ internals>:2
                           │        [4 frames hidden]  <__array_function__ internals>, numpy
                           └─ 6.394 can_cast  <__array_function__ internals>:2
                                 [4 frames hidden]  <__array_function__ internals>, numpy

ok

----------------------------------------------------------------------
Ran 1 test in 106.832s

It looks like the nonzero operator is actually taking longer than bincount. I haven't tried removing nonzero while keeping bincount, but I did try some alternate approaches without bincount and nonzero; none were faster. I think this requires a RawKernel, because the alternate approaches need more Python loops, whose overhead probably outweighs any advantage from avoiding atomic operations.

carterbox commented 4 years ago

@nikitinvv, please check the docstrings, then merge this PR if everything makes sense. I was able to speed up the USFFT with a RawKernel, but I'm going to open a new PR for that.

carterbox commented 4 years ago

Thanks, @nikitinvv! :fireworks: