avoid memory allocation in fourier.dft2()

shantirao commented 4 years ago

In which we go to extraordinary lengths to avoid allocating memory

The intermediate working matrices are allocated and cached, as repeated calls tend to have the same size of the input and output matrices. Some of these can even be partially precomputed, such as the X, Y, U, V arrays, except for the phase shifts sx and sy). These are small enough that allocating them dynamically doesn't seem to save much time.

An out parameter is added. If this is None, then the dft2() function allocates a new matrix as needed. But if it's a slice into a 3D NumPy matrix (of type complex128) or a 2D NumPy matrix, that destination will be passed to the matrix operations, eliminating allocations of memory for intermediate operations.

This will have the most effect on speed when calling fourier.dft2() over and over with the same size inputs and outputs.

andykee commented 4 years ago

I would expect that since the new dft2 out parameter mirrors both the name and intended usage of Numpy's out parameter, it should behave the same way:

>>> import numpy as np
>>> a = np.random.uniform((10,10))
>>> b = np.multiply(a, 5, out=a)
>>> b is a

True

The Lentil version doesn't get quite so far (unit test in 28acaf2):

>>> import numpy as np
>>> import lentil as le
>>> f = np.random.uniform((10,10))
>>> F = le.fourier.dft2(f, 1/10, out=f)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-486269a1b018> in <module>
----> 1 le.fourier.dft2(f, 1/10, out=f)

~/Dev/lentil/lentil/fourier.py in dft2(f, alpha, npix, shift, unitary, out)
    161 
    162         np.dot(E1,f,out = Mbyn)
--> 163         F = np.dot(Mbyn,E2,out = out)
    164 
    165     # now calculate the answer, without allocating memory

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: output array is not acceptable (must have the right datatype, number of dimensions, and be a C-Array)

shantirao commented 4 years ago

The preallocated output has to be a complex64

andykee commented 4 years ago

I reworked the caching approach a bit to make it slightly safer and 3x more performant. It's now a two stage deal where the coordinate vectors X, Y, U, and V are cached (this catches most of the cases) and then E1 and E2 are cached (this is useful for when the f input varies but shift is the same, as is the case with segmented models).

I also included a check to ensure the result can be cast into the out array if one is provided.

andykee / lentil

avoid memory allocation in fourier.dft2() #13