PyWavelets / pywt

PyWavelets - Wavelet Transforms in Python
http://pywavelets.readthedocs.org
MIT License

Performance degradation between 0.2.2, 0.3.0, and 0.4.0 #157

Closed zstomp closed 7 years ago

zstomp commented 8 years ago

I've been using Mike Marino's ISWT routine with PyWavelets 0.2.2 for quite a while now. The function made it to PyWavelets 0.4.0, but its performance is significantly slower than with 0.2.2.

I extracted the iswt function from 0.4.0 and ran a test (performance.py) with 0.2.2 (7 s), 0.3.0 (16 s), and 0.4.0 (22 s). I was able to profile 0.4.0, and the problem seems to be the np.iscomplexobj() call in the idwt routine. Profiling 0.3.0 failed for me; for some reason the numbers get close to 0.2.2 under the profiler.
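For context, a cheap-looking check like np.iscomplexobj() can still dominate when a small transform is called thousands of times. A generic micro-benchmark sketch (not the actual performance.py; the array size and repeat count are made up):

```python
import timeit
import numpy as np

x = np.zeros(256)
n = 100000

# Average per-call cost of the dtype check; multiply by the number of
# idwt calls in a tight loop to estimate the total overhead.
per_call = timeit.timeit(lambda: np.iscomplexobj(x), number=n) / n
assert per_call > 0
```

Absolute numbers are machine-dependent; the point is only that per-call Python-level checks scale linearly with the number of transform calls.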

Test config: VirtualBox running Ubuntu 14.04; Python 2.7.11 and Numpy 1.10.4 running in an Anaconda 3.19.1 environment.

kwohlfahrt commented 8 years ago

Can confirm: 4.3 s, 11 s, and 16 s with 0.2.2, 0.3.0, and 0.4.0 respectively on my laptop (i5-4210U), numpy-1.10.4, python-2.7.11. Python 3.4.3 has times of 14 s and 19 s for 0.3.0 and 0.4.0; 0.2.2 doesn't work with it.

FWIW I'm working on #136, so this kind of regression should be rarer in the future, but I'll see if I can figure this one out too.

zstomp commented 7 years ago

FYI, 0.5.1 has the performance of 0.3.0, i.e. 2.5 times worse than 0.2.2.

grlee77 commented 7 years ago

I can confirm that 0.5.1 is the same as 0.3.0 in performance (both are better than 0.4.0). I think that is what we expect from #162, but it still does not match 0.2.2, which is why this issue hasn't been closed.

Python  3.5:
    release 0.3.0:  11.5 s
    release 0.4.0:  14.6 s
    release 0.5.1:  11.5 s
    my C99 complex branch: 9.8 s

Python 2.7:
    release 0.2.2:  5.2 s
    release 0.5.1:  11.7 s

I have a branch at https://github.com/grlee77/pywt/tree/cplx_cwarn_fixed that handles complex dtypes by using C99 complex on the C-side, so that the np.iscomplexobj checks are removed entirely. I haven't made a PR for that here because I don't think it can be compiled with MSVC (no C99 complex support).

I did test on that branch and performance on my laptop (Python 3.5 on OS X) improved about 20% from 11.5 s for release 0.5.1 to 9.8 s. Better, but still a ways from the Python 2 / v0.2.2 case, so complex support cannot be the primary culprit.

grlee77 commented 7 years ago

I should also note that for a larger problem size (i.e. changing size from 256 to 65536 and reducing repetitions from 1000 to 10 in the performance.py above) I get the same relative performance difference for ISWT, although the forward SWT is 7-8 times faster in 0.5.1 at this size due to a more efficient implementation.

grlee77 commented 7 years ago

Okay, I have found a simple solution that restores performance very close to 0.2.2 for ISWT: calling the idwt_axis Cython routine instead of the idwt Python routine from within iswt. That gets the time down to 6 s for me. I will make a PR for that.

Also, if you can process in batch mode (i.e. stack multiple 1D signals into a 2D array and then call swt with the axis argument), that will be MUCH faster than running each 1D transform individually. Unfortunately, iswt doesn't have axis support at the moment, but that would be pretty simple to add.
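The batch approach described above can be sketched like this (shapes, wavelet, and level here are made up for illustration; the axis argument requires pywt >= 0.5):

```python
import numpy as np
import pywt

# Hypothetical batch: 8 independent 1D signals of length 64, stacked as rows
signals = np.random.randn(8, 64)

# One swt call transforms every row at once along the last axis,
# instead of looping over 8 separate 1D calls
coeffs = pywt.swt(signals, 'db2', level=2, axis=-1)

# One (cA, cD) pair per level; each array has the same shape as the batch
assert len(coeffs) == 2
cA, cD = coeffs[0]
assert cA.shape == (8, 64) and cD.shape == (8, 64)
```

The speedup comes from moving the loop over signals into compiled code rather than Python.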

grlee77 commented 7 years ago

@zstomp please try out PR #255 and see if that fixes things for you.

zstomp commented 7 years ago

@grlee77 I noticed that iswt in my original performance.py used the 'per' extension mode, and changing it to 'periodization' sped up 0.5.1 by almost 40% (who would have thought?). Another, minor, improvement is if not isinstance(wavelet, Wavelet): wavelet = Wavelet(wavelet), although it doesn't address the original issue. idwt_axis didn't help at all.
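A minimal sketch of both tweaks, using a simple dwt/idwt round trip in place of the real iswt loop (data size and wavelet are arbitrary choices here):

```python
import numpy as np
import pywt

# Construct the Wavelet object once, outside any hot loop, instead of
# letting every dwt/idwt call re-parse the wavelet name string
wavelet = pywt.Wavelet('db2')

data = np.random.randn(256)

# Use the canonical mode name 'periodization' rather than the legacy
# alias 'per', which zstomp found to be measurably slower in 0.5.1
cA, cD = pywt.dwt(data, wavelet, mode='periodization')
rec = pywt.idwt(cA, cD, wavelet, mode='periodization')
assert np.allclose(rec, data)
```

The 40% figure above is zstomp's measurement; why the legacy mode name costs more per call isn't established in this thread.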

To summarize, 0.5.1 with the corrected iswt is 10 s, which is better than 0.3.0 (16 s) but still worse than 0.2.2 (7 s).

zstomp commented 7 years ago

@grlee77 idwt_single instead of idwt_axis makes it even faster than 0.2.2.

grlee77 commented 7 years ago

Switching to idwt_single was faster for me too (4.45 s). In that case, you need to make a copy of the coefficient arrays after the odd/even indexing so that they are contiguous when input to idwt_single, as that routine cannot handle non-contiguous inputs. idwt_axis handles that copying internally whenever the input isn't contiguous, but apparently not quite as efficiently.

I will update the PR and credit you in the commit message.

zstomp commented 7 years ago

@grlee77 I didn't realize there was a catch, since my higher-order test passed. Now I've made a test comparing the raw outputs of iswt with idwt and idwt_single without any copying, and they are equal. Am I creating more confusion than help?

grlee77 commented 7 years ago

hmmm... it looks like the input to idwt_single is specified as a contiguous memoryview. Maybe Cython automatically converts a non-contiguous input to a contiguous one for you. I will check up on that when I get a chance:

    dwt_single(data_t[::1] data, Wavelet wavelet, MODE mode)

grlee77 commented 7 years ago

Never mind the previous comment; I was looking at dwt_single instead of idwt_single. For idwt_single it is:

    cpdef idwt_single(np.ndarray cA, np.ndarray cD, Wavelet wavelet, MODE mode):

but then the C routines are passed e.g. <double *>cA.data without any strides info, which would seem to imply that the coefficients must be contiguous.

grlee77 commented 7 years ago

All tests do pass for me without the copies and, as expected, it is even faster in that case. I still need to take another look later at why omitting the copy seems to be okay.

grlee77 commented 7 years ago

Okay, I understand why the copy is not needed now. It relies on the way numpy indexing via integer arrays works. I was thinking of the scenario as b1 in the example below, which is not contiguous, but the code indexes with a numpy array of ints (fancy indexing), not a slice, so it already creates a contiguous copy.

e.g.

import numpy as np

a = np.arange(16)
b1 = a[::2]                   # slicing: a non-contiguous view into a
b2 = a[np.arange(0, 16, 2)]   # fancy indexing: a fresh, contiguous copy

So as long as nobody unwittingly changes the way the indexing is done, it is safe to not make a redundant copy. For safety & clarity, I think I will call np.ascontiguousarray() to make it clear that contiguous input is expected in idwt_single. This will not create a copy when the array is already contiguous and seems to still give the performance boost.
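The two claims above (fancy indexing yields a contiguous copy, and np.ascontiguousarray costs nothing for already-contiguous input) can be checked directly:

```python
import numpy as np

a = np.arange(16)
b1 = a[::2]                      # slice: non-contiguous view into a
b2 = a[np.arange(0, 16, 2)]      # fancy indexing: fresh contiguous copy

assert not b1.flags['C_CONTIGUOUS']
assert b2.flags['C_CONTIGUOUS']
assert not np.shares_memory(a, b2)       # b2 owns its own data
assert np.ascontiguousarray(b2) is b2    # already contiguous: same object back, no copy
```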

grlee77 commented 7 years ago

I went ahead and just put a comment noting that the indexed array will be a contiguous copy, to avoid any overhead from calls to np.ascontiguousarray(). Performance is now at 4 s on my system, which is better than the 0.2.2 case.

    Am I creating more confusion than help?

No, constructive feedback is greatly appreciated!

zstomp commented 7 years ago

It all makes sense now, thanks a lot!

rgommers commented 7 years ago

Can this be closed now, or is there anything left to do?

grlee77 commented 7 years ago

Closed by #255. Thanks for reviewing it, @rgommers.