Can confirm: 4.3 s, 11 s and 16 s with 0.2.2, 0.3.0 and 0.4.0 respectively on my laptop (i5-4210U), numpy 1.10.4, Python 2.7.11. Python 3.4.3 gives 14 s and 19 s for 0.3.0 and 0.4.0, and doesn't work with 0.2.2.
FWIW I'm working on #136 so this should be rarer in future, but I'll see if I can figure this one out too.
FYI 0.5.1 has the performance of 0.3.0, i.e. 2.5 times worse than 0.2.2.
I can confirm that 0.5.1 is the same as 0.3.0 in performance (both are better than 0.4.0). I think that is what we expect from #162, but it still does not match 0.2.2, which is why this issue hasn't been closed.
Python 3.5:
release 0.3.0: 11.5 s
release 0.4.0: 14.6 s
release 0.5.1: 11.5 s
my C99 complex branch: 9.8 s
Python 2.7:
release 0.2.2: 5.2 s
release 0.5.1: 11.7 s
I have a branch at https://github.com/grlee77/pywt/tree/cplx_cwarn_fixed that handles complex dtypes by using C99 complex on the C side, so that the np.iscomplexobj checks are removed entirely. I haven't made a PR for that here because I don't think it can be compiled with MSVC (no C99 complex support).
I did test on that branch and performance on my laptop (Python 3.5 on OS X) improved about 20% from 11.5 s for release 0.5.1 to 9.8 s. Better, but still a ways from the Python 2 / v0.2.2 case, so complex support cannot be the primary culprit.
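For context, the kind of Python-side dispatch being discussed looks roughly like the sketch below. This is a schematic illustration, not the actual pywt source; the function and argument names are made up for the example.

```python
import numpy as np

def idwt_with_complex_dispatch(cA, cD, real_idwt, wavelet, mode):
    # If either coefficient array is complex, transform the real and imaginary
    # parts separately and recombine; otherwise take the real-valued fast path.
    # These per-call checks are the overhead the C99-complex branch avoids.
    if np.iscomplexobj(cA) or np.iscomplexobj(cD):
        rec_real = real_idwt(cA.real, cD.real, wavelet, mode)
        rec_imag = real_idwt(cA.imag, cD.imag, wavelet, mode)
        return rec_real + 1j * rec_imag
    return real_idwt(cA, cD, wavelet, mode)
```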
I should also note that for a larger problem size (i.e. changing size from 256 to 65536 and reducing repetitions from 1000 to 10 in the performance.py above) I get the same relative performance difference for ISWT (although the forward SWT is 7-8 times faster in 0.5.1 at this size, due to a more efficient implementation).
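A timing loop in the spirit of that change might look like the following sketch; the original performance.py isn't reproduced here, so the wavelet, level, and structure are illustrative only.

```python
import time
import numpy as np
import pywt

size, repetitions = 65536, 10          # the larger case; the smaller case used 256 and 1000
data = np.random.randn(size)

coeffs = pywt.swt(data, 'db2', level=5)   # forward SWT once, outside the timed loop
start = time.time()
for _ in range(repetitions):
    pywt.iswt(coeffs, 'db2')
print('iswt: %.2f s' % (time.time() - start))
```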
Okay, I have found a simple solution that restores very close to the 0.2.2 performance for ISWT. It involves calling the idwt_axis Cython routine instead of the idwt Python routine from within iswt, as sketched below. That gets the time down to 6 s for me. I will make a PR for that.
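A minimal sketch of that change is below. The import path and the exact idwt_axis argument order are assumptions here (the real change is in the PR); the point is simply that the inner step of iswt calls the Cython routine directly instead of going through the pure-Python idwt wrapper.

```python
# Assumed import path and argument order for the Cython routine; see the PR
# for the actual change.
from pywt._extensions._dwt import idwt_axis

def _iswt_step(cA_sel, cD_sel, wavelet, mode, axis=-1):
    # Bypass the pure-Python pywt.idwt() wrapper (and its per-call input
    # validation) by calling the Cython-level reconstruction routine directly.
    return idwt_axis(cA_sel, cD_sel, wavelet, mode, axis)
```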
Also, if you can process in batch mode (i.e. stack multiple 1D transforms into a 2D array and then call swt with the axis argument), that will be MUCH faster than running each 1D transform individually. Unfortunately, iswt doesn't have axis support at the moment, but that would be pretty simple to add.
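For example, a batch call along the last axis might look like this (the array shape and wavelet are illustrative; the axis keyword of swt is the feature being relied on):

```python
import numpy as np
import pywt

signals = np.random.randn(100, 1024)                   # 100 signals of length 1024
coeffs = pywt.swt(signals, 'db2', level=3, axis=-1)    # one call instead of a 100-iteration Python loop
```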
@zstomp please try out PR #255 and see if that fixes things for you
@grlee77 I noticed that iswt in my original performance.py used the 'per' extension mode, and changing it to 'periodization' sped up 0.5.1 by almost 40% (who would have thought?). Another, minor, improvement is if not isinstance(wavelet, Wavelet): wavelet = Wavelet(wavelet), although it doesn't address the original issue. idwt_axis didn't help at all.
To summarize, 0.5.1 with the corrected iswt takes 10 s, which is better than 0.3.0 (16 s) but still worse than 0.2.2 (7 s).
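For reference, the two tweaks mentioned above amount to something like the following snippet (illustrative, not the original performance.py):

```python
import numpy as np
import pywt

wavelet = 'db2'
if not isinstance(wavelet, pywt.Wavelet):
    wavelet = pywt.Wavelet(wavelet)   # build the Wavelet object once and reuse it

x = np.random.randn(256)
cA, cD = pywt.dwt(x, wavelet, mode='periodization')    # spell out 'periodization' rather than 'per'
rec = pywt.idwt(cA, cD, wavelet, mode='periodization')
```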
@grlee77 Using idwt_single instead of idwt_axis makes it even faster than 0.2.2.
Switching to idwt_single was faster for me too (4.45 s). In that case, you need to make a copy of the coefficient arrays after the odd/even indexing so that they are contiguous when input to idwt_single, as that routine cannot handle non-contiguous inputs. idwt_axis handles that copying internally whenever the input isn't contiguous, but apparently not quite as efficiently.
I will update the PR and credit you in the commit message
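A hedged sketch of what that inner step looks like is below. The idwt_single signature matches the Cython declaration quoted later in this thread, but the import path, the mode handling, and the surrounding names are assumptions rather than the literal PR #255 diff.

```python
import numpy as np
import pywt
from pywt._extensions._dwt import idwt_single   # assumed import path

def _reconstruct_step(approx, detail, even_indices, wavelet):
    # even_indices is an integer index array, so the indexing below already
    # yields new arrays; ascontiguousarray makes the contiguity requirement
    # of idwt_single explicit without copying again.
    cA = np.ascontiguousarray(approx[even_indices])
    cD = np.ascontiguousarray(detail[even_indices])
    return idwt_single(cA, cD, wavelet, pywt.Modes.periodization)
```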
@grlee77 I didn't realize there was a catch, since my higher-order test passed. Now I've made a test comparing the raw outputs of iswt with idwt and idwt_single without any copying, and they are equal. Am I creating more confusion than help?
Hmmm... it looks like the input to idwt_single is specified as a contiguous memoryview. Maybe Cython automatically converts a non-contiguous input to contiguous for you. I will check up on that when I get a chance:
dwt_single(data_t[::1] data, Wavelet wavelet, MODE mode)
Never mind the previous comment; I was looking at dwt_single instead of idwt_single. For idwt_single it is:
cpdef idwt_single(np.ndarray cA, np.ndarray cD, Wavelet wavelet, MODE mode):
but then the C routines are passed e.g. <double *>cA.data without any strides info, which would seem to imply the coefficients must be contiguous.
All tests do pass for me without the copies and, as expected, it is even faster in that case. I still need to take another look later at why omitting the copy seems to be okay.
Okay, I understand now why the copy is not needed. This relies on the way numpy indexing via arrays works. I was thinking of the scenario as b1 in the example below, which is not contiguous, but the code indexes with a numpy array of ints, not a slice, so this already creates a contiguous copy.
e.g.
a = np.arange(16)
b1 = a[::2] # non-contiguous
b2 = a[np.arange(0, 16, 2)] # contiguous
So as long as nobody unwittingly changes the way the indexing is done, it is safe not to make a redundant copy. For safety and clarity, I think I will call np.ascontiguousarray() to make it clear that contiguous input is expected in idwt_single. This will not create a copy in the case where the array is already contiguous and seems to still give the performance boost.
I went ahead and just put a comment noting that the indexed array will be a contiguous copy to avoid any overhead from calls to np.ascontiguousarray(). Performance is now at 4 s on my system which is better than the 0.2.2 case.
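For reference, the contiguity behaviour being relied on above can be checked directly with the array flags:

```python
import numpy as np

a = np.arange(16)
b1 = a[::2]                      # slice: a strided view of a
b2 = a[np.arange(0, 16, 2)]      # integer-array indexing: a new contiguous array

print(b1.flags['C_CONTIGUOUS'])  # False
print(b2.flags['C_CONTIGUOUS'])  # True
```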
Am I creating more confusion than help?
no. constructive feedback is greatly appreciated!
It all makes sense now, thanks a lot!
Can this be closed now, or is there anything left to do?
closed by #255. thanks for reviewing it, @rgommers
I've been using Mike Marino's ISWT routine with PyWavelets 0.2.2 for quite a while now. The function made it to PyWavelets 0.4.0, but its performance is significantly slower than with 0.2.2.
I extracted the iswt function from 0.4.0 and ran a test (performance.py) with 0.2.2 (7 s), 0.3.0 (16 s), and 0.4.0 (22 s). I was able to profile 0.4.0, and the problem seems to be the np.iscomplexobj() call in the idwt routine. Profiling 0.3.0 failed for me; for some reason the numbers get close to 0.2.2 under the profiler. Test config: VirtualBox running Ubuntu 14.04; Python 2.7.11 and NumPy 1.10.4 running in an Anaconda 3.19.1 environment.
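For anyone reproducing the profile, a minimal standard-library approach is sketched below when run as a script; the sizes and wavelet are illustrative, not the original performance.py.

```python
import cProfile
import pstats

import numpy as np
import pywt

data = np.random.randn(256)
coeffs = pywt.swt(data, 'db2', level=5)

# Profile the inverse transform and print the ten most expensive calls.
cProfile.run('pywt.iswt(coeffs, "db2")', 'iswt.prof')
pstats.Stats('iswt.prof').sort_stats('cumulative').print_stats(10)
```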