OverLordGoldDragon / ssqueezepy

Synchrosqueezing, wavelet transforms, and time-frequency analysis in Python
MIT License

GPU performance questions, `maprange` #78

Closed journeytosilius closed 1 year ago

journeytosilius commented 1 year ago

Hi, it's me again

I wanted to know if there is an example of `maprange` around, so I can run some quick tests and study it from there?

In my tests, I have observed that if a sudden energy peak within a window is much higher than the already "detected" clusters of energy, the earlier, lower-energy clusters are marginalized, in the sense that they become dark (they lose the color they had).

What is a practical way, if there is any, to tune the scales so this effect does not happen, or is at least less obvious?
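(For illustration: independent of scale tuning, one generic workaround for the "everything else goes dark" effect is to compress or normalize the magnitudes before plotting. A minimal NumPy sketch with toy data, not ssqueezepy code:)

```python
import numpy as np

# Toy time-frequency magnitudes: weak background clusters plus one
# dominant burst that would otherwise saturate the colormap.
np.random.seed(0)
A = np.abs(np.random.randn(64, 256)) * 0.1
A[20:24, 100:110] = 5.0  # sudden high-energy peak

# Option 1: log compression -- weak structures stay visible.
A_log = np.log1p(A / A.mean())

# Option 2: per-column (per-time-slice) normalization.
A_col = A / (A.max(axis=0, keepdims=True) + 1e-12)

# Dynamic range (max/mean) shrinks a lot after log compression.
print(A.max() / A.mean(), A_log.max() / A_log.mean())
```

Either of these feeds into `plt.imshow` in place of raw `np.abs(Tx)`; log compression keeps relative structure, per-column normalization emphasizes ridges per time step.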

Thank you very much for this wonderful library

OverLordGoldDragon commented 1 year ago

I'll note that any other `maprange` is a subset of `'maximal'`, hence `'maximal'` contains the most information. While the goal is to tile every frequency, unfortunately `'maximal'` doesn't always accomplish that, so one should play around with the code and visuals in `examples/scales_selection.py` to be sure. Glad you like the library.
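(For intuition only: log-distributed scales are geometrically spaced between a minimum and maximum scale, so every octave gets equal coverage. A toy NumPy sketch with made-up bounds, not ssqueezepy's actual scale-selection logic:)

```python
import numpy as np

def log_scales(n=64, s_min=1.0, s_max=256.0):
    """Geometrically spaced CWT scales: equal number of scales per octave."""
    return np.geomspace(s_min, s_max, n)

scales = log_scales()
ratios = scales[1:] / scales[:-1]
# A constant ratio between consecutive scales is what makes spacing "log";
# on a log-frequency axis the tiling is then uniform.
print(ratios.min(), ratios.max())
```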

OverLordGoldDragon commented 1 year ago

Actually, I confused `maprange='maximal'` with `scales='log:maximal'`. `maprange` shouldn't be touched for most purposes; it controls reassignment locations onto the SSQ plane from CWT. Relevant article.

journeytosilius commented 1 year ago

Thank you for pointing that out; it's very helpful. I have realized that the window size matters: basically, you need the maxima of the frequency range to be located/contained within the window you are working with, and then the frequency range becomes properly distributed. In some applications this might not be possible, but for what I'm doing right now it works well.

I have another question, if you don't mind:

10k-ssq_cwt | 0.372 | 0.148 | 0.00941 | 

By using `cache_wavelet=True` I have achieved a constant 0.2 processing time per run on an array of increasing size that starts at 10 samples and keeps growing; I have tested up to 5K samples. But I can't reach that 0.00941 number on an RTX 2080 SUPER. Can you describe what other methods you use, besides caching the wavelet, to achieve such speeds?

Thank you very much

OverLordGoldDragon commented 1 year ago

Many things influence performance. Have you tried (and studied) the benchmark code? What comes to mind: 1) passing an array that's already on the GPU; 2) excluding "warmups" from benchmarks, i.e. running a few times first and only then timing, so that caching (beyond ssqueezepy's) isn't counted.
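(To illustrate point 2: a generic warmup-aware timing pattern, using NumPy's FFT as a stand-in workload; this is not the actual benchmark code.)

```python
import time
import numpy as np

def bench(fn, n_warmup=3, n_runs=10):
    """Time fn(), discarding warmup runs so one-time caching isn't counted."""
    for _ in range(n_warmup):
        fn()                      # warmup: triggers caching, allocation, etc.
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)             # min is robust to background-load noise

x = np.random.randn(10_000)
t = bench(lambda: np.fft.fft(x))
print(f"{t:.6f} s per run")
```

With GPU code, one would additionally call `torch.cuda.synchronize()` before reading the clock, since CUDA ops launch asynchronously and the Python call can return before the kernel finishes.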

journeytosilius commented 1 year ago

Thank you for answering. I was completely unaware of 1), passing an array that's already on the GPU, since I'm new to Torch. This makes sense then; now I need to study the benchmark code and figure out how to continuously feed the array on the GPU instead of passing a new one each time, I presume! Thanks

OverLordGoldDragon commented 1 year ago

That's not a flaw in your benchmarking; rather, I chose to report a different metric. For some applications the CPU->GPU overhead is important and must be accounted for, but that's not the case most of the time, and other benchmarks tend to do the same.

journeytosilius commented 1 year ago

Just wanted to point out that the problem was a CUDA Toolkit version mismatch: I had 11.7 installed while PyTorch was built against 11.6. Even though it did not work correctly, it partially worked somehow (or so it seemed), since tensors were being produced with GPU acceleration selected. I have now aligned the CUDA Toolkit versions and I do get the numbers published in the benchmarks. Thank you for your help.

OverLordGoldDragon commented 1 year ago

That makes more sense since the overhead for such a small input should be negligible.

OverLordGoldDragon commented 1 year ago

In fact, it's always negligible for ssqueezepy relative to the transform times, with the possible exception of moving the outputs back to CPU (the outputs are >> `x.size`), so the benchmarks are even more general.

journeytosilius commented 1 year ago

Yes, I just realized that setting `astensor=False` increases the timings by about 3-4x, but it's still much better now since I can keep it under 50 ms per iteration with 20K samples :)

OverLordGoldDragon commented 1 year ago

Note that if you only need `Tx`, it's better to set `astensor=True` and then manually move `Tx` back to CPU and free `Wx` from the GPU's memory. The latter should happen automatically with something like `Tx = ssq_cwt(x, wavelet)[0].cpu().numpy()`. For processing large datasets there are libraries (e.g. joblib) that allow doing GPU->CPU transfers in parallel with GPU ops, effectively nulling the overhead.
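(A sketch of the "keep only what you need" pattern in plain Torch, using random tensors as stand-ins for `ssq_cwt`'s outputs; it falls back to CPU when CUDA is unavailable. `del` plus `torch.cuda.empty_cache()` is the usual way to release GPU memory once the last reference is dropped.)

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Stand-ins for ssq_cwt's outputs (Tx, Wx) living on the GPU.
Tx = torch.randn(64, 1024, device=device)
Wx = torch.randn(64, 1024, device=device)

Tx_np = Tx.cpu().numpy()      # keep only what's needed, as NumPy on CPU
del Tx, Wx                    # drop the last references to the GPU tensors
if device == 'cuda':
    torch.cuda.empty_cache()  # return cached blocks to the CUDA driver

print(Tx_np.shape)
```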

journeytosilius commented 1 year ago

Thanks for the info! I assume you mean `np_arr = Wx.cpu().detach().numpy()` to move to CPU, but how do we free `Wx` from the GPU memory?