[MNT] - Optimizations for curve fit

TomDonoghue commented 1 year ago

This is a series of small tweaks to the fit algorithm that should result in minor improvements in run speed.

Updates include:

a slightly more efficient stepping across params in gaussian function
removing unneeding 'ys' initialization in aperiodic fit functions
setting check_finite to False in curve_fit (since we do our own data checking)
expose curve_fit tolerance settings as a private setting, and set new default value that is a bit more liberal
provide precomputed Jacobian matrix, for the Guassian curve_fit call*

*I did add the Jacobians for the AP functions as well, but there was more mixed results whether this provided any speedup - notably these calls are already pretty fast, and so based on this post, I think for AP funcs we are often in a case in which the checks on the passed in Jacobian supercede any benefit as compared to computing it. Because of this, I did not actually set the Jacobian functions to be used in the aperiodic component curve_fit calls.

Results !?

Without being particularly rigorous about profiling, pretty quick / simple tests suggest this update provides something in the range of at least 5-20% speed up even in the simplest fitting cases (which is what I was expecting), but it has an unexpectedly outsized effects on more complex cases - and so in cases with knees and/or more peaks, noise etc - I'm seeing what looks like 2-3X speed increases! I didn't profile in too much detail exactly where this comes from, but I think a bit part is that when we start having multiple Gaussians estimating the Jacobian is slow, but it's actually quick and easy to pass in, and also the tolerance shift can have a notable impact (the other changes are less impactful).

In terms of outputs everything other than the tolerance levels change should have zero effect (meaning should get numerically exactly the same results out). The tolerance change can (though does not necessarily) change results - I've only seen it change beyond what should be notable resolution (eg. changing a peak bandwidth value by ~0.005 or so). If it's indeed limited to the this magnitude of impact, I think it's fine / worth the trade off. Note that the actual value I set (0.00001, replacing the otherwise default of 1e-8) is arbitrary - the higher this gets, more speed increase, but more chance of having more notably impacts. Also, this was added as a private setting, so one can also set it back to match previous default (fm._tol = 1e-8) to return to exactly the same as 1.0, or set a different value as needed.

Notes

I followed some pointers from this stack overflow post to try and optimize things.
The formulas for the Jacobian matrix calculations all come from Wolfram Alpha. As far as I can see they provide numerically exact answers to not passing in the jacobian and it being estimated, so the outputs are exactly the same.

Reviewing

There's not too much changed in the code to check - what would be good is if someone (@ryanhammonds perhaps) could do a quick check to replicate that a) this branch is at least a bit quicker than main, and b) you see no real change in output parameters. I'm also tagging @rdgao because you might be interested in these changes, and/or have some more knowledgeable insight over and above my heuristic updates here.

Addendum: what I learned about `curve_fit`

Doing some quick profiling, we spend basically all of our time in curve_fit, meaning it's the only real target for meaningful optimization, and within that, we spend the vast majority of our time doing the Gaussian fit (not really the guess procedure, but the curve_fit call itself). Why here? Well, it turns out with no bounds, curve_fit estimation method ends up as the Levenberg-Marquardt algorithm which is very fast - but does not support bounds. For peaks, we need bounds - adding those pushes the method to default to the Trust Region Reflective ('trf') which is way slower. Notably, there is another option for bounded estimations ('dogbox') - which is sometimes a bit faster, but not consistently enough to try and change it (both methods give the exact same results as far as I can see).

Implications of this:

adding bounds, in general, slows things down a lot. We can't avoid this for peaks. In general, we do not provide bounds for AP, and this is (somewhat counter-intuitively) much quicker than if we did, which is useful to know.
we spend almost all our fitting time within the curve_fit function, with no other meaningful targets for optimization. This PR does what I could figure out to optimize how we call curve_fit - but my sense beyond this is that we are hitting a limit of what we can speed up without looking for different / faster implementations of the fitting procedure itself (in particular with bounds).

ryanhammonds commented 1 year ago

This is awesome! Really cool to see nice optimizations here. In terms of:

a) this branch is at least a bit quicker than main, and b) you see no real change in output parameters

It may be nice to agree on a set of simulations (I attempted something similar here) that can be used to benchmark and validate performance of various optimizations, so we have some objective measure of choosing one setting over another. I can rerun these simulations to see how this PR compares to main, or we can update/improve the set of simulations first (maybe something like random uniforms from [lower_bound, upper_bound] for each param?).

For bounded optimization, I've had some luck using scipy's LBFGSB. I wonder how it compares to TRF.

TomDonoghue commented 1 year ago

@ryanhammonds - your work with Rust and comparisons are what got me thinking more about why we are so slow and trying this out!

In terms of having a basic set of comparisons for profiling, I totally agree! I've added a Github gist with the basic checks I was using here: https://gist.github.com/TomDonoghue/44d09697fe6b7f9a5b2eadf5fb695148

Basic idea is to check {fixed, knee} fits across {one-peak, multi-peak, high noise} examples. Having cleaned this notebook up a bit, the one / multi peak cases have what seem like useful but fairly modest improvements (order of 5-50% speed up), where as the high noise can be several times faster - this is, I think, because without setting a peak bound we can have many peaks, and estimating the Jacobian for such a large function gets really slow (note that this probably exaggerates the benefit in real cases, since in real cases one would want to bound the number of peaks anyways).

What you have for comparing is more detailed / better than my super simple things - so perhaps we should go with something more like that!

voytek commented 1 year ago

First of all this is great work, Tom!

But ugh what I think this means is that we should not only do what Ryan suggests of having some good simulations for benchmarking, but we should take it even further:

Create a set of simulations for benchmarking
Note all of the decision points in our curve-fitting (bounds, minimization methods, etc.)
Do a full parameter search across all approaches to see which sets of parameters give the fastest fit times under real conditions without compromising fit accuracy

Maybe not for this implementation, but if you two agree, please convert this comment to an issue and I'll save it for when we hire a full-time dev (hopefully this year!).

TomDonoghue commented 1 year ago

@voytek & @ryanhammonds - I do agree with the broader points here, though I'd like to suggest we don't miss the trees for the forest in terms of this PR as compared to bigger & broader plans :).

For this PR, if we can move forward with fairly standard code review, as long as we are confident that it offers some speedup, and is consistently accurate (the only more notably difference is everyone being on board with the tolerance changes noted above), then I hope we can merge this in, and I can merge it through to 2.0 and continue updates there.

For the broader perspective, I think much if not all of this is best organized around 2.0, since various things will change by then anyways (for example, benchmarking the broader set of fit functions, rather than the limited set in 1.X, and using the updated organization in relation to fully describing algorithm decision points). I think these two things (common benchmarking & more detailed algorithm description) are within scope to happen as we work towards 2.0. In terms of Brad's third point (full audit + explore new approaches), and Ryan's comment about LBFGSB then this is something we can keep in mind and consider after 2.0 if/when someone has time (and can describe in a new issue).

Does this all sound good? If so, the practical outcomes are to check the details of this PR, I'll keep track of some notes and plans here for anything at the 2.0 level, and we can open issues for longer term plans.

voytek commented 1 year ago

Oh for sure. I'm happy with this as is now, I just think we should—for 2.0—actually implement some of the above suggestions.

TomDonoghue commented 1 year ago

Thanks for the deep dive, and making things even faster @ryanhammonds! In terms of the refactoring, it looks greats - and I left replies to the comments above!

In terms of the timeit quirks - that's a great catch, and I have basically no idea what's going on here lol. My first thought was that maybe curve_fit specifically does some kind of caching (?). To poke at this, using %%timeit -n X -r X and exploring iterations/repeats of 1 or more with other functions (numpy / standard library), as far as I can tell every timed code block is a lot slower for the first iteration? I'm not sure if this is a timeit thing or a Jupyter thing, and it's also not clear how a 1 second sleep would reset things... (I also timed a 1 second time.sleep call, and it has +/- 1-3 ms variance, which is annoying). Despite this seeming like a pretty general property / quirk, I couldn't find anything useful about it. All in all, I dno what's happening and I don't know what to believe lol (though I would perhaps guess that the first iteration is perhaps an overestimate than subsequent iterations necessarily being underestimates...?).

FWIW - I was using %lprun as a diagnostic check for some of the work here (eg %lprun -f FOOOF.fit fm.fit(freqs, powers)), which times and profiles a single iteration giving time / percent per internal call. Though I didn't check total times extensively, using this is consistent in showing speed-ups in it's total time estimation, though this number seems to be a bit different from the timeit values (though perhaps because of profiling overhead).

That aside, I think regardless of some fuzziness of exactly which timeit reports to exactly believe, in relative terms, it's consistent that the current version with Jacobians is faster than main, and your update of the Jacobians is faster than what's currently on this branch, right? If so, I think we can move forward with this PR, and just accept that profiling is confusing and difficult lol.

codecov-commenter commented 1 year ago

Codecov Report

Merging #299 (aa64fe3) into main (d63aae0) will decrease coverage by 0.06%. The diff coverage is 100.00%.

:exclamation: Current head aa64fe3 differs from pull request most recent head 965f2d1. Consider uploading reports for the commit 965f2d1 to get more accurate results

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main     #299      +/-   ##
==========================================
- Coverage   97.63%   97.58%   -0.06%     
==========================================
  Files          86       88       +2     
  Lines        3890     3933      +43     
==========================================
+ Hits         3798     3838      +40     
- Misses         92       95       +3

Files Changed	Coverage Δ
fooof/core/funcs.py	`100.00% <100.00%> (ø)`
fooof/core/jacobians.py	`100.00% <100.00%> (ø)`
fooof/objs/fit.py	`93.87% <100.00%> (-0.74%)`	:arrow_down:
fooof/tests/core/test_jacobians.py	`100.00% <100.00%> (ø)`
fooof/tests/objs/test_fit.py	`100.00% <100.00%> (ø)`

ryanhammonds commented 1 year ago

I think the reason why it's slow for the first iteration is due to the use of memoization to reduce redundant computation, particularly at the guess params and bounds since the curve needs to evaluated at those points every time curve_fit is called with the same settings and it seems stores those results until garbage collection frees it (or maybe until the minpack or openblas process drops). Adding noise to guess and bounds removes the effect. This may be a consideration for future benchmarks since related reductions in runtime may not be present with real data if guess and bounds are expected to change.

Everything else looks good to go here though!

rdgao commented 1 year ago

had a quick scan, I have nothing insightful to add, I didn't even know you could speed up curve_fit by providing it with the Jacobian, nor the different optimization algos under the hood (dogbox is a great name though). But all this makes sense to me post-hoc.

Somewhat unrelated so not sure if this is the best place to discuss this, but since we're on the topic of many peaks slowing things down: I think the default max_n_peaks can be much smaller, like 3. Having talked to some people about the nitty gritty of hyperparameter choices recently, it seems like people sometimes set max_n_peaks to be much bigger than reasonable (and default is inf?).

With a lower default, the tangential benefit would be a speedup in the common use cases when less resources are dedicated to fitting ghost peaks (which of course wouldn't manifest in the speed tests here when actually setting max_n_peaks), but might lead to "better usage" of FOOOF overall.

voytek commented 1 year ago

@rdgao's comment goes back to my data-driven refactoring idea: let's just try all combos and choose the fastest among the subset that returns the correct fit.

But anyway, I agree about max_n_peaks being a sticking point. I understand the desire to force a default, but that's motivated by our a priori knowledge about neural data. Now that other fields (nuclear fusion, materials science) are adapting specparam for their needs, forcing n=3 or whatever probably doesn't make sense.

TomDonoghue commented 1 year ago

In terms of the code updates here - following @ryanhammonds work on optimizing the Gaussian Jacobian, I updated the implementation of the aperiodic function Jacobians - they do get ~ twice as fast, but as far as I can tell, using them still does not seem to give any speedup (curve_fit must be able to estimate these very well), so I still have not set them to be used (but they are available in the module).

In terms of @rdgao's comment on tweaking default parameters, I've opened an issue for discussion of any default parameter changes (https://github.com/fooof-tools/fooof/issues/303). My suggestion is that we should consider this for the 2.0 release, as it's a key time to change any defaults (I don't want to change any defaults here, in 1.1, as the idea is that running 1.0 vs 1.1 shouldn't change anything). Let's move / continue any discussion there (I've added some comments on max_n_peaks there).

In terms of the previous discussion of developing a more formal / systematic approach for benchmarking and trying out some of the other ideas that have come up, I've started this issue for that: https://github.com/fooof-tools/fooof/issues/304

Otherwise, back to this PR, I think we can call this done on code changes, and merge in from here - sound good @ryanhammonds , @voytek, @rdgao ?

rdgao commented 1 year ago

sounds good to me!

voytek commented 1 year ago

Yep, I'm good with this as well.

fooof-tools / fooof