SpinW / spinw

SpinW Matlab library for spin wave calculation
http://www.spinw.org
GNU General Public License v3.0

Pre-allocate memory for Sab and speed-up sw_timeit #125

Closed: RichardWaiteSTFC closed this 1 year ago

RichardWaiteSTFC commented 1 year ago

Small/uncontroversial performance improvements to spinwave:

1. Replace mmat with sw_mtimesx where mex is enabled (note: these changes are in branches that only affect incommensurate calculations).
2. Stop retrieving the default tid from the spinw preferences when a tid is supplied in the input to sw_timeit (profiling suggested this was causing a slow-down in spinwave, and it is a harmless change).
3. Pre-allocate memory for Sab.
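To illustrate why change 3 helps (this is an illustrative sketch, not SpinW's actual code, and the array shapes are arbitrary): growing an output array inside a loop forces a reallocation and copy on every iteration, whereas pre-allocating the full Sab-sized array once lets each iteration do a cheap in-place write. A minimal Python/NumPy analogue:

```python
import time
import numpy as np

n_hkl, n = 2000, 9  # hypothetical sizes: number of q-points, entries per block

# Growing the result on every iteration: repeated reallocation + copy
t0 = time.perf_counter()
grown = np.empty((n, 0))
for _ in range(n_hkl):
    grown = np.concatenate([grown, np.ones((n, 1))], axis=1)
t_grow = time.perf_counter() - t0

# Pre-allocated once: a single allocation, then cheap in-place writes
t0 = time.perf_counter()
prealloc = np.empty((n, n_hkl))
for i in range(n_hkl):
    prealloc[:, i] = 1.0
t_prealloc = time.perf_counter() - t0

assert np.array_equal(grown, prealloc)
print(f"grow: {t_grow:.4f}s  prealloc: {t_prealloc:.4f}s")
```

The same reasoning applies in MATLAB, where `zeros(...)` before a loop plays the role of `np.empty` here.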

There are various places where potentially large 2D arrays undergo matrix multiplication and where, in theory, sw_mtimesx could be used, for example https://github.com/SpinW/spinw/blob/35fccdd527ddb8cb70f684be8dbc7bb428b49056/swfiles/%40spinw/spinwave.m#L875, which is a known bottleneck. However, using sw_mtimesx there (inside the loop) actually caused a slowdown. (It would be great if the loop could be removed by using sw_mtimesx, but I don't think that is possible in this case.) As a rule of thumb, it seems sw_mtimesx is only worth it for N-dimensional arrays with N > 2 (i.e. where mmat would otherwise be used).
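A rough analogue of that mmat / sw_mtimesx trade-off (a sketch under assumed shapes, not SpinW code): for a stack of small matrices, one batched multiply over the whole stack avoids an interpreter-level loop over pages, which is where the page-wise mex routine pays off; for a single large 2D multiply there is no stack to batch over, so it adds only call overhead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pages, m = 500, 6  # hypothetical: many small 6x6 blocks, e.g. one per q-point
A = rng.standard_normal((n_pages, m, m))
B = rng.standard_normal((n_pages, m, m))

# Loop over pages: one small 2D multiply per iteration (interpreter overhead per page)
C_loop = np.empty_like(A)
for i in range(n_pages):
    C_loop[i] = A[i] @ B[i]

# One batched call over the whole stack (the sw_mtimesx-style approach):
# matmul broadcasts over the leading page axis
C_batched = A @ B

assert np.allclose(C_loop, C_batched)
```

In MATLAB terms, the batched call corresponds to multiplying the full N-D arrays in one go rather than slicing out each page.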

Below is a table of spinwave execution times in this PR and on main. For FMchain and the commensurate supercell calculation of BiFeO3 there is negligible speed-up (as expected), but for the incommensurate BiFeO3 calculation (with nExt=[1,1,1]) the speed-up with mex enabled is ~5-15% over the same calculation on main (although the incommensurate calculations already execute quite quickly... maybe I should add more q-points for the incommensurate calculation?).
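For reference, the mean(std) entries in the tables can be produced by timing repeated runs; a hypothetical sketch (not the actual benchmark script, and `n_repeats` is an assumption):

```python
import time
import statistics

def time_call(fn, n_repeats=5):
    """Run fn n_repeats times and return wall-clock seconds formatted
    like the benchmark tables, e.g. '3.1929e-01(1.1915e-02)'."""
    times = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    mean = statistics.mean(times)
    std = statistics.stdev(times)  # sample standard deviation over the repeats
    return f"{mean:.4e}({std:.4e})"

print(time_call(lambda: sum(range(100_000))))
```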


Times in seconds, mean(std):

| Test case | main | PR #125 |
| -- | -- | -- |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_0 | 3.1929e-01(1.1915e-02) | 3.0752e-01(2.1605e-02) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_10 | 4.3416e-01(1.6010e-02) | 3.7768e-01(6.0785e-03) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_5 | 4.0180e-01(2.3122e-02) | 3.7173e-01(3.9000e-02) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_0 | 3.6457e+02(2.1053e+00) | 3.6418e+02(2.7233e+00) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_10 | 1.7745e+02(9.4996e-01) | 1.7716e+02(1.7695e+00) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_5 | 2.1555e+02(1.3666e+00) | 2.1691e+02(3.3399e+00) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_0 | 4.0193e-01(2.6359e-02) | 3.6338e-01(4.5940e-03) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_10 | 4.2629e-01(4.0982e-02) | 3.2217e-01(1.9319e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_5 | 4.0840e-01(5.2889e-02) | 3.3762e-01(1.4502e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_0 | 3.8234e-01(1.7196e-02) | 3.5420e-01(1.3514e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_10 | 4.1838e-01(2.2458e-02) | 3.3446e-01(2.2666e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_5 | 3.9659e-01(4.0123e-02) | 3.0952e-01(1.4341e-02) |
|  |  |  |
| FMchain_mex_0_hermit_0_optmem_0 | 2.7201e+02(1.9793e+00) | 2.7310e+02(1.3435e+00) |
| FMchain_mex_0_hermit_0_optmem_10 | 2.7969e+02(2.2099e+00) | 2.8312e+02(6.0617e+00) |
| FMchain_mex_0_hermit_0_optmem_5 | 2.7276e+02(1.3241e+00) | 2.7465e+02(2.1455e+00) |
| FMchain_mex_0_hermit_1_optmem_0 | 1.9493e+02(5.7622e-01) | 1.9614e+02(8.0672e-01) |
| FMchain_mex_0_hermit_1_optmem_10 | 1.9588e+02(2.3288e-01) | 1.9545e+02(5.2716e-01) |
| FMchain_mex_0_hermit_1_optmem_5 | 1.9533e+02(1.1432e+00) | 1.9554e+02(1.5650e+00) |
| FMchain_mex_1_hermit_0_optmem_0 | 6.9014e+01(1.6683e-01) | 7.0000e+01(2.4902e-01) |
| FMchain_mex_1_hermit_0_optmem_10 | 6.5271e+01(8.8614e-01) | 6.4427e+01(8.0258e-01) |
| FMchain_mex_1_hermit_0_optmem_5 | 6.6100e+01(4.0403e-01) | 6.5784e+01(9.0951e-01) |
| FMchain_mex_1_hermit_1_optmem_0 | 3.7491e+01(1.3524e-01) | 3.7927e+01(4.5511e-01) |
| FMchain_mex_1_hermit_1_optmem_10 | 3.8745e+01(5.5424e-01) | 3.8318e+01(4.9142e-01) |
| FMchain_mex_1_hermit_1_optmem_5 | 3.8015e+01(5.3629e-01) | 3.8026e+01(2.2739e-01) |

github-actions[bot] commented 1 year ago

Test Results

4 files ±0, 94 suites ±0, 9m 17s duration (+1m 23s)
506 tests ±0: 506 passed ±0, 0 skipped ±0, 0 failed ±0
1 514 runs ±0: 1 514 passed ±0, 0 skipped ±0, 0 failed ±0

Results for commit ce94dc18. ± Comparison against base commit 35fccdd5.

This comment has been updated with latest results.

RichardWaiteSTFC commented 1 year ago

When I increase the number of q-points (from 30 to 1e4) for the incommensurate BiFeO3 calculation on IDAaaS, these changes seem to cause a slow-down. I'll do some cherry-picking to see which commit is the cause.


Times in seconds, mean(std):

| Test case | main | PR #125 |
| -- | -- | -- |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_0 | 1.8683e+01(2.6102e-01) | 1.9312e+01(4.4811e-01) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_10 | 1.8930e+01(2.4678e-01) | 1.9166e+01(4.5013e-01) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_5 | 1.8901e+01(4.5469e-01) | 1.9385e+01(2.6142e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_0 | 6.4221e+00(4.7905e-02) | 8.1439e+00(2.0542e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_10 | 6.7484e+00(1.0641e-01) | 7.9144e+00(1.4417e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_5 | 6.6507e+00(1.3011e-01) | 8.0329e+00(1.0516e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_0 | 5.7267e+00(1.9035e-01) | 7.2870e+00(1.6608e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_10 | 6.0577e+00(4.8642e-02) | 7.2956e+00(1.1060e-01) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_5 | 5.8612e+00(8.6024e-02) | 7.2355e+00(1.5990e-01) |

rebeccafair commented 1 year ago

These changes look nice and clean to me! And thanks a lot for the detailed explanation, it is very helpful!

I agree that the BiFeO3 incommensurate calculations are very short; it probably isn't worth focusing on the short cases (but we should keep an eye on them so we don't increase their run time by e.g. an order of magnitude). We should focus on the long execution times, i.e. the large-cell case or the many-q-point case.

Given that, it is concerning that increasing the number of q-points from 30 to 1e4 for the incommensurate BiFeO3 calculation gives a slowdown. Yes, we get a gain using 30 q-points, but it is only from 0.38 s to 0.35 s, which wouldn't provide a real benefit. We should check larger numbers of q-points too (even 1e4 q-points only needs ~6-8 seconds) and decide on that basis whether these changes are beneficial overall or not.

RichardWaiteSTFC commented 1 year ago

> Given that, it is concerning that increasing qpoints from 30 to 1e4 for incommensurate BiFeO3 calculation gives a slowdown, yes we get a gain using 30 qpoints but it is only from 0.38 -> 0.35 seconds which wouldn't provide a benefit. We should check for larger numbers of q-points too (even 1e4 q-points only needs ~6-8 seconds) and decide on that whether these changes are beneficial overall or not.

Yep, it's ~20 s for non-mex but ~6 s for mex; both are too short really, so I agree I need to increase the q-points further. We may want to optimise for many q-points, but there may be different performance issues for an incommensurate calculation with more magnetic atoms (nmag); the BiFeO3 incommensurate calculation has 6 magnetic atoms, which I think is probably representative of normal usage. What do you both think @rebeccafair @mducle ?

mducle commented 1 year ago

@RichardWaiteSTFC I think people would use ~200-1000 q-points for typical dispersion plot along high symmetry lines, and 1e4-1e5 q-points to simulate small INS spectra (1D cuts or small 2D slices). For larger slices or if they use the Monte Carlo resolution convolution you'd be looking at 1e7-1e8 q-points, whilst the full 4D dataset would be about 1e10 q-points, but we cannot currently simulate that.

So, I think the tests should use at least 1e3 q-points, and for "stress" testing maybe more, like 1e5? But we don't want it to run forever...

RichardWaiteSTFC commented 1 year ago

So I ran the incommensurate BiFeO3 test (without profiling) with mex enabled, on both main and this branch, on IDAaaS. The results would seem to indicate a significant speed-up for the hermit=true calculation (though MATLAB did crash on test 3/5 on the main branch, possibly due to memory?).


Times in seconds, mean(std):

| Test case | main | PR #125 |
| -- | -- | -- |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_0 | 9.9570e+01(7.7289e+00) | 8.6421e+01(9.3289e+00) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_0 | 2.7830e+02(2.2302e+01) | 8.6280e+01(1.3106e+01) |

But I can find no reason why the speed-up would only affect the hermit=true calculation, and indeed when I re-ran the calculation this morning the hermit=true and hermit=false calculations both ran in ~60 s on main, i.e. much quicker than last night. So I don't know how much to trust these numbers...

rebeccafair commented 1 year ago

I have had weird profiling results on IDAaaS before, perhaps it could be an inconsistent network/file system access issue?

I really am in two minds about these changes: we haven't really seen a significant speedup in any of the cases, and I'm not sure it's worth the risk of actually slowing things down, or the additional risk of crashes...

RichardWaiteSTFC commented 1 year ago

> I have had weird profiling results on IDAaaS before, perhaps it could be an inconsistent network/file system access issue?
>
> I really am in two minds about these changes, we haven't really seen a significant speedup in any of the cases, and I'm not sure if it's worth the risk of actually slowing things down or the additional risk of crashes...

Fair enough, though I have only seen the crashes on main... I think we could make the last two changes (2 and 3 above) and be confident we are not slowing down the code.

I thought the sw_mtimesx change would be safe, but there will be an overhead involved in calling the mex function, and I don't know how it affects memory usage. The current performance of the incommensurate BiFeO3 calculation seems acceptable (~1 min for 1e5 q-points), so I'm happy to ditch it!

mducle commented 1 year ago

> I think we could make the last two changes and be confident we are not slowing down the code.

I agree.

> I thought the sw_mtimesx change would be safe but there will be an overhead involved in calling the mex function and I don't know how it affects the memory usage.

Yeah, I'd thought it would be safe too, but I guess we need to investigate more why it's not. Given this, I'm not sure my intuition that the mex should use less memory is valid either...

RichardWaiteSTFC commented 1 year ago

I removed the first commit, so now only changes (2) and (3) from the PR description remain. I will update the PR title.