Small, uncontroversial performance improvements to `spinwave`:
1. Replace `mmat` with `sw_mtimesx` if mex is enabled. Note that the changes are in branches that only affect incommensurate calculations.
2. Stop retrieving the default `tid` from the spinw preferences when a `tid` is supplied in the input to `sw_timeit` (profiling suggests this lookup was causing a slow-down in `spinwave`, and the change is harmless).
3. Pre-allocate memory for `Sab`.
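
To illustrate the pre-allocation item above, here is a minimal NumPy sketch. It is not SpinW code: only the name `Sab` comes from this PR, while the array shape, the chunking and the `fill_block` helper are hypothetical. It contrasts writing into an array allocated once with growing the array by concatenation on every loop iteration.

```python
import numpy as np

def fill_block(n_mode, n_q, offset):
    """Hypothetical stand-in for the results of one chunk of q-points."""
    return np.full((3, 3, n_mode, n_q), float(offset))

def sab_preallocated(n_mode, chunk_sizes):
    """Allocate the full output once, then write each chunk into its slice."""
    n_total = sum(chunk_sizes)
    Sab = np.zeros((3, 3, n_mode, n_total))  # assumed shape, for illustration
    start = 0
    for i, n_q in enumerate(chunk_sizes):
        Sab[..., start:start + n_q] = fill_block(n_mode, n_q, i)
        start += n_q
    return Sab

def sab_grown(n_mode, chunk_sizes):
    """Grow the output on every iteration; each step copies the whole array."""
    Sab = np.zeros((3, 3, n_mode, 0))
    for i, n_q in enumerate(chunk_sizes):
        Sab = np.concatenate([Sab, fill_block(n_mode, n_q, i)], axis=3)
    return Sab
```

Both functions return the same array. The pre-allocated version avoids the repeated reallocation and copying, which is the general motivation for this kind of change.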
There are various places where potentially large 2D arrays undergo matrix multiplication and where, in theory, `sw_mtimesx` could be used, for example
https://github.com/SpinW/spinw/blob/35fccdd527ddb8cb70f684be8dbc7bb428b49056/swfiles/%40spinw/spinwave.m#L875
which is a known bottleneck. However, using `sw_mtimesx` there (inside the loop) actually caused a slowdown. It would be great if the loop could be removed by using `sw_mtimesx`, but I don't think that is possible in this case. As a rule of thumb, it seems `sw_mtimesx` is only worth using for ND arrays with N>2 (where `mmat` would otherwise be used)?
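
The distinction between a loop of 2D products and a single page-wise product over an ND array can be sketched as follows. This is a NumPy analogue only: `mmat` and `sw_mtimesx` are MATLAB functions and are not called here, the array sizes are arbitrary, and the sketch makes no claim about relative timings in SpinW.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stacks of 50 matrices with the page index on the last axis (MATLAB-style layout).
A = rng.standard_normal((6, 6, 50))
B = rng.standard_normal((6, 6, 50))

def loop_matmul(A, B):
    """Multiply page by page with an ordinary 2D product inside a loop."""
    out = np.empty((A.shape[0], B.shape[1], A.shape[2]))
    for k in range(A.shape[2]):
        out[:, :, k] = A[:, :, k] @ B[:, :, k]
    return out

def batched_matmul(A, B):
    """Multiply all pages in one call (sum over j, page index k kept)."""
    return np.einsum('ijk,jlk->ilk', A, B)

C_loop = loop_matmul(A, B)
C_batch = batched_matmul(A, B)
```

The two results agree. A batched call only helps when the loop itself can be replaced, which is why swapping the multiplication routine inside an existing loop over 2D arrays need not give any gain.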
Below is a table of the `spinwave` execution times for this PR and for main. For FMchain and the commensurate supercell calculation of BiFeO3 the speed-up is negligible (as expected), but for the incommensurate BiFeO3 calculation (with `nExt=[1,1,1]`) the speed-up with mex enabled is ~5-15% over the same calculation on main. The incommensurate calculations already execute quite quickly, so maybe I should add more q-points for them?

| | main | PR #125 |
| -- | -- | -- |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_0 | 3.1929e-01(1.1915e-02) | 3.0752e-01(2.1605e-02) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_10 | 4.3416e-01(1.6010e-02) | 3.7768e-01(6.0785e-03) |
| BiFeO3_mex_0_nExt_1_1_1_hermit_0_optmem_5 | 4.0180e-01(2.3122e-02) | 3.7173e-01(3.9000e-02) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_0 | 3.6457e+02(2.1053e+00) | 3.6418e+02(2.7233e+00) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_10 | 1.7745e+02(9.4996e-01) | 1.7716e+02(1.7695e+00) |
| BiFeO3_mex_1_nExt_0.01_hermit_0_optmem_5 | 2.1555e+02(1.3666e+00) | 2.1691e+02(3.3399e+00) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_0 | 4.0193e-01(2.6359e-02) | 3.6338e-01(4.5940e-03) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_10 | 4.2629e-01(4.0982e-02) | 3.2217e-01(1.9319e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_0_optmem_5 | 4.0840e-01(5.2889e-02) | 3.3762e-01(1.4502e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_0 | 3.8234e-01(1.7196e-02) | 3.5420e-01(1.3514e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_10 | 4.1838e-01(2.2458e-02) | 3.3446e-01(2.2666e-02) |
| BiFeO3_mex_1_nExt_1_1_1_hermit_1_optmem_5 | 3.9659e-01(4.0123e-02) | 3.0952e-01(1.4341e-02) |
| | | |
| FMchain_mex_0_hermit_0_optmem_0 | 2.7201e+02(1.9793e+00) | 2.7310e+02(1.3435e+00) |
| FMchain_mex_0_hermit_0_optmem_10 | 2.7969e+02(2.2099e+00) | 2.8312e+02(6.0617e+00) |
| FMchain_mex_0_hermit_0_optmem_5 | 2.7276e+02(1.3241e+00) | 2.7465e+02(2.1455e+00) |
| FMchain_mex_0_hermit_1_optmem_0 | 1.9493e+02(5.7622e-01) | 1.9614e+02(8.0672e-01) |
| FMchain_mex_0_hermit_1_optmem_10 | 1.9588e+02(2.3288e-01) | 1.9545e+02(5.2716e-01) |
| FMchain_mex_0_hermit_1_optmem_5 | 1.9533e+02(1.1432e+00) | 1.9554e+02(1.5650e+00) |
| FMchain_mex_1_hermit_0_optmem_0 | 6.9014e+01(1.6683e-01) | 7.0000e+01(2.4902e-01) |
| FMchain_mex_1_hermit_0_optmem_10 | 6.5271e+01(8.8614e-01) | 6.4427e+01(8.0258e-01) |
| FMchain_mex_1_hermit_0_optmem_5 | 6.6100e+01(4.0403e-01) | 6.5784e+01(9.0951e-01) |
| FMchain_mex_1_hermit_1_optmem_0 | 3.7491e+01(1.3524e-01) | 3.7927e+01(4.5511e-01) |
| FMchain_mex_1_hermit_1_optmem_10 | 3.8745e+01(5.5424e-01) | 3.8318e+01(4.9142e-01) |
| FMchain_mex_1_hermit_1_optmem_5 | 3.8015e+01(5.3629e-01) | 3.8026e+01(2.2739e-01) |