SpinW / spinw

SpinW Matlab library for spin wave calculation
http://www.spinw.org
GNU General Public License v3.0

Add new mex to run inner loop of spinwave.m in parallel #163

Closed: mducle closed this 5 months ago

mducle commented 8 months ago

The new mex file uses Eigen for linear algebra and standard C++11 threads. Eigen operations are numerically slightly different from the BLAS/LAPACK/MKL routines used by Matlab and the other mex files, so there will be small discrepancies. As such, the unit tests are set to use the old routines at the moment.
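For illustration, the chunked threading could look something like the sketch below (names like `parallel_for` and `nQ` and the chunking scheme are illustrative assumptions, not the actual swloop.cpp code):

    // Minimal sketch: split nQ Q-points into contiguous chunks, one chunk
    // per std::thread. Illustrative only -- not the actual swloop.cpp API.
    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    void parallel_for(std::size_t nQ, unsigned nThreads,
                      const std::function<void(std::size_t, std::size_t)>& work) {
        if (nThreads == 0) nThreads = 1;
        std::vector<std::thread> pool;
        const std::size_t chunk = (nQ + nThreads - 1) / nThreads;
        for (unsigned t = 0; t < nThreads; ++t) {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(lo + chunk, nQ);
            if (lo >= hi) break;
            pool.emplace_back(work, lo, hi);  // this thread handles Q-points [lo, hi)
        }
        for (auto& th : pool) th.join();
    }

Each thread then runs the per-Q diagonalisation and correlation-function code independently; the linear algebra inside each Q-point stays single-threaded in Eigen (see the note on large systems below).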

The C++ code in swloop.cpp should be a reproduction of the Matlab code in swloop.m (commented out) but there are some differences... In the Matlab code, the correlation functions are calculated with:

    zedExpF = zeros(2*nMagExt, 3);
    for i1 = 1:3
        % Stack [zed .* ExpF; conj(zed) .* ExpF] for Cartesian component i1
        zedExpF(:, i1) = transpose([zed(i1,:) .* ExpF, conj(zed(i1,:)) .* ExpF]);
    end
    VExp = zeros(3, 2*nMagExt, 1);
    for i1 = 1:3
        % Project onto the eigenvectors using the adjoint V'
        VExp(i1,:,:) = V' * zedExpF(:, i1);
    end
    for i1 = 1:(2*nMagExt)
        % 3x3 outer product gives the correlation matrix for each mode
        Sab(:,:,i1,jj) = VExp(:,i1) * VExp(:,i1)';
    end

But in C++ I had to use the equivalent of:

    zedExpF = zeros(2*nMagExt, 3);
    for i1 = 1:3
        % Conjugate order swapped: conj(zed) in the first block, zed in the second
        zedExpF(:, i1) = transpose([conj(zed(i1,:)) .* ExpF, zed(i1,:) .* ExpF]);
    end
    VExp = zeros(3, 2*nMagExt, 1);
    for i1 = 1:3
        % Plain transpose of V instead of the adjoint V'
        VExp(i1,:,:) = transpose(V) * zedExpF(:, i1);
    end
    for i1 = 1:(2*nMagExt)
        Sab(:,:,i1,jj) = VExp(:,i1) * VExp(:,i1)';
    end

(e.g. the order of the zed vector and its conjugate is swapped, and a plain transpose is used for the eigenvectors V instead of the adjoint (conjugate transpose)). Using the Matlab ordering verbatim in C++ produces a Sab matrix which is the transpose (equivalently, the complex conjugate, since each Sab is a Hermitian outer product) of the Matlab one; this not only fails the unit tests but would also produce the wrong intensities (and so fail the system tests).
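In Eigen, that second form could look roughly like the sketch below (a minimal illustration assuming complex-double matrices with zed of size 3 x nMagExt, ExpF of length nMagExt and V of size 2*nMagExt x 2*nMagExt; the function name and layout are my assumptions, not the actual swloop.cpp code):

    // Sketch of the swapped conjugate order and plain transpose of V
    // described above. Illustrative only.
    #include <vector>
    #include <Eigen/Dense>

    std::vector<Eigen::Matrix3cd> calcSab(const Eigen::MatrixXcd& zed,
                                          const Eigen::VectorXcd& ExpF,
                                          const Eigen::MatrixXcd& V) {
        const Eigen::Index n = zed.cols();  // nMagExt
        Eigen::MatrixXcd zedExpF(2 * n, 3);
        for (Eigen::Index i1 = 0; i1 < 3; ++i1) {
            // [conj(zed(i1,:)) .* ExpF; zed(i1,:) .* ExpF]: conjugate block first
            zedExpF.col(i1).head(n) =
                zed.row(i1).conjugate().transpose().cwiseProduct(ExpF);
            zedExpF.col(i1).tail(n) =
                zed.row(i1).transpose().cwiseProduct(ExpF);
        }
        // Plain transpose of V, not the adjoint V'
        Eigen::MatrixXcd VExp = (V.transpose() * zedExpF).transpose();  // 3 x 2n
        std::vector<Eigen::Matrix3cd> Sab(2 * n);
        for (Eigen::Index i1 = 0; i1 < 2 * n; ++i1) {
            Sab[i1] = VExp.col(i1) * VExp.col(i1).adjoint();  // 3x3 outer product
        }
        return Sab;
    }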

Finally, Eigen's eig/chol operations are serial, unlike MKL's, so they will be much slower for large systems. The new code therefore has a check and falls back to the old mex files for such systems (if nMagExt is larger than 250 it uses the old mex). This can be seen by running the BiFeO3 performance test with nExt=0.01, which gives nMagExt=768: there the new mex files take about 10x longer than the old ones (about 90 min vs 10 min). However, for the FMchain test the new mex file is about 10x faster than the old ones, because the old files spent much more time in the part of the loop outside the linear algebra, which was not parallelised before but is now.
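The size check itself is trivial; a sketch of the logic (illustrative only, since the actual dispatch between the old and new mex files happens on the Matlab side):

    // Illustrative sketch of the fallback described above.
    constexpr int kEigenSerialLimit = 250;  // nMagExt threshold quoted above

    bool useNewMex(int nMagExt) {
        // Eigen's eig/chol are single-threaded, so for large magnetic cells
        // the MKL-backed old mex wins despite its serial outer loop.
        return nMagExt <= kEigenSerialLimit;
    }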

Running the tests in the sw_mex file, I also get speed-ups compared to the old mex files, though not as large. For the old mex files on my i9-10885H 2.4 GHz 8-core laptop I get:

                Hermitian Mex   Hermitian NoMex  NonHermitian Mex  NonHermitian NoMex
Run Time(s)         91.846230        113.842470        130.997670          244.554882 (small model)
Run Time(s)         20.938848         22.510290         22.467369           34.369979 (medium model)
Run Time(s)         66.888651         72.357953         83.515923          326.277674 (large model)

(note: the Hermitian Mex medium-model calculation is flaky and about 50% of the time takes over 2 min instead of 20 s).

Whereas with the new mex file, I get:

                Hermitian Mex   Hermitian NoMex  NonHermitian Mex  NonHermitian NoMex
Run Time(s)         57.353621        112.197864         93.396113          242.738207 (small model)
Run Time(s)         10.010880         21.664538         12.770569           33.638017 (medium model)
Run Time(s)         36.645246         73.602794         63.599608          316.180931 (large model)

Running on a large DAaaS workspace with nthreads=30, the new mex gives:

                Hermitian Mex   Hermitian NoMex  NonHermitian Mex  NonHermitian NoMex
Run Time(s)         67.371210        116.213110        102.080685          245.567790 (small model)
Run Time(s)          7.198715         16.888088         11.547168           28.271158 (medium model)
Run Time(s)         21.887195         51.070703         52.435681          222.432965 (large model)

Some speed-ups implemented in a related PR on powder fitting should be ported back to this PR; this PR should then be merged and the powder-fitting PR refactored on top of it.

Whilst profiling a powder fit of an incommensurate model, the following bottlenecks were identified, where the percentage in brackets is the fractional run time. The fractional run time for swloop is 42%.

The following needs to be implemented:

github-actions[bot] commented 8 months ago

Test Results

    4 files, 106 suites, 16m 7s
    680 tests: 662 passed, 18 skipped, 0 failed
    1 890 runs: 1 854 passed, 36 skipped, 0 failed

Results for commit 46e6a864.

This comment has been updated with the latest results.

codecov-commenter commented 6 months ago

Codecov Report

Attention: Patch coverage is 53.84615%, with 60 lines in your changes missing coverage. Please review.

Project coverage is 40.61%. Comparing base (735d30a) to head (bc69619). Report is 2 commits behind head on master.

    Files                       Patch %     Lines
    swfiles/sw_mex.m              0.00%     58 Missing
    swfiles/@spinw/spinwave.m    96.66%     2 Missing
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #163      +/-   ##
==========================================
+ Coverage   40.51%   40.61%   +0.10%
==========================================
  Files         240      240
  Lines       15981    16091     +110
==========================================
+ Hits         6474     6535      +61
- Misses       9507     9556      +49
```


mducle commented 6 months ago

@RichardWaiteSTFC this PR is ready for review now. I've decided to implement the mex Sperp calculation in another PR because it requires extensive changes to spinwave.m and not just changing the swloop.cpp file.