MIT-LAE / TASOPT.jl

Medium fidelity aircraft-propulsion system design and optimization.
https://mit-lae.github.io/TASOPT.jl/
MIT License

Restructuring in blax() #58

Closed · ngomezve closed this 3 weeks ago

ngomezve commented 3 months ago

This PR addresses a potential issue on some processors, in which the LU decomposition in blax() could take several orders of magnitude longer than expected.

The LU matrix is also restructured to enable future performance improvements.
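For context, here is a minimal sketch (not the actual blax() code) of the two ways such a system can be factorized and solved in Julia; the tridiagonal matrix below is just a stand-in:

```julia
# Illustrative only: dense vs. sparse LU solve of the same placeholder system.
using LinearAlgebra, SparseArrays

n = 200
A_dense  = Matrix(Tridiagonal(rand(n - 1), rand(n) .+ 2.0, rand(n - 1)))
A_sparse = sparse(A_dense)      # same system, stored sparsely
b = rand(n)

x_dense  = lu(A_dense) \ b      # dense (LAPACK/BLAS-backed) LU solve
x_sparse = lu(A_sparse) \ b     # sparse (UMFPACK) LU solve
@assert x_dense ≈ x_sparse      # both factorizations give the same solution
```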

codecov-commenter commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 75.95%. Comparing base (78ba4b2) to head (9a1f258). Report is 5 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   76.41%   75.95%   -0.46%
==========================================
  Files          69       71       +2
  Lines       13169    13401     +232
==========================================
+ Hits        10063    10179     +116
- Misses       3106     3222     +116
```


askprash commented 3 months ago

Great that we are looking into this! How did you benchmark your code? I'm running into something strange as I try to benchmark...

I find that upstream/main has the following performance (benchmarked using BenchmarkTools on a Mac M2):

Benchmarking... blax
---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 3525 samples with 5 evaluations.
 Range (min … max):  1.509 ms …   9.381 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.672 ms ± 258.716 μs  ┊ GC (mean ± σ):  3.84% ± 8.09%

    ▁▆███▇▅▄▂                  ▁  ▁             ▁             ▁
  ▃▇██████████▇▅▅▄▄▄▄▄▄▄▄▄▆▇███████▆█▇▇▇▄▅▅▇▇▇█████▇██▆▇▅▅▅▅▄ █
  1.51 ms      Histogram: log(frequency) by time      2.32 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 

Whereas benchmarking this commit gives me:

---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 1439 samples with 5 evaluations.
 Range (min … max):  3.787 ms …   5.184 ms  ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.068 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.117 ms ± 220.747 μs  ┊ GC (mean ± σ):  1.42% ± 2.54%

        █▃▁       ▃                                            
  ▄▇██▆▅███▆█▇▇▇███▇▆▆▇▅▄▅▄▅▅▆▆█▆█▇█▆▆█▆▆▄▄▄▄▄▃▃▃▃▃▂▃▃▂▂▁▁▂▂▃ ▄
  3.79 ms         Histogram: frequency by time        4.71 ms <

 Memory estimate: 5.08 MiB, allocs estimate: 44881. 

I am surprised by the results: they seem to suggest that this commit is somehow significantly slower (~4 ms for de462ed vs ~1.6 ms for main).

My initial thought was that some unnecessary allocations were the culprit, like here: https://github.com/MIT-LAE/TASOPT.jl/blob/de462edce404a4d24b149940b4622e8aed00d9c4/src/aero/blax.jl#L357-L359

But replacing that with something like:

rsys[:] .= 0.0
nonzeros(asys) .= 0.0

does improve the performance, but it is still somehow slower than the dense system:

BenchmarkTools.Trial: 2695 samples with 5 evaluations.
 Range (min … max):  2.043 ms …   4.906 ms  ┊ GC (min … max): 0.00% … 9.60%
 Time  (median):     2.174 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.197 ms ± 154.466 μs  ┊ GC (mean ± σ):  1.44% ± 1.98%

     ▁▄██▂ ▂▄▅▃▃▂▃▂▁▁▂  ▁                                      
  ▃▄███████████████████▇█▇▇▆▇▅▆▆▄▄▄▄▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂ ▄
  2.04 ms         Histogram: frequency by time        2.59 ms <

 Memory estimate: 4.43 MiB, allocs estimate: 44781
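For reference, the in-place reset pattern above in a self-contained form (the size and sparsity below are placeholders, not the actual blax() system):

```julia
# Sketch of the zero-in-place pattern: preallocate once, then reuse the storage
# on every iteration instead of rebuilding the arrays.
using LinearAlgebra, SparseArrays, BenchmarkTools

n = 200
asys = sprand(n, n, 0.05) + I   # placeholder sparse system matrix
rsys = zeros(n)                 # placeholder right-hand-side vector

function reset_system!(asys, rsys)
    rsys .= 0.0                 # zero the RHS without allocating a new vector
    nonzeros(asys) .= 0.0       # zero stored entries, keeping the sparsity pattern
    return nothing
end

@btime reset_system!($asys, $rsys)   # should report zero allocations
```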

Any thoughts? I'll investigate more later.

askprash commented 3 months ago

Also, looking at this part of the code now makes me cringe (and I'm the one to blame); let's just say there is a lot of cleanup that can be done here. 🧹

ngomezve commented 3 months ago

I profiled wsize() using Profile and ProfileView. In particular, I did this for a hydrogen aircraft sizing case, where blax() is called in every sizing loop. I also benchmarked the entire wsize(), and it was significantly faster after the change.
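For anyone repeating this, a rough sketch of that profiling workflow (the function below is only a stand-in for the actual wsize() call):

```julia
using LinearAlgebra, Profile
# using ProfileView           # optional flame-graph viewer for the collected samples

expensive_call() = sum(inv(rand(300, 300)) for _ in 1:50)   # stand-in for wsize()

expensive_call()               # run once first so compilation is not profiled
Profile.clear()
@profile expensive_call()      # collect samples while the call runs
Profile.print(maxdepth = 12)   # text report; ProfileView.view() shows a flame graph
```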

askprash commented 2 months ago

I did a little more digging here, and it seems like this comes down to the BLAS library that LinearAlgebra uses (OpenBLAS by default) and the number of threads that the library defaults to.

Default setup on hex

On the Intel Xeon processors, the difference in performance between using 1 CPU and 24 is drastic, and I think this is because the default number of threads that BLAS tries to use is half the total number of CPUs, even if that many threads are not actually available.

| CPUs | median time |
| :----: | :---: |
| 1 | 8.6 s |
| 4 | 5.0 s |
| 16 | 1.8 s |
| 24 | 5.1 ms |

In all these cases it looks like BLAS.get_num_threads() returns 12.

Enforcing BLAS.set_num_threads(1) gives vastly different performance from the above, roughly constant at ~3.5 ms:

Performance specifying a single thread

| CPUs | median time |
| :----: | :---: |
| 1 | 3.5 ms |
| 4 | 3.5 ms |
| 16 | 3.5 ms |
| 24 | 3.5 ms |
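For anyone reproducing these numbers, the relevant calls are just the following (a sketch; the backend reported depends on the machine):

```julia
using LinearAlgebra

BLAS.get_config()         # which BLAS backend libblastrampoline is forwarding to
BLAS.get_num_threads()    # the backend's default thread count
BLAS.set_num_threads(1)   # pin BLAS to a single thread before calling blax()/wsize()
```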

Repeating this exercise on an AMD node is consistently faster, with median times of about 2.7 ms across the CPU counts tested (1, 4, 16, 24). Interestingly, in all these cases the sparse calculations are slower, at ~4.8 ms.

Other BLAS implementations (MKL)

I found some discussion about this on Julia Discourse (this post in particular was very interesting). One of the things I tried was using other BLAS libraries. The MKL library has a default num_threads that seems to be better tuned: BLAS.get_num_threads() returns the number of cores, and blax.jl takes roughly the same time independent of the number of cores, comparable to the OpenBLAS result (~3.2 ms).
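For reference, swapping in MKL is just a package load (a sketch, assuming MKL.jl is installed; it has to be loaded before the heavy linear algebra runs):

```julia
using MKL                 # swaps the BLAS backend via libblastrampoline
using LinearAlgebra

BLAS.get_config()         # should now list MKL instead of OpenBLAS
BLAS.get_num_threads()    # MKL's default thread count on this machine
```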

I also tried to test this on my M2 Mac with the AppleAccelerate.jl BLAS implementation, which supposedly takes advantage of the Apple Accelerate framework. It seems like our matrix is probably too small to really benefit from any of this:

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.dylib

BenchmarkTools.Trial: 4083 samples with 5 evaluations.
 Range (min … max):  1.332 ms …  10.070 ms  ┊ GC (min … max): 0.00% … 18.14%
 Time  (median):     1.389 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.443 ms ± 274.095 μs  ┊ GC (mean ± σ):  2.85% ±  5.56%

     ▂▄▆██▇▅▃▂▁               ▁▁▁▂▂▂▃▂▂▂▁▂▁▁                  ▂
  ▅█████████████▇▇█▇▆▅▅▅▅▄▃▅▆█████████████████▆▆▆▆▅▅▅▁▄▅▅▄▅▃▄ █
  1.33 ms      Histogram: log(frequency) by time      1.78 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241.

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ LP64] Accelerate
└ [ILP64] Accelerate

BenchmarkTools.Trial: 4758 samples with 5 evaluations.
 Range (min … max):  1.163 ms …  2.120 ms  ┊ GC (min … max): 0.00% … 38.15%
 Time  (median):     1.225 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.260 ms ± 88.886 μs  ┊ GC (mean ± σ):  3.04% ±  6.10%

      ▂▃▄▅▇██▆▄▂                          ▁▂▂▃▂▂▂▂▂▁▁ ▁      ▂
  ▃▅▇▇████████████▇▅▁▃▃▃▃▄▃▁▁▁▁▁▁▁▁▆▄▆▆▇▆█████████████████▇▇ █
  1.16 ms      Histogram: log(frequency) by time     1.52 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 
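The Accelerate comparison above was set up the same way, just by loading the package before benchmarking (assuming AppleAccelerate.jl is installed; macOS only):

```julia
using AppleAccelerate     # forwards BLAS/LAPACK to Apple's Accelerate framework
using LinearAlgebra

BLAS.get_config()         # should now list Accelerate instead of libopenblas64_
```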

Way forward?

Perhaps we should just set BLAS.set_num_threads(1) as the default (and document this) and stick to the dense matrix calculations, since they are about 40% faster (I still need to understand why the sparse version isn't as fast).

ngomezve commented 2 months ago

Thanks for looking into this @askprash.

I agree that keeping the dense operation is the way forward for now; I'll update the PR with that. I'll keep the restructuring of the matrix into a tetradiagonal matrix with a block, as it does not appear to make a difference in allocations, memory, or speed, but it may enable other methods in the future.

ngomezve commented 2 months ago

Upon further testing, it appears that BLAS.set_num_threads(1) only gives a performance improvement when it is called from the script that runs TASOPT, rather than from the TASOPT source directly. As a workaround, I added an __init__ function to the aero module; this seems to work.
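For reference, a minimal sketch of that workaround (the module name and layout here are illustrative, not the actual TASOPT source):

```julia
module AeroSketch   # stand-in for the actual aero module

using LinearAlgebra

function __init__()
    # Runs whenever the enclosing package is loaded, so user scripts get
    # single-threaded BLAS without calling set_num_threads themselves.
    BLAS.set_num_threads(1)
end

end # module
```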