ProjectTorreyPines / TJLF.jl

Tglf in Julia Learned from Fortran (TJLF)
https://projecttorreypines.github.io/TJLF.jl/dev
Apache License 2.0

Speed problem multi-threading #8

Closed TimSlendebroek closed 4 months ago

TimSlendebroek commented 10 months ago

@orso82 @dannysun91

Setting up the Julia run with 8 threads:

1 grid point

@time actor_tjlf = FUSE.ActorTGLF(dd,act);
  2.257127 seconds (585.37 k allocations: 350.613 MiB, 3.27% gc time, 1.07% compilation time)

TGLF
  3.480 s (35975 allocations: 3.73 MiB)

2 grid points

@time actor_tjlf = FUSE.ActorTGLF(dd,act);
  3.013535 seconds (1.19 M allocations: 747.034 MiB, 2.03% gc time)

4 grid points

@time actor_tjlf = FUSE.ActorTGLF(dd,act);
  7.237286 seconds (2.28 M allocations: 1.420 GiB, 1.37% gc time)

8 grid points

@time actor_tjlf = FUSE.ActorTGLF(dd,act);
120.840445 seconds (4.48 M allocations: 2.724 GiB, 0.12% gc time)

🤣 Right now the multi-threading is done with Threads.@threads for idx = eachindex(input_tjlfs)

Test case used:

ini,act = FUSE.case_parameters(:ITER,init_from=:scalars, boundary_from=:scalars);
act.ActorEquilibrium.model = :TEQUILA

dd = FUSE.init(ini,act);
plot(dd.equilibrium)

act.ActorTGLF.model = :TJLF
println("TJLF")
@time actor_tjlf = FUSE.ActorTGLF(dd,act);
orso82 commented 10 months ago

@TimSlendebroek can you try placing the @time inside of the threaded for loop?
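
Something like this, as a minimal sketch of the loop described above (solve_tjlf is a hypothetical placeholder for the per-point TJLF call, not the actual ActorTGLF code):

# Time each iteration of the threaded loop separately, so the per-point cost can
# be separated from the threading overhead. `input_tjlfs` is the vector of TJLF
# inputs mentioned above; `solve_tjlf` is a hypothetical placeholder.
function timed_threaded_run(input_tjlfs, solve_tjlf)
    results = Vector{Any}(undef, length(input_tjlfs))
    Threads.@threads for idx in eachindex(input_tjlfs)
        @time results[idx] = solve_tjlf(input_tjlfs[idx])
    end
    return results
end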

TimSlendebroek commented 10 months ago

Looking at it with Brendan, we are tentatively concluding that it has to do with the memory allocations being on the large side. Bringing them down improved the time from 120 s to 50 s, but this is still bad.
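
For reference, a generic sketch of the kind of allocation reduction being tried here: each thread reuses a preallocated scratch buffer instead of allocating inside the hot loop. The names nbasis and compute! are placeholders, not the actual TJLF internals.

# Preallocation sketch (placeholder names, not TJLF internals):
# one scratch buffer per thread, reused across iterations.
function run_with_buffers(inputs, nbasis::Int, compute!)
    buffers = [zeros(ComplexF64, nbasis, nbasis) for _ in 1:Threads.nthreads()]
    results = Vector{Any}(undef, length(inputs))
    Threads.@threads :static for idx in eachindex(inputs)  # :static keeps tasks pinned to threads
        buf = buffers[Threads.threadid()]                   # safe to index by threadid with :static
        results[idx] = compute!(buf, inputs[idx])
    end
    return results
end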

TimSlendebroek commented 10 months ago

@orso82 adding in the @time:

julia> @time actor_tjlf = FUSE.ActorTGLF(dd,act);
 76.209007 seconds (1.07 M allocations: 1.585 GiB, 0.11% gc time, 0.14% compilation time)
 77.415374 seconds (1.10 M allocations: 1.633 GiB, 0.11% gc time, 0.05% compilation time)
 77.705742 seconds (1.11 M allocations: 1.636 GiB, 0.11% gc time, 0.32% compilation time)
 77.940029 seconds (1.13 M allocations: 1.662 GiB, 0.11% gc time, 0.18% compilation time)
 78.253505 seconds (1.16 M allocations: 1.704 GiB, 0.11% gc time, 0.09% compilation time)
 78.421187 seconds (1.18 M allocations: 1.727 GiB, 0.12% gc time)
 78.709262 seconds (1.19 M allocations: 1.751 GiB, 0.12% gc time, 0.27% compilation time)
 79.005604 seconds (1.54 M allocations: 1.775 GiB, 0.15% gc time, 0.59% compilation time: 20% of whic

TimSlendebroek commented 9 months ago

On saga, which arguably has more memory:

@time actor_tjlf = FUSE.ActorTGLF(dd,act);
 43.534075 seconds (1.61 M allocations: 2.161 GiB, 1.57% gc time, 0.12% compilation time)

Still very slow compared to asyncmap TGLF: 4.910830 seconds (468.04 k allocations: 32.694 MiB, 0.62% gc time, 6.80% compilation time)
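
If, as is typical for such comparisons, each TGLF case is an external executable run, asyncmap works well because every case becomes its own OS process with its own memory and its own BLAS. A rough sketch under that assumption (tglf_exe and case_dirs are hypothetical placeholders, not the FUSE API):

# Each TGLF case runs as a separate OS process; the parallelism does not go
# through Julia threads at all.
function run_tglf_cases(tglf_exe::String, case_dirs::Vector{String})
    return asyncmap(case_dirs; ntasks=8) do dir
        run(Cmd(`$tglf_exe`; dir=dir))   # external Fortran executable in its own process
    end
end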

TimSlendebroek commented 9 months ago

Idea: use eigvals / eigvecs from LinearAlgebra instead of the Fortran libraries.

dannysun91 commented 9 months ago

eigvals / eigvecs only solve the standard eigenvalue problem, not the generalized eigenvalue problem. There is eigen() in the LinearAlgebra library, but that solves for both the eigenvalues and the eigenvectors at once.
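
For concreteness, a small sketch of the two call patterns side by side, on random complex test matrices (not the TJLF matrices): the standard problem via eigvals, and the generalized problem via the LAPACK.ggev! wrapper that gets benchmarked later in this thread.

using LinearAlgebra: BLAS, LAPACK, eigvals

# Standard problem A x = λ x vs generalized problem A x = λ B x.
n = 100
A = rand(ComplexF64, n, n)
B = rand(ComplexF64, n, n)

λ_std = eigvals(A)                                 # standard eigenvalue problem

gg = LAPACK.ggev!('N', 'N', copy(A), copy(B))      # ggev! overwrites its inputs, hence the copies
λ_gen = gg[1] ./ gg[2]                             # generalized eigenvalues = alpha ./ beta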

orso82 commented 9 months ago

I asked ChatGPT whether hyperthreading could indeed be the issue, and whether that could explain what we are seeing when running multiple instances of TGLF vs. multi-threading in TJLF.

=========

If you're observing that running multiple instances of the same program scales well, while a single instance with multiple threads doesn't, it might be due to several factors:

  1. Operating System Scheduling: Operating systems are generally very good at managing multiple processes. When you run multiple instances of a program, each instance is treated as a separate process. The OS can efficiently allocate CPU time to these processes, balancing the load across the physical and logical cores.

  2. Resource Contention within a Single Process: In a multi-threaded application, threads share the same process space, meaning they share resources like memory. This can lead to contention, especially if the threads access shared data or resources. This doesn't happen with separate processes (or happens less), as each process has its own memory space.

  3. Hyperthreading and Core Utilization: If your CPU supports hyperthreading, each physical core can run multiple threads simultaneously. However, hyperthreading is more beneficial for mixed workloads. In a CPU-intensive task like eigenvalue computation, hyperthreading might not offer significant benefits as all threads are vying for the same core resources.

  4. Cache Utilization: In a single multi-threaded process, all threads share the same cache, which can lead to cache contention and thrashing. Separate processes have separate caches, reducing contention.

  5. Differences in Parallelization Overhead: Managing threads within a single process, especially for complex tasks like eigenvalue computations, can have overhead due to synchronization, context switching, and managing shared resources. Separate processes do not have this overhead as they are independent of each other.

  6. NUMA (Non-Uniform Memory Access) Considerations: If your system is NUMA-enabled, memory access patterns could affect performance. Separate processes might be better optimized by the OS to handle NUMA effects than a single multi-threaded process.

Understanding these differences can help in optimizing your application. Profiling and analyzing both scenarios (single multi-threaded instance vs multiple processes) can provide insights into where the bottlenecks or inefficiencies lie in your multi-threaded application.

orso82 commented 9 months ago

This may be relevant

https://discourse.julialang.org/t/multithreaded-lapack-function-in-a-threads-threads-loop/45205/3

https://carstenbauer.github.io/ThreadPinning.jl/dev/explanations/blas/
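
The short version of both links, as a sketch: when the outer loop is already parallelized with Threads.@threads, BLAS/LAPACK should usually be restricted to one thread per call so the two levels of threading do not oversubscribe the cores.

using LinearAlgebra: BLAS

# Restrict BLAS/LAPACK to a single thread per call when Julia threads provide
# the outer-loop parallelism (the usual advice in the linked discussions):
BLAS.set_num_threads(1)

@show Threads.nthreads()      # Julia-level threads (e.g. started with julia -t 8)
@show BLAS.get_num_threads()  # BLAS/LAPACK threads per call, now 1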

orso82 commented 9 months ago

Also note that in Julia v1.10, eigvals/eigen(A, cholesky(B)) now computes the generalized eigenvalues (and, with eigen, also the eigenvectors) of A and B via Cholesky decomposition for positive definite B. Note: the second argument is the output of cholesky.
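
A minimal sketch of that path, assuming Julia ≥ 1.10 and a positive definite B (which may or may not hold for the TJLF matrices):

using LinearAlgebra

# Generalized eigenvalues via Cholesky; requires B to be positive definite.
n = 100
A = rand(n, n); A = A + A'               # symmetric test matrix
B = rand(n, n); B = B * B' + n * I       # symmetric positive definite test matrix

F = cholesky(Hermitian(B))               # the second argument must be a cholesky factorization
λ = eigvals(A, F)                        # generalized eigenvalues of (A, B)
E = eigen(A, F)                          # E.values and E.vectors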

orso82 commented 9 months ago

https://github.com/JuliaLang/julia/issues/49455

@dannysun91 can you please try to install MKL and see how things change? https://github.com/JuliaLinearAlgebra/MKL.jl
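
For reference, swapping in MKL is just a package load (standard MKL.jl usage, shown here as a sketch):

# After adding the package (] add MKL), loading it switches the BLAS/LAPACK
# backend from OpenBLAS to MKL via libblastrampoline.
using MKL
using LinearAlgebra: BLAS

BLAS.get_config()    # should now report MKL (libmkl_rt) instead of libopenblas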

dannysun91 commented 9 months ago
+--------------+---------------+----------+
| BLAS Threads | Julia Threads | @btime   |
+--------------+---------------+----------+
| 1            | 8             | 15.949 s |
+--------------+---------------+----------+
| 2            | 8             | 17.326 s |
+--------------+---------------+----------+
| 1            | 4             | 16.062 s |
+--------------+---------------+----------+
| 2            | 4             | 16.913 s |
+--------------+---------------+----------+
| 1            | 2             | 18.380 s |
+--------------+---------------+----------+
| 2            | 2             | 25.536 s |
+--------------+---------------+----------+
| 1            | 1             | 28.495 s |
+--------------+---------------+----------+
| 2            | 1             | 31.166 s |
+--------------+---------------+----------+

Running 7 radial points on my machine with 2 cores... BLAS threads = 1 is almost always faster.

1 Julia Thread
BLAS = 1, matrix = 100 x 100
  0.006673 seconds (2 allocations: 78.172 KiB)
BLAS = 2, matrix = 100 x 100
  0.003576 seconds (2 allocations: 78.172 KiB)
BLAS = 1, matrix = 1000 x 1000
  0.587828 seconds (2 allocations: 7.629 MiB, 90.85% gc time)
BLAS = 2, matrix = 1000 x 1000
  0.063542 seconds (2 allocations: 7.629 MiB)
BLAS = 1, matrix = 10000 x 10000
 58.265402 seconds (2 allocations: 762.939 MiB, 0.02% gc time)
BLAS = 2, matrix = 10000 x 10000
 42.438390 seconds (2 allocations: 762.939 MiB, 1.10% gc time)

2 Julia Threads, do AxA 4 times
BLAS = 1, matrix = 100 x 100
  0.000379 seconds (33 allocations: 314.250 KiB)
BLAS = 2, matrix = 100 x 100
  0.001246 seconds (34 allocations: 314.281 KiB)
BLAS = 1, matrix = 1000 x 1000
  0.177107 seconds (34 allocations: 30.519 MiB)
BLAS = 2, matrix = 1000 x 1000
  0.193362 seconds (34 allocations: 30.519 MiB)
BLAS = 1, matrix = 10000 x 10000
196.602635 seconds (34 allocations: 2.980 GiB, 0.14% gc time)
BLAS = 2, matrix = 10000 x 10000
117.752954 seconds (33 allocations: 2.980 GiB, 0.23% gc time)

@time calls of an AxA matrix multiplication; the run with 2 Julia threads does the multiplication 4 times. It appears that you get some speed-up with BLAS = 2, but not 2x.

1 Julia Thread
BLAS = 1, matrix = 100 x 100
  0.004739 seconds (9 allocations: 35.141 KiB)
BLAS = 2, matrix = 100 x 100
  0.006415 seconds (9 allocations: 35.141 KiB)
BLAS = 1, matrix = 1000 x 1000
  5.573633 seconds (9 allocations: 344.641 KiB)
BLAS = 2, matrix = 1000 x 1000
  6.346902 seconds (9 allocations: 344.641 KiB)

2 Julia Threads, do ggev!('N','N',A,B) 4 times
BLAS = 1, matrix = 100 x 100
  0.011220 seconds (77 allocations: 767.531 KiB)
BLAS = 2, matrix = 100 x 100
  0.017336 seconds (77 allocations: 767.531 KiB)
BLAS = 1, matrix = 1000 x 1000
 11.490549 seconds (77 allocations: 62.383 MiB, 2.46% gc time)
BLAS = 2, matrix = 1000 x 1000
 11.757032 seconds (77 allocations: 62.383 MiB, 0.07% gc time)

The speed-up is not observed with ggev!(), and if anything BLAS = 1 is faster. In TJLF, the multi-threading bottleneck occurred in the ggev!() call (when BLAS = 2). The matrices passed to ggev!() are on the order of ~100 x 100.
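
For reproducibility, a rough sketch of the kind of timing done above, on random complex test matrices (the exact matrices and sizes in TJLF will differ):

using LinearAlgebra: BLAS, LAPACK

# Time ggev! for a given matrix size and BLAS thread count.
# ggev! overwrites its arguments, so fresh copies are used for every call.
function time_ggev(n::Int, blas_threads::Int; repeats::Int=4)
    BLAS.set_num_threads(blas_threads)
    A = rand(ComplexF64, n, n)
    B = rand(ComplexF64, n, n)
    println("BLAS = $blas_threads, matrix = $n x $n")
    @time for _ in 1:repeats
        LAPACK.ggev!('N', 'N', copy(A), copy(B))
    end
end

time_ggev(100, 1)
time_ggev(100, 2)
time_ggev(1000, 1)
time_ggev(1000, 2)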

dannysun91 commented 9 months ago
+--------------+---------------+----------+
| BLAS Threads | Julia Threads | @btime   |
+--------------+---------------+----------+
| 1            | 4             | 6.800 s  |
+--------------+---------------+----------+
| 2            | 4             | 7.017 s  |
+--------------+---------------+----------+
| 1            | 2             | 8.627 s  |
+--------------+---------------+----------+
| 2            | 2             | 8.390 s  |
+--------------+---------------+----------+
| 1            | 1             | 13.209 s |
+--------------+---------------+----------+
| 2            | 1             | 15.309 s |
+--------------+---------------+----------+

Using MKL on 7 radial points.