dsondak / optRBC

Optimal solutions in 2D Rayleigh-Bénard convection.

OpenMP #10

Open · michaelneuder opened this issue 3 years ago

michaelneuder commented 3 years ago

A few timing results

    num threads, timing of loop 3 (s)
    1, 0.92242285818792880
    2, 0.66375376703217626
    4, 0.66562559804879129
    8, 0.67282426310703158

michaelneuder commented 3 years ago

I should note that the timings above are for Nx=1600 and Ny=1275
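
For reference, here is roughly how a loop timing like this can be measured with OpenMP's wall-clock timer (omp_get_wtime). The loop body below is just a placeholder at the same problem size, not the actual loop 3 from the code:

    program time_loop
       use omp_lib
       implicit none
       integer, parameter :: dp = kind(1.0d0)
       integer, parameter :: n  = 1600*1275   ! same problem size as above
       real(dp), allocatable :: a(:), b(:)
       real(dp) :: t0, t1
       integer  :: i

       allocate(a(n), b(n))
       b = 1.0_dp

       t0 = omp_get_wtime()
       ! thread count controlled via OMP_NUM_THREADS
       !$OMP PARALLEL DO schedule(static)
       do i = 1, n
          a(i) = 2.0_dp*b(i) + 1.0_dp   ! stand-in for the real loop body
       end do
       !$OMP END PARALLEL DO
       t1 = omp_get_wtime()

       print *, 'loop time (s): ', t1 - t0
    end program time_loop

(Compiled with -fopenmp; the numbers above come from timing the real solver loop this way, not from this toy.)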

michaelneuder commented 3 years ago

OK, after coming back to this, my timing results are looking much better! I wonder if the specific compute node I am on is faster, or if I was doing something wrong before. Either way, we are now seeing better speedups!

Here is a little table:

    threads, timing (s), speedup
    1, 0.52083734888583422, 1
    2, 0.44693268812261522, 1.16
    4, 0.37362689198926091, 1.39
    8, 0.23520854488015175, 2.21

Past this point, increasing the thread count doesn't help.

michaelneuder commented 3 years ago

Unfortunately, I am having trouble getting reproducible results. When I connect to the academic cluster, I now get the performance below for the exact same experiment as in the previous comment.

    threads, timing (s), speedup
    1, 0.51881068991497159, 1
    2, 0.45359245501458645, 1.14
    4, 0.57991017214953899, 0.89
    8, 0.57748107495717704, 0.90

This is discouraging: without reproducibility, we can't properly test or benchmark our implementations, so we need to sort this out ASAP.

michaelneuder commented 3 years ago

OK, some really great results! I switched to AWS to sanity-check what was happening with the performance and ran the same experiment with much better success. On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime! Phew, this is a huge relief.

    threads, timing (s), speedup
    1, 0.44721634100005758, 1
    2, 0.22189785400007622, 2.01
    4, 0.11443732100008219, 3.91
    8, 0.0568768589999, 7.86

Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial, we have a runtime of 3.93 s, but with 8 threads I am getting 2.33 s, which is a 1.69x speedup! Pretty awesome for a single parallel loop.
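
As a back-of-the-envelope sanity check (my arithmetic, not a measurement): Amdahl's law gives overall speedup = 1 / ((1 - p) + p/N), where p is the fraction of the serial time step spent in the parallelized region and N is the thread count. A 1.69x overall speedup at N = 8 corresponds to p ≈ 0.47, i.e. roughly half the serial time step would be in this region, which is plausible if the loop executes more than once per time step (e.g. once per stage).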

michaelneuder commented 3 years ago

I am also parallelizing loop 5 in the calc_explicit call. With 8 threads, the timing drops from 0.11551769900006548 s to 0.015365126999995482 s, a 7.5x speedup. That brings the overall timing of a single iteration down to 1.98 s, which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!

michaelneuder commented 3 years ago

Running for 50 iterations in serial, we have the following timing output:

    real    3m34.483s
    user    3m29.935s
    sys     0m3.846s

and for the parallel run:

    real    1m51.155s
    user    3m38.685s
    sys     0m0.887s

So we are seeing a 214 s / 111 s = 1.93x speedup in wall-clock (real) time (user time is about 3.5 minutes in both cases because the total CPU work is essentially unchanged; it is just spread across threads). I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.

michaelneuder commented 3 years ago

OK, I made each loop in calc_explicit parallel and got the overall runtime down to:

    real    1m39.510s
    user    3m46.794s
    sys     0m0.849s

So we have a 214 s / 99 s = 2.16x speedup!

michaelneuder commented 3 years ago

OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.

   !$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
   do it = 1,Nx
      !$OMP CRITICAL
      ! Solve for v
      call calc_vi(tmp_uy, phi(:,it))
      uy(:,it) = tmp_uy
      ! Solve for u
      if (kx(it) /= 0.0_dp) then
         !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
         ux(:,it) = CI*d1y(tmp_uy)/kx(it)
      else if (kx(it) == 0.0_dp) then
         ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
      end if
      !$OMP END CRITICAL
   end do
   !$OMP END PARALLEL DO

This is my sanity check for the loop iterations being independent: each iteration is wrapped in a critical region, so they run one at a time, just in a random order. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means the loop iterations are not independent, but I can't quite make out why. It seems like tmp_uy should be the only variable that needs to be private. I believe the problem is coming from the calc_vi call, because if I make phi thread-private, the code doesn't explode, but the Nusselt number is off by a small amount, which makes sense because phi really should be a shared variable. But since phi is only accessed at the phi(:,it) slice, it seems like each iteration should be independent.
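
One classic Fortran pitfall that produces exactly this symptom is hidden state inside the called routine: a module-level or SAVEd work array (or an initialized local, which implies SAVE) persists between calls, so the result depends on the order the iterations run in, even when every call is serialized by a critical region. I don't know yet whether calc_vi (or d1y) actually does this; the names and arrays below are made up just to illustrate the pattern to look for:

    module vi_mod
       implicit none
       integer, parameter :: dp = kind(1.0d0)
       ! Hypothetical hidden state: a module-level array that persists
       ! between calls and makes the result depend on call order.
       real(dp) :: work(4) = 0.0_dp
    contains
       subroutine calc_vi_like(uy_col, phi_col)
          real(dp), intent(out) :: uy_col(:)
          real(dp), intent(in)  :: phi_col(:)
          work   = work + phi_col   ! carries values from previous calls
          uy_col = work
       end subroutine calc_vi_like
    end module vi_mod

    program order_dependence_demo
       use vi_mod
       implicit none
       real(dp) :: phi(4,3), uy(4,3)
       integer  :: it
       phi = 1.0_dp
       ! Every iteration is serialized by the critical region, but the
       ! final uy still depends on which iteration happened to run first,
       ! second, third -- exactly the symptom above.
       !$OMP PARALLEL DO num_threads(4) schedule(dynamic)
       do it = 1, 3
          !$OMP CRITICAL
          call calc_vi_like(uy(:,it), phi(:,it))
          !$OMP END CRITICAL
       end do
       !$OMP END PARALLEL DO
       print *, uy
    end program order_dependence_demo

If calc_vi and everything it calls really are pure functions of their arguments, then the critical-section test above should be deterministic and the problem is somewhere else.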

michaelneuder commented 3 years ago

I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing great performance results:

    Nx=4800, Ny=3825 on m4.4xlarge (16 cores), running 16 threads
    serial overall timing:   46.162012183000115 s
    parallel overall timing: 5.7642300109998814 s
    speedup = 8.008x
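
By the same Amdahl's-law estimate as earlier (again just my arithmetic), an 8.0x speedup on 16 threads corresponds to p ≈ 0.93, i.e. roughly 93% of the serial runtime is now covered by parallel regions.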