michaelneuder opened 3 years ago
I should note that the timings above are for Nx=1600 and Ny=1275.
Ok, after coming back to this, my timing results are looking much better! I wonder if the specific compute node I am on is faster, or if I was doing something wrong before. Either way, we are now seeing better speedups!
Here is a little table:
threads, timing (s), speedup
1, 0.52083734888583422, 1
2, 0.44693268812261522, 1.16
4, 0.37362689198926091, 1.39
8, 0.23520854488015175, 2.21
Generally, increasing the thread count past this point isn't actually useful.
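For context, here is a minimal sketch of how a per-loop timing like the table above can be collected: wrap the loop in omp_get_wtime() calls and rerun with different values in the num_threads clause. The grid size matches the Nx=1600, Ny=1275 run above, but the arrays and loop body are placeholders, not the actual solver code.

program time_loop
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: Nx = 1600, Ny = 1275
  real(dp), allocatable :: a(:,:), b(:,:)
  real(dp) :: t0, t1
  integer :: i, j

  allocate(a(Ny,Nx), b(Ny,Nx))
  a = 1.0_dp

  t0 = omp_get_wtime()
  !$OMP PARALLEL DO num_threads(8) private(j)
  do i = 1, Nx
     do j = 1, Ny
        b(j,i) = 2.0_dp*a(j,i)   ! stand-in for the real per-column work
     end do
  end do
  !$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  print *, 'loop time (s): ', t1 - t0
end program time_loop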
Ok, I am having trouble getting reproducible results, unfortunately. Now when I connect to the academic cluster, I get the performance below for the exact same experiment as in the previous comment.
threads, timing (s), speedup
1, 0.51881068991497159, 1
2, 0.45359245501458645, 1.14
4, 0.57991017214953899, 0.89
8, 0.57748107495717704, 0.90
This is discouraging: without reproducibility, we can't properly test or benchmark our implementations, so we need to sort this out ASAP.
Ok, some really great results! I switched to AWS to sanity-check what was happening with the performance and ran the same experiment with much better success. On a t2.2xlarge instance, I was able to get nearly perfect linear scaling for the loop 3 runtime!! Phew, this is a huge relief.
threads, timing (s), speedup
1, 0.44721634100005758, 1
2, 0.22189785400007622, 2.01
4, 0.11443732100008219, 3.91
8, 0.0568768589999, 7.86
Even though this is a single loop, speeding it up dramatically impacts the overall runtime of the code. For a single time step in serial we have a runtime of 3.93 s, but with 8 threads I am getting 2.33 s, which is a 1.69x speedup! Pretty awesome for a single parallel loop.
I am also parallelizing loop 5 in the calc_explicit call. With 8 threads, that loop's timing drops from 0.11551769900006548 s to 0.015365126999995482 s, a 7.5x speedup, and the overall timing of a single iteration drops to 1.98 s, which gives an overall speedup of 3.93/1.98 = 1.98x for the entire code!
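The directive itself is nothing fancy: a loop-level parallel do over the x index, where each iteration reads and writes only its own column. Here is a minimal sketch with hypothetical names (explicit_step, expl, nl, diff), standing in for the actual calc_explicit loop bodies:

! Minimal sketch of the loop-level pattern (hypothetical names, not the
! actual calc_explicit code): each iteration writes only its own column.
subroutine explicit_step(expl, nl, diff, Nx)
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer,  intent(in)  :: Nx
  real(dp), intent(in)  :: nl(:,:), diff(:,:)
  real(dp), intent(out) :: expl(:,:)
  integer :: it

  !$OMP PARALLEL DO num_threads(8) schedule(static)
  do it = 1, Nx
     expl(:,it) = nl(:,it) + diff(:,it)   ! column-local work only
  end do
  !$OMP END PARALLEL DO
end subroutine explicit_step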
Running for 50 iterations in serial, we have the following timing output:
real    3m34.483s
user    3m29.935s
sys     0m3.846s
and for parallel:
real    1m51.155s
user    3m38.685s
sys     0m0.887s
So we are seeing a 214 s / 111 s = 1.93x speedup. I also verified correctness by checking the Nusselt numbers, and they are identical for both runs.
Ok, I made each loop in calc_explicit parallel and got the overall runtime down to
real    1m39.510s
user    3m46.794s
sys     0m0.849s
So we have a 214 s / 99 s = 2.16x speedup!
OK, I am trying to parallelize other parts of the main loop besides calc_explicit, and am running into some weird behavior. It can be boiled down to the example below.
!$OMP PARALLEL DO num_threads(8) private(tmp_uy) schedule(dynamic)
do it = 1,Nx
   !$OMP CRITICAL
   ! Solve for v
   call calc_vi(tmp_uy, phi(:,it))
   uy(:,it) = tmp_uy
   ! Solve for u
   if (kx(it) /= 0.0_dp) then
      !ux(:,it) = -CI*d1y(tmp_uy)/kx(it)
      ux(:,it) = CI*d1y(tmp_uy)/kx(it)
   else if (kx(it) == 0.0_dp) then
      ux(:,it) = cmplx(0.0_dp, 0.0_dp, kind=C_DOUBLE_COMPLEX) ! Zero mean flow!
   end if
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
This is my sanity check for the loop iterations being independent: each iteration is wrapped in a critical region, so the iterations run in a random order, but only one at a time. Yet this actually breaks the code, and the Nusselt number quickly explodes into a NaN. I believe this means the loop iterations are not independent, but I can't quite make out why, since tmp_uy should be the only variable that needs to be private. I believe the problem is coming from the calc_vi call, because if I make phi a thread-private variable, the code doesn't explode, but the Nusselt number is off by a small amount, which makes sense because phi really should be a shared variable. But since phi is only accessed through the phi(:,it) slice, it seems like each iteration should be independent?
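To make the independence argument concrete, here is a self-contained sketch of the pattern that should be safe, with hypothetical arrays and a trivial stand-in for calc_vi (not the real solver): each iteration reads and writes only its own column of the shared arrays through a private temporary, so the serial and parallel results should agree exactly.

program column_independence
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: Nx = 64, Ny = 32
  real(dp) :: shared_in(Ny,Nx), out_serial(Ny,Nx), out_parallel(Ny,Nx)
  real(dp) :: tmp(Ny)
  integer :: it

  call random_number(shared_in)

  ! serial reference
  do it = 1, Nx
     tmp = 2.0_dp*shared_in(:,it)        ! trivial stand-in for calc_vi
     out_serial(:,it) = tmp
  end do

  ! parallel version: each iteration touches only column it of the shared arrays
  !$OMP PARALLEL DO private(tmp) schedule(dynamic)
  do it = 1, Nx
     tmp = 2.0_dp*shared_in(:,it)
     out_parallel(:,it) = tmp
  end do
  !$OMP END PARALLEL DO

  print *, 'max difference: ', maxval(abs(out_serial - out_parallel))
end program column_independence

If the real loop produces NaNs while a pattern like this is safe, the coupling between iterations is presumably hiding inside calc_vi itself rather than in the phi(:,it) slicing.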
I added a lot more parallelism today, including in the x direction of stages 1-3. We are seeing great performance results:
Nx=4800, Ny=3825 on m4.4xlarge (16 cores), running 16 threads
serial overall timing: 46.162012183000115 (s)
parallel overall timing: 5.7642300109998814 (s)
speedup = 8.008x
A few timing results:
num threads, timing of loop 3 (s)
1, 0.92242285818792880
2, 0.66375376703217626
4, 0.66562559804879129
8, 0.67282426310703158