manfredo89 opened this issue 6 years ago
@manfredo89
Hi Marcos, for point n°2 I did not understand well; I'll test commenting out the last part. For point n°1 I think it's probably both. The fact that the CPU is underutilized signals that some threads carry more computational burden than others, and hence the rest are left waiting at the synchronization point. Still, it's hard to square this with the choice of allometry, and with the fact that within the same simulation there are such large differences.
Hi guys, I recently had a discussion with @mpaiao, @femeunier and @larsonej about the performance of ED. We have all experienced serious slowdowns, but at least for me the cause is still unclear. In my case most of the trouble started when I switched to IALLOM=3: as seen in Fig. 1, IALLOM=2 has very reasonable times compared to IALLOM=3 (Fig. 4).
Fig. 1: 300 years of simulation with 24 patches on a 24-core CPU, with IALLOM=2
I am not sure why IALLOM=3 does so much worse than IALLOM=2. It is a bit of a surprise, because what the different allometries boil down to is just different parameter values. Also, when using IALLOM=3 the CPU time can vary by more than an order of magnitude within the same run (see Fig. 4).
The VTune profiler seems to point at load imbalance in the rk4 solver. This is an OpenMP loop that runs the per-patch operations in parallel. Comparing IALLOM=2 (dark blue) and IALLOM=3 (light blue) shows that for IALLOM=2 substantially more time is spent with more cores active (this is a 12-patch simulation run on 12 cores).
Fig. 2: CPU usage in the OpenMP region for IALLOM=2 (dark blue) and IALLOM=3 (light blue)
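For anyone not familiar with that part of the code, here is a minimal sketch of what such a per-patch loop looks like; `integrate_patch` and the patch count are hypothetical stand-ins, not ED2's actual names. The point is the implicit barrier at the end of the parallel region: every thread idles there until the slowest patch finishes.

```fortran
! Schematic sketch (not ED2's actual code) of a per-patch OpenMP loop like
! the one in the rk4 driver. The "end parallel do" carries an implicit
! barrier, so all threads wait for the slowest patch.
program patch_loop_sketch
   implicit none
   integer, parameter :: npatches = 12
   integer :: ipa

   !$omp parallel do schedule(static)
   do ipa = 1, npatches
      call integrate_patch(ipa)   ! hypothetical stand-in for the rk4 call
   end do
   !$omp end parallel do          ! implicit barrier: every thread waits here

contains

   subroutine integrate_patch(ipa)
      integer, intent(in) :: ipa
      ! placeholder: a patch with many cohorts would do far more work here
      print '(a,i0)', 'integrated patch ', ipa
   end subroutine integrate_patch

end program patch_loop_sketch
```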
Looking at the hotspot analysis, the time difference appears to be due to OpenMP barriers: at least one thread takes much longer than the others, so the remaining cores have to wait.
Fig. 3: Caller/callee analysis with time differences for IALLOM=2 and IALLOM=3
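One cheap way to confirm this without VTune would be to time each patch inside the loop and compare the fastest against the slowest. The sketch below is illustrative only (the dummy workload and names are made up); in ED2 the equivalent would be wrapping the per-patch rk4 call with `omp_get_wtime`.

```fortran
! Illustrative sketch: per-patch wall-time instrumentation to expose load
! imbalance. The dummy workload grows with ipa, mimicking patches of
! uneven cost; real code would time the rk4 integration instead.
program patch_timing_sketch
   use omp_lib
   implicit none
   integer, parameter :: npatches = 12
   real(8) :: t0, x, t_patch(npatches)
   integer :: ipa, i

   !$omp parallel do private(t0, i, x) schedule(static)
   do ipa = 1, npatches
      t0 = omp_get_wtime()
      x  = 0.d0
      do i = 1, ipa * 1000000
         x = x + sqrt(dble(i))
      end do
      t_patch(ipa) = omp_get_wtime() - t0
      if (x < 0.d0) print *, x   ! keeps the compiler from dropping the loop
   end do
   !$omp end parallel do

   print '(a,es10.3,a,es10.3)', 'fastest patch [s]: ', minval(t_patch), &
                                '   slowest patch [s]: ', maxval(t_patch)
end program patch_timing_sketch
```

Note that with exactly one patch per thread (e.g. 24 patches on 24 cores) `schedule(dynamic)` cannot help, since each thread gets one patch regardless; the imbalance would then have to come from patches of very unequal cost, which would also fit the large within-run variability.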
Initially I thought that one possible reason was that the maximum number of allowed patches is actually a soft limit; that is, during the simulation you can in fact have more patches than the maximum you allowed. For example, if you set MAXPATCH=24 and run on 24 cores you should have the ideal distribution of work, but if it turns out that there are 25 patches, you are forced to run a second pass for the one spare patch. So I plotted NPATCHES and the CPU time in two separate graphs to see whether they were correlated. Unfortunately, looking at Fig. 4 and Fig. 5 it is clear that this is not the root of the issue (at least not the main one; I am sure it could still be a source of performance degradation when everything else is fine).
Fig. 4: CPU time per month of output with 24 patches on 24 cores
Fig. 5: NPATCHES during the same run as Fig. 4
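Still, the soft-limit effect is worth quantifying, since it bounds what patch-count overflow alone could cost. A back-of-the-envelope sketch, assuming an even static split of patches over threads (which may not be exactly what ED2 does):

```fortran
! Back-of-the-envelope check of the soft-limit hypothesis: with a static
! one-patch-per-thread split, wall time scales with ceiling(npatches/nthreads),
! so 25 patches on 24 cores cost two full passes (~52% parallel efficiency).
program patch_balance
   implicit none
   integer :: npatches, nthreads, passes
   real    :: efficiency

   npatches   = 25
   nthreads   = 24
   passes     = (npatches + nthreads - 1) / nthreads      ! integer ceiling -> 2
   efficiency = real(npatches) / real(nthreads * passes)  ! 25/48 ~ 0.52

   print '(a,i0,a,f5.2)', 'passes = ', passes, ', efficiency = ', efficiency
end program patch_balance
```

So one spare patch could at worst roughly double the wall time, which is noticeable but cannot explain the order-of-magnitude swings in Fig. 4, consistent with the conclusion above.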
To resolve the problem, @mpaiao suggested changing
```fortran
w_diam   = 1.d-2 * cbrt8(dble(dbh2vol(height,dbh,ipft)))
tow_diam = 5.d-2 * cbrt8(dble(dbh2vol(height,dbh,ipft)))
```
at https://github.com/EDmodel/ED2/blob/master/ED/src/dynamics/canopy_struct_dynamics.f90#L4272. To be honest, I am not sure what the effect of this change would be, but after testing it I noticed that it does not change much.
So, to get to the point: does anyone have a clue why this is happening? And in general, are you seeing the same performance, and with which allometry/setup?