Closed: marchdf closed this issue 7 months ago
I've run a few simulations with 300+ sampling groups, and I've definitely noticed a big slowdown when saving out all that data. This would be a nice feature to see added!
This is also a problem for the ABL statistics output. All of the data gets sent back to one rank for output, and it becomes a bottleneck.
In one example, I'm setting `ABL.stats_output_frequency = 4`, and every 4th timestep you see this massive increase in the time required for post:
```
WallClockTime: 800 Pre: 0.000309 Solve: 0.6511 Post: 15.6 Total: 16.29
WallClockTime: 801 Pre: 0.000324 Solve: 0.6709 Post: 0.0219 Total: 0.6931
WallClockTime: 802 Pre: 0.000309 Solve: 0.68 Post: 0.0446 Total: 0.7248
WallClockTime: 803 Pre: 0.000308 Solve: 0.6528 Post: 0.0213 Total: 0.6744
WallClockTime: 804 Pre: 0.000307 Solve: 0.6692 Post: 14.5 Total: 15.19
WallClockTime: 805 Pre: 0.000314 Solve: 0.6588 Post: 0.0322 Total: 0.6913
WallClockTime: 806 Pre: 0.000313 Solve: 0.6424 Post: 0.0202 Total: 0.6629
WallClockTime: 807 Pre: 0.000314 Solve: 0.6429 Post: 0.013 Total: 0.6562
WallClockTime: 808 Pre: 0.000313 Solve: 0.6588 Post: 14.4 Total: 15.05
WallClockTime: 809 Pre: 0.000321 Solve: 0.6566 Post: 0.0271 Total: 0.6841
WallClockTime: 810 Pre: 0.000307 Solve: 0.6729 Post: 0.0263 Total: 0.6995
WallClockTime: 811 Pre: 0.000308 Solve: 0.6724 Post: 0.026 Total: 0.6988
WallClockTime: 812 Pre: 0.000306 Solve: 0.6418 Post: 14.5 Total: 15.15
```
This is a case on Frontier, so it's unclear whether things are particularly worse when moving data through the GPUs, but it is an issue for production runs.
@lawrenceccheung can you post the file? Assuming it's just a pure ABL...
I used this input file here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/MedWS_LowTI/MedWS_LowTI_Offshore_Stable_Layout_20x20.inp I ran it on 200 nodes/1600 GPUs on Frontier. There's nothing special about it; it's just a single-level ABL problem. It is bigger than other cases I ran on Frontier, so maybe that is why the ABL stats post is so egregious, but I've seen the same thing on other small cases.
We are usually interested in the statistics only at the end of the sampling period and may not require writing the data to file so frequently. We could consider adding an input option for the frequency of computing the statistics, so that they are not computed at every time step (as they are currently). This can usually speed up the convergence of the mean and variance, since consecutive time steps are strongly correlated.
There's a bit of a chicken-and-egg problem here: sometimes, in order to determine whether an ABL has suitably converged to the right statistics (and thus end the ABL run), we need the ABL statistics output reliably and frequently enough.
However, just to be clear, the calculation of the mean temperature and velocity profiles is not an issue, since that is already done at every timestep and I don't see a performance penalty there. It's the calculation of the higher-order statistics and the zi height, and the subsequent output to netcdf files, that is very slow: https://github.com/Exawind/amr-wind/blob/7291737434ca339ecc765355eab88ddd529ff68f/amr-wind/wind_energy/ABLStats.cpp#L209-L231
Agreed with Lawrence. That's what I was targeting when creating this issue.
I poked around with changing the output format to `native` instead of `netcdf`, and there seems to be a significant improvement when more cores are involved on my Mac. I suspect a big part of this is the IO being faster in `native` mode, since it's not bottlenecked by the single `IOProcessor` node the way the `netcdf` format is. I am wondering whether that might be a solution for some users to avoid the egregious slowdown with sampling enabled. My example cases below use at most 8 cores, but the `Post:` times should be a good indicator of the IO bottlenecks.
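For reference, switching a sampler to the `native` format is a small change in the input file. A sketch, assuming a post-processing label named `sampling` (the label and frequency here are illustrative, taken from the demo case below):

```
incflo.post_processing    = sampling
sampling.output_format    = native      # instead of netcdf
sampling.output_frequency = 2
```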
Quick example times from a demo case running on my MacBook (sampling output frequency = 2).

`netcdf` with 4 cores:
```
534:WallClockTime: 15 Pre: 0.0154 Solve: 6.515 Post: 0.0902 Total: 6.621
550:WallClockTime: 16 Pre: 0.0156 Solve: 6.41 Post: 0.502 Total: 6.928
566:WallClockTime: 17 Pre: 0.0153 Solve: 6.81 Post: 0.0917 Total: 6.917
582:WallClockTime: 18 Pre: 0.0157 Solve: 6.645 Post: 0.545 Total: 7.205
```
`netcdf` with 8 cores:
```
534:WallClockTime: 15 Pre: 0.00802 Solve: 4.207 Post: 0.0475 Total: 4.263
550:WallClockTime: 16 Pre: 0.00879 Solve: 4.316 Post: 0.671 Total: 4.996
566:WallClockTime: 17 Pre: 0.00946 Solve: 4.529 Post: 0.0449 Total: 4.583
582:WallClockTime: 18 Pre: 0.00891 Solve: 4.17 Post: 0.685 Total: 4.865
```
Note how the `Post` times with 8 cores are slower than with 4 cores. This disappears when using the `native` output format, which is both faster and does not slow down as the number of cores increases.
`native` with 4 cores:
```
534:WallClockTime: 15 Pre: 0.0159 Solve: 6.473 Post: 0.0862 Total: 6.575
550:WallClockTime: 16 Pre: 0.0156 Solve: 6.489 Post: 0.473 Total: 6.977
566:WallClockTime: 17 Pre: 0.0159 Solve: 6.555 Post: 0.0873 Total: 6.658
582:WallClockTime: 18 Pre: 0.0166 Solve: 6.505 Post: 0.456 Total: 6.978
```
`native` with 8 cores:
```
534:WallClockTime: 15 Pre: 0.00857 Solve: 4.031 Post: 0.0471 Total: 4.087
550:WallClockTime: 16 Pre: 0.0121 Solve: 3.931 Post: 0.395 Total: 4.338
566:WallClockTime: 17 Pre: 0.00828 Solve: 4 Post: 0.0613 Total: 4.069
582:WallClockTime: 18 Pre: 0.0148 Solve: 3.866 Post: 0.397 Total: 4.278
```
Note, however, that this will not directly help with the ABLStats, where the currently available output formats are just `netcdf` and `ascii`. I need to poke around more to see whether the majority of the slowdown there is from the IO (vs. communications/computations).
Upon testing the ABLStats with and without the IO (I added a `return` before the `netcdf` IO starts), the majority of the time seems to come from the computations.
I am inclined to close this issue, given the `native` writer seems to fix the sampling outputs and the bottlenecks with the ABLStats are not IO bound. Feel free to reopen if you have other things to try or if you notice a major slowdown with the `native` output format on larger runs.
I noticed that I can go from 1.7 s per step to 7 s per step on the steps where there is sampling IO (netcdf). This is not good. Only the IO proc does the sampling output, which is a bottleneck. We might be able to fix that by distributing each sampler's IO to a different proc.