Exawind / amr-wind

AMReX-based structured wind solver
https://exawind.github.io/amr-wind

Speed up sampling IO (netcdf) #971

Closed: marchdf closed this issue 7 months ago

marchdf commented 7 months ago

I noticed that I can go from 1.7s per step to 7s per step on the steps where there is sampling IO (netcdf). This is not good. Only the IO proc does the sampling output, which is a bottleneck. We might be able to fix that by distributing each sampler's IO to a different proc.
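A minimal sketch of that idea, purely illustrative and outside the amr-wind API: each sampler group is written by a different MPI rank in round-robin fashion instead of everything funneling through one IO processor. The plain-text writer below stands in for the real netcdf writer.

```cpp
// Illustrative only: distribute per-group sampling output across MPI ranks.
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Stand-in for the per-group netcdf write; here it just dumps values to a text file.
void write_group(int group_id, const std::vector<double>& data)
{
    std::ofstream out("sampling_group_" + std::to_string(group_id) + ".txt");
    for (double v : data) out << v << "\n";
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int ngroups = 300;                  // e.g. 300+ sampling groups
    std::vector<double> fake_data(1024, 1.0); // placeholder sampled values

    // Round-robin assignment: group g is written by rank g % nprocs, so the
    // output cost is spread across ranks instead of piling up on rank 0.
    for (int g = 0; g < ngroups; ++g) {
        if (g % nprocs == rank) write_group(g, fake_data);
    }

    MPI_Finalize();
    return 0;
}
```

For real netcdf output the per-rank writes would presumably need to go to separate files (or through a parallel netcdf/HDF5 layer), since a single serial netcdf file cannot safely have concurrent writers.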

rybchuk commented 7 months ago

I've run a few simulations with 300+ sampling groups, and I've definitely noticed a big slowdown when saving out all that data. This would be a nice feature to see added!

lawrenceccheung commented 7 months ago

This is also a problem for the ABL statistics output. All of the data gets sent back to one rank for output, and it becomes a bottleneck.

In one example, I'm setting ABL.stats_output_frequency=4, and every 4th timestep you see this massive increase in the time required for post:

WallClockTime: 800 Pre: 0.000309 Solve: 0.6511 Post: 15.6 Total: 16.29
WallClockTime: 801 Pre: 0.000324 Solve: 0.6709 Post: 0.0219 Total: 0.6931
WallClockTime: 802 Pre: 0.000309 Solve: 0.68 Post: 0.0446 Total: 0.7248
WallClockTime: 803 Pre: 0.000308 Solve: 0.6528 Post: 0.0213 Total: 0.6744
WallClockTime: 804 Pre: 0.000307 Solve: 0.6692 Post: 14.5 Total: 15.19
WallClockTime: 805 Pre: 0.000314 Solve: 0.6588 Post: 0.0322 Total: 0.6913
WallClockTime: 806 Pre: 0.000313 Solve: 0.6424 Post: 0.0202 Total: 0.6629
WallClockTime: 807 Pre: 0.000314 Solve: 0.6429 Post: 0.013 Total: 0.6562
WallClockTime: 808 Pre: 0.000313 Solve: 0.6588 Post: 14.4 Total: 15.05
WallClockTime: 809 Pre: 0.000321 Solve: 0.6566 Post: 0.0271 Total: 0.6841
WallClockTime: 810 Pre: 0.000307 Solve: 0.6729 Post: 0.0263 Total: 0.6995
WallClockTime: 811 Pre: 0.000308 Solve: 0.6724 Post: 0.026 Total: 0.6988
WallClockTime: 812 Pre: 0.000306 Solve: 0.6418 Post: 14.5 Total: 15.15

This is a case on Frontier, so it's unclear whether things are particularly worse when moving data through the GPUs, but it is an issue for production runs.

Lawrence
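(For reference, the ABL statistics output in a case like this is controlled by input-file options along the following lines. ABL.stats_output_frequency is the option quoted above; the output-format option name is an assumption and should be checked against the amr-wind documentation.)

```
# Illustrative ABL statistics output controls
ABL.stats_output_frequency = 4        # quoted above: stats written every 4th step
ABL.stats_output_format    = netcdf   # assumed name; netcdf or ascii per the thread below
```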

marchdf commented 7 months ago

@lawrenceccheung can you post the file? Assuming it's just a pure ABL...

lawrenceccheung commented 7 months ago

I used this input file: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/MedWS_LowTI/MedWS_LowTI_Offshore_Stable_Layout_20x20.inp I ran it on 200 nodes / 1600 GPUs on Frontier. There's nothing special about it; it's just a single-level ABL problem. It is bigger than the other cases I've run on Frontier, so maybe that's why the ABL stats post is so egregious, but I've seen the same thing on other, smaller cases.

Lawrence

hgopalan commented 7 months ago

We are usually interested in the statistics only at the end of the sampling period and may not need the data written to a file that frequently. We could consider adding an input for the frequency at which the statistics are computed (currently every time step), so that they are not computed at every step; computing them at every step mainly serves to speed up the convergence of the mean and variance.
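A minimal, hypothetical sketch of that suggestion; the parameter names and the post-advance hook below are illustrative, not existing amr-wind inputs or API:

```cpp
// Hypothetical: compute the ABL statistics only every `compute_frequency` steps
// and write them only every `output_frequency` steps.
#include <cstdio>

struct StatsConfig {
    int compute_frequency = 2; // hypothetical input, e.g. an "ABL.stats_compute_frequency"
    int output_frequency  = 4; // analogous to the existing ABL.stats_output_frequency
};

void post_advance_stats(int step, const StatsConfig& cfg)
{
    if (step % cfg.compute_frequency != 0) return; // skip the expensive statistics work
    std::printf("step %d: compute mean/variance and higher-order stats\n", step);

    if (step % cfg.output_frequency != 0) return;  // skip the file output
    std::printf("step %d: write stats to file\n", step);
}

int main()
{
    StatsConfig cfg;
    for (int step = 1; step <= 8; ++step) post_advance_stats(step, cfg);
    return 0;
}
```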

lawrenceccheung commented 7 months ago

There's a bit of a chicken-and-egg problem: sometimes, in order to determine whether an ABL has suitably converged to the right statistics (and thus end the ABL run), we need the ABL statistics output reliably and frequently enough.

However, just to be clear, the calculation of the mean temperature and velocity profiles is not the issue, since that is already done at every timestep and I don't see a performance penalty from it. It's the calculation of the higher-order statistics, the zi height, and the subsequent output to netcdf files that is very slow: https://github.com/Exawind/amr-wind/blob/7291737434ca339ecc765355eab88ddd529ff68f/amr-wind/wind_energy/ABLStats.cpp#L209-L231

Lawrence

marchdf commented 7 months ago

Agreed with Lawrence. That's what I was targeting when creating this issue.

moprak-nrel commented 7 months ago

I poked around with changing the output format to native instead of netcdf, and there seems to be a significant improvement when more cores are involved on my mac. I suspect a big part of this is the IO being faster in native mode, since it's not bottlenecked by a single IOProcessor rank the way the netcdf format is. I am wondering whether that might be a solution for some users to avoid the egregious slowdown with sampling enabled. My example cases below use at most 8 cores, but the Post: times should be a good indicator of the IO bottlenecks.
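For anyone wanting to try the same comparison, the switch is the sampler's output format input. A minimal illustrative block follows; the "sampling" label and exact option names should be checked against the amr-wind sampling documentation:

```
# Illustrative sampling block (names to verify against the amr-wind docs)
incflo.post_processing    = sampling
sampling.output_format    = native     # compared against netcdf in the timings below
sampling.output_frequency = 2          # matches the demo case below
```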

Quick example times from a demo case running on my MacBook (sampling output frequency = 2).

netcdf with 4 cores:

534:WallClockTime: 15 Pre: 0.0154 Solve: 6.515 Post: 0.0902 Total: 6.621
550:WallClockTime: 16 Pre: 0.0156 Solve: 6.41 Post: 0.502 Total: 6.928
566:WallClockTime: 17 Pre: 0.0153 Solve: 6.81 Post: 0.0917 Total: 6.917
582:WallClockTime: 18 Pre: 0.0157 Solve: 6.645 Post: 0.545 Total: 7.205

netcdf with 8 cores:

534:WallClockTime: 15 Pre: 0.00802 Solve: 4.207 Post: 0.0475 Total: 4.263
550:WallClockTime: 16 Pre: 0.00879 Solve: 4.316 Post: 0.671 Total: 4.996
566:WallClockTime: 17 Pre: 0.00946 Solve: 4.529 Post: 0.0449 Total: 4.583
582:WallClockTime: 18 Pre: 0.00891 Solve: 4.17 Post: 0.685 Total: 4.865

Note how the Post times with 8 cores are slower than with 4 cores. This disappears when using the native output format, which is both faster and does not slow down as the number of cores increases.

native with 4 cores:

534:WallClockTime: 15 Pre: 0.0159 Solve: 6.473 Post: 0.0862 Total: 6.575
550:WallClockTime: 16 Pre: 0.0156 Solve: 6.489 Post: 0.473 Total: 6.977
566:WallClockTime: 17 Pre: 0.0159 Solve: 6.555 Post: 0.0873 Total: 6.658
582:WallClockTime: 18 Pre: 0.0166 Solve: 6.505 Post: 0.456 Total: 6.978

native with 8 cores:

534:WallClockTime: 15 Pre: 0.00857 Solve: 4.031 Post: 0.0471 Total: 4.087
550:WallClockTime: 16 Pre: 0.0121 Solve: 3.931 Post: 0.395 Total: 4.338
566:WallClockTime: 17 Pre: 0.00828 Solve: 4 Post: 0.0613 Total: 4.069
582:WallClockTime: 18 Pre: 0.0148 Solve: 3.866 Post: 0.397 Total: 4.278

moprak-nrel commented 7 months ago

Note, however, that this will not directly help with ABLStats, where the currently available output formats are just netcdf and ascii. I need to poke around more to see whether the majority of the slowdown comes from the IO in the first place (vs. communications/computations).

moprak-nrel commented 7 months ago

Upon testing ABLStats with and without the IO (I added a return before the netcdf IO starts), the majority of the time there seems to come from the computations.
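A stripped-down sketch of that experiment, with sleeps standing in for the statistics computation and the netcdf write (nothing here is amr-wind code):

```cpp
// Illustrative timing harness: compare Post time with and without the IO stage.
#include <chrono>
#include <cstdio>
#include <thread>

// Stand-ins for the real work; the sleeps just mimic a compute stage and an IO stage.
void compute_stats() { std::this_thread::sleep_for(std::chrono::milliseconds(200)); }
void write_netcdf()  { std::this_thread::sleep_for(std::chrono::milliseconds(50)); }

int main()
{
    for (bool skip_io : {false, true}) {
        auto t0 = std::chrono::steady_clock::now();
        compute_stats();
        if (!skip_io) write_netcdf(); // the "return before the netcdf IO" variant
        auto t1 = std::chrono::steady_clock::now();
        std::printf("skip_io=%d  Post: %.3f s\n", int(skip_io),
                    std::chrono::duration<double>(t1 - t0).count());
    }
    return 0;
}
```

If the two timings are close, the bottleneck is in the computation rather than the IO, which is what the ABLStats test described above showed.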

I am inclined to close this issue, given that the native writer seems to fix the sampling output and the bottlenecks in ABLStats are not IO bound. Feel free to reopen if you have other things to try or if you notice a major slowdown with the native output format on larger runs.