Exawind / amr-wind

AMReX-based structured wind solver
https://exawind.github.io/amr-wind

Poor performance of amr-wind #1097

Open Armin-Ha opened 2 weeks ago

Armin-Ha commented 2 weeks ago

Hi all,

I have conducted a strong-scaling analysis for a small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m resolution) on two machines with different CPUs. I am concerned about the poor performance observed in this analysis and would appreciate any insights. The corresponding input file and a log file are attached.
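The usual strong-scaling metrics behind such an analysis can be sketched as follows (the per-step times below are hypothetical placeholders, not the measured values from the attached logs):

```python
# Minimal sketch: strong-scaling speedup and parallel efficiency from
# average wall-clock time per step. Times (seconds) are hypothetical.
times = {1: 40.0, 2: 22.0, 4: 13.0, 8: 9.0}  # cores -> seconds/step

t1 = times[1]
for cores in sorted(times):
    speedup = t1 / times[cores]    # ideal: equal to the core count
    efficiency = speedup / cores   # ideal: 1.0
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```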

AMR_performance log.txt spinup.txt

Best regards, Armin

marchdf commented 2 weeks ago

Hi, thanks for reaching out! The short answer: strong scaling for a code that spends most of its time in linear solvers (as amr-wind does) can be very difficult in general.

However, there are some things you can do to get the most performance out of your case.

We don't have good guidance for your specific setup because we typically don't spend much time profiling at this scale, and these things vary quite a bit from machine to machine. We do spend a lot of time thinking about code performance for GPUs and for O(10-100k) MPI ranks, and we have better intuition there for the kinds of numbers that lead to good performance.

After all this, if the code is still not fast enough, then we need to start talking about linear solver input parameters.
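(Purely as a hypothetical sketch of what that discussion would touch: the projection-solver tolerance inputs below use AMR-Wind's documented `mac_proj`/`nodal_proj` naming, but the values are made up and are not recommendations.)

```
# Hypothetical linear-solver inputs one might tune (illustrative values only):
mac_proj.mg_rtol     = 1.0e-6    # MAC projection relative tolerance
mac_proj.mg_atol     = 1.0e-10   # MAC projection absolute tolerance
nodal_proj.mg_rtol   = 1.0e-6    # nodal projection relative tolerance
nodal_proj.mg_atol   = 1.0e-10   # nodal projection absolute tolerance
```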

asalmgren commented 2 weeks ago

@Armin-Ha -- just to follow up -- when you have a chance to re-run with the profiling on, could you send us the output files (maybe just from 1, 4, and 8 cores)? Also, it looks like you do have checkpointing and plotfiles on -- could you turn those off before re-running? And feel free to run fewer steps: if I'm reading your inputs file correctly, you are running over 14000 steps and writing plotfiles/checkpoints roughly 28 times. See what happens if you run, say, 100 steps for each case with all the I/O off? Thx
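For concreteness, a timing-only run along those lines would look something like the fragment below (parameter names follow the standard AMR-Wind `time.*` inputs; the values are made up for illustration -- double-check against the attached spinup.txt):

```
# Hypothetical timing-only settings for the spinup input file:
time.max_step            = 100   # stop after 100 steps
time.plot_interval       = -1    # negative interval disables plotfiles
time.checkpoint_interval = -1    # negative interval disables checkpoints
```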

Armin-Ha commented 2 weeks ago

Hi Ann,

Thanks for the reply. I conducted the simulations for around 50 steps, so no checkpointing or plotfile writing was involved except at the initial time. As you know, AMR-Wind outputs the total time for every single time step, and I averaged these times over the 50 steps to exclude the writing time for the initial checkpoint and plot files. I will re-run the cases as you suggested and send you the output files. In addition, I will examine Marc's suggestions to improve the performance.

Best regards, Armin
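The averaging described above amounts to something like this (the step times here are hypothetical placeholders, not taken from the logs):

```python
# Sketch: average the per-step wall times while dropping step 0, which
# includes the initial checkpoint/plotfile I/O. Times are hypothetical.
step_times = [55.2, 12.1, 11.9, 12.3, 12.0]  # seconds; step 0 first

avg = sum(step_times[1:]) / len(step_times[1:])
print(f"average time/step excluding step 0: {avg:.3f} s")
```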

asalmgren commented 2 weeks ago

Sounds great, thanks! The most important thing for me to look at will be the profiling results that are printed at the end of the run.

To clarify - did you run with 512 grids for each run? Or use fewer (larger) boxes at lower core counts?
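(Background on that question, as a hypothetical illustration: in AMReX-based codes the box count is controlled by the gridding inputs, independent of the core count.)

```
# Hypothetical gridding inputs; values are illustrative, not recommendations:
amr.max_grid_size   = 32   # 256^3 mesh -> (256/32)^3 = 512 boxes
amr.blocking_factor = 8    # box sizes must be multiples of this
# e.g. amr.max_grid_size = 128 would instead give (256/128)^3 = 8 larger
# boxes, which can reduce communication overhead at low core counts.
```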



Armin-Ha commented 1 week ago

I will provide you with the profiling results. Throughout the study, I maintained a fixed mesh of 256x256x256 cells with a fixed domain size of 2560 x 2560 x 1280 m^3. The only variable I modified between simulations was the number of cores.

Best regards, Armin

lawrenceccheung commented 1 week ago

Hi @Armin-Ha,

For comparison, here are some strong-scaling results for AMR-Wind that we've observed (the plots show time per timestep, which can be converted to speedup). This is a 512 x 512 x 512 ABL case run on the CPUs and GPUs of the Frontier cluster.

[scaling plot: time per timestep vs. CPU/GPU count]

The details of the hardware are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#system-overview, but the CPUs are AMD 3rd Gen EPYC processors. Let me know if you have any questions.

Cheers,

Lawrence

Armin-Ha commented 5 days ago

Hi @asalmgren and @lawrenceccheung,

Sorry for my late reply, and thanks for sharing the strong scaling results of AMR-Wind, which appear to be reasonably linear on AMD 3rd Gen EPYC processors. I would appreciate it if you could provide me with the input file used for this analysis.

I have replicated the analysis for the small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m resolution) on an Intel Xeon W-2145. The corresponding log files, which include the profiling output, are attached.

log_1cores.txt log_2cores.txt log_4cores.txt log_8cores.txt

Best regards, Armin

marchdf commented 3 days ago

Thanks for the update. @lawrenceccheung do you have the input file for Armin to try?

I am running some local tests on my machine to see if there are better settings for your specific case. I will be out for the next week or so though.

lawrenceccheung commented 3 days ago

Hi @Armin-Ha,

Yes, you can try running the 512x512x512 case that I used here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/scaling/Baseline_level0/MedWS_LowTI_precursor1.inp. Just set time.max_step or time.stop_time to something small to run a few iterations for timing purposes.
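A hypothetical override for such a short timing run (values made up for illustration):

```
# Stop the linked input file after a handful of steps:
time.max_step  = 20      # stop after 20 steps
# or, alternatively, set a small physical stop time:
# time.stop_time = 10.0  # seconds
```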

Lawrence