StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

S3D: Subranks single node performance #1715

Open syamajala opened 1 month ago

syamajala commented 1 month ago

I performed a weak scaling experiment where I varied the number of ranks per node of S3D from 1 to 8 and pushed each configuration as far as I could on Frontier. The results are here: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave_case_weak_scaling.html

One question that came out of this experiment is why the single node performance varies so much.

I have single node profiles from 1, 2, 4, 8 ranks/node on Frontier here:

1 rank: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_1ranks/legion_prof/

2 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_2ranks/legion_prof/

4 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_4ranks/legion_prof/

8 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_8ranks/legion_prof/

My hypothesis was that there is contention in the ROCm driver, so I performed the single node experiment again on sapling. The results appear to be very similar.

Profiles from sapling are here:

1 rank: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1rank/legion_prof/

2 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/legion_prof/

4 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/legion_prof/

@lightsighter what should I do next? I could try running in Nsight. I was also considering increasing the problem size.

lightsighter commented 1 month ago

Are you running with the hijack or without?

syamajala commented 1 month ago

With the hijack. For CUDA, at least, it is not possible to run without the hijack because of #1059.

lightsighter commented 1 month ago

Can you do runs where you capture both Legion Prof and Nsight profiles from the same run? Just need to do that on sapling for 1 and 4 ranks.
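Something along these lines should give both from one run (rough sketch only; the binary name, launcher, MPI environment variable, and problem args are placeholders for however S3D is normally launched on sapling):

```bash
# Wrap each rank in nsys so a single run produces both an nsys report per rank
# and the Legion Prof logs. Binary name and args are hypothetical placeholders.
mpirun -np 4 \
  nsys profile --trace=cuda,osrt -o s3d_rank%q{OMPI_COMM_WORLD_RANK} \
  ./s3d.x <problem args> <your usual -ll:* flags> \
    -lg:prof 4 -lg:prof_logfile prof_%.gz
```

For the 1 rank case, drop to -np 1 and -lg:prof 1.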

syamajala commented 1 month ago

I only profiled 4 timesteps instead of 200.

1 rank legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1ranks/legion_prof/

1 rank nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1ranks/run/

2 ranks legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/legion_prof/

2 ranks nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/run/

4 ranks legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/legion_prof/

4 ranks nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/run/

lightsighter commented 1 month ago

What do you see happening in the Nsight profiles for calls into the CUDA driver?
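If it helps, the per-call summary for the CUDA driver API can be pulled straight out of the reports with nsys stats (the report name and the .nsys-rep filename below are assumptions; older nsys versions call the report cudaapisum):

```bash
# Summarize time and call counts per CUDA API call (cuEventQuery, cuMemcpyAsync, ...).
nsys stats --report cuda_api_sum s3d_rank0.nsys-rep
```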

syamajala commented 1 month ago

cuEventQuery is the big thing that's jumping out, I guess? There are millions of calls to it and it's taking 52.5% of the time (3.877s) at 1 rank, 49.5% (3.129s) at 2 ranks, and 41.2% (2.159s) at 4 ranks.

The next thing after that is cuMemcpyAsync. It's getting ~2x faster as we add ranks.

syamajala commented 1 month ago

I've been running with -ll:bgwork 3 in all my configurations. Maybe I should use 1 background worker thread per GPU?
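Assuming the 4-GPU sapling nodes, that would look roughly like this (launcher, binary name, and problem args are placeholders):

```bash
# Hypothetical flag combinations for one background worker per GPU on a 4-GPU node.
mpirun -np 1 ./s3d.x <args> -ll:gpu 4 -ll:bgwork 4   # 1 rank driving all 4 GPUs
mpirun -np 4 ./s3d.x <args> -ll:gpu 1 -ll:bgwork 1   # 4 ranks, 1 GPU each
```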

syamajala commented 1 month ago

I guess when running 4 ranks I see the first two ranks with 3 threads each making CUDA calls, but the second two ranks only have 1 thread each.

For 2 ranks, the first rank only has 2 threads and the second has 3 threads.

At 1 rank it's just 3 threads.

lightsighter commented 1 month ago

> cuEventQuery is the big thing that's jumping out, I guess? There are millions of calls to it and it's taking 52.5% of the time (3.877s) at 1 rank, 49.5% (3.129s) at 2 ranks, and 41.2% (2.159s) at 4 ranks.

Does the average time spent in a single cuEventQuery call shrink as you add ranks?

> The next thing after that is cuMemcpyAsync. It's getting ~2x faster as we add ranks.

Total time or the average time per API call?

syamajala commented 1 month ago

The average per call is about the same across all of them for cuEventQuery, ~1.1us.

For cuMemcpyAsync that was total time, but the average per call is about the same.
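Back-of-the-envelope (assuming those totals are per rank): 3.877s / ~1.1us is roughly 3.5M cuEventQuery calls at 1 rank, ~2.8M at 2 ranks, and ~2.0M at 4 ranks, so the drop in total cuEventQuery time tracks the number of calls rather than the per-call latency.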