Open syamajala opened 1 month ago
You're running with the hijack or without?
With hijack. For CUDA, at least, it is not possible to run without the hijack because of #1059.
Can you do runs where you capture both Legion Prof and Nsight profiles from the same run? Just need to do that on sapling for 1 and 4 ranks.
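For reference, one way to get both profiles from the same run is to wrap the normal launch in nsys while leaving Legion Prof logging on. This is only a sketch: the binary name, output paths, and processor counts are placeholders, not the actual job script.

```shell
# Capture an Nsight Systems trace and a Legion Prof log from one run.
# nsys expands %q{SLURM_PROCID} to the rank's environment variable,
# and Legion expands % in -lg:prof_logfile to the rank ID.
# "s3d.x" and the flag values below are placeholders for the real job.
nsys profile -o nsys_rank%q{SLURM_PROCID} \
    ./s3d.x -ll:gpu 1 -ll:bgwork 3 \
    -lg:prof 4 -lg:prof_logfile legion_prof_%.gz
```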
I only profiled 4 timesteps instead of 200.
1 rank legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1ranks/legion_prof/
1 rank nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1ranks/run/
2 ranks legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/legion_prof/
2 ranks nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/run/
4 ranks legion prof: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/legion_prof/
4 ranks nsys profile: http://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/run/
What do you see happening in the Nsight profiles for calls into the CUDA driver?
cuEventQuery is the big thing that's jumping out, I guess? There are millions of calls to it, and it's taking 52.5% of the time (3.877s) at 1 rank, 49.5% (3.129s) at 2 ranks, and 41.2% (2.159s) at 4 ranks.
The next thing after that is cuMemcpyAsync. It's getting ~2x faster as we add ranks.
I've been running with -ll:bgwork 3 in all my configurations. Maybe I should use 1 background worker thread per GPU?
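If "1 background worker thread per GPU" means matching -ll:bgwork to the number of GPUs each rank drives, the launch would change as sketched below. The binary name and GPU count are placeholders; in the runs above, if each rank drives a single GPU, this would mean -ll:bgwork 1.

```shell
# Sketch: one Realm background worker thread per GPU in the rank.
# Here assuming 1 GPU per rank; scale both flags together otherwise.
./s3d.x -ll:gpu 1 -ll:bgwork 1
```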
When running 4 ranks, I see the first two ranks with 3 threads each making CUDA calls, but the second two ranks have only 1 thread each. For 2 ranks, the first rank has only 2 threads and the second has 3. At 1 rank it's just 3 threads.
> cuEventQuery is the big thing that's jumping out, I guess? There are millions of calls to it, and it's taking 52.5% of the time (3.877s) at 1 rank, 49.5% (3.129s) at 2 ranks, and 41.2% (2.159s) at 4 ranks.

Does the average time spent in a single cuEventQuery call shrink as you add ranks?
> The next thing after that is cuMemcpyAsync. It's getting ~2x faster as we add ranks.

Is that total time or the average time per API call?
For cuEventQuery the average is about the same across all of them, ~1.1us. For cuMemcpyAsync that was total time, but the average is about the same there too.
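As a quick sanity check on those numbers: with a flat ~1.1us average, the total times quoted above imply call counts in the low millions, consistent with "millions of calls". The call counts below are derived from the quoted figures, not measured directly.

```python
# Implied cuEventQuery call counts from the quoted per-run totals,
# assuming the ~1.1 us average per call holds at every rank count.
AVG_CALL_S = 1.1e-6  # ~1.1 us average per cuEventQuery call

totals = {1: 3.877, 2: 3.129, 4: 2.159}  # ranks -> total seconds in cuEventQuery

for ranks, total_s in totals.items():
    calls = total_s / AVG_CALL_S
    print(f"{ranks} rank(s): ~{calls / 1e6:.1f} million calls")
```

So the total time shrinks with rank count mainly because each rank issues fewer queries, not because individual queries get cheaper.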
I performed a scaling experiment where I varied the number of ranks of S3D from 1 to 8 and pushed each configuration as far as I could on Frontier. The results are here: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave_case_weak_scaling.html
One question raised by this experiment is why the single-node performance varies so much.
I have single node profiles from 1, 2, 4, 8 ranks/node on Frontier here:
1 rank: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_1ranks/legion_prof/
2 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_2ranks/legion_prof/
4 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_4ranks/legion_prof/
8 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/ranks/pwave_x_1_ammonia_8ranks/legion_prof/
The hypothesis is that there is contention in the ROCm driver, so I repeated the single-node experiment on sapling. The results appear to be very similar.
Profiles from sapling are here:
1 rank: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_1rank/legion_prof/
2 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_2ranks/legion_prof/
4 ranks: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/sapling/pwave_x_1_ammonia_4ranks/legion_prof/
@lightsighter, what should I do next? I could try running under Nsight. I was also considering increasing the problem size.