LBL-EESA / TECA

TECA, the Toolkit for Extreme Climate Analysis, contains a collection of climate analysis algorithms targeted at extreme event detection and analysis.

WIP --- Temporal reduction profile #776

Open burlen opened 1 year ago

burlen commented 1 year ago

Time each stage in the app. This may need work/cleanup before merge. This info is already captured by the profiler.

burlen commented 1 year ago

Fastest

perlmutter_kernel_profiling_Fastest

Average

perlmutter_kernel_profiling_Average

Slowest

perlmutter_kernel_profiling_Slowest

Takeaway: The temporal reduction is much faster on the GPU. I/O is slower and shows much more variability when the GPU is used. Timing captures everything within the execute of each stage.
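For readers who want to see what "timing everything within the execute of each stage" can look like, here is a minimal sketch in Python. It is not TECA's code or API: `timed_stage`, `stage_times`, `reduce_execute`, and `write_execute` are hypothetical names made up for illustration, and the two stage bodies are trivial stand-ins for the real reduction and writer stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical registry: stage name -> list of elapsed wall-clock seconds,
# one entry per execute call.
stage_times = defaultdict(list)


@contextmanager
def timed_stage(name):
    """Record the wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failed calls are still counted.
        stage_times[name].append(time.perf_counter() - start)


def reduce_execute(steps):
    # Stand-in for a temporal reduction stage: average the steps in a request.
    with timed_stage("reduce"):
        return sum(steps) / len(steps)


def write_execute(value):
    # Stand-in for a writer stage: format the reduced value.
    with timed_stage("write"):
        return f"{value:.2f}"


# Two requests with three steps each (made-up data).
for request in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]):
    write_execute(reduce_execute(request))

# Per-stage summary: number of execute calls and total time in the stage.
for name, samples in stage_times.items():
    print(f"{name}: calls={len(samples)} total={sum(samples):.6f}s")
```

Because the timer wraps the whole body of each execute, anything the stage does there (compute, data movement, I/O) lands in that stage's total, which is the same scope the takeaway describes.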

burlen commented 1 year ago

Varying steps per request (1 reduce thread, 1 writer thread)

steps_per_request_single_thread_1red_1wri

Varying steps per request (4 reduce threads, 2 writer threads)

steps_per_request_single_thread_4red_2wri

burlen commented 1 year ago

Round 2: steps per request

I redid the tests, this time going to larger steps per request. The same patterns appear.

Varying steps per request (1 reduce thread, 1 writer thread)

steps_per_request_single_thread_1red_1wri_789

Varying steps per request (4 reduce threads, 2 writer threads)

steps_per_request_single_thread_4red_2wri_789

burlen commented 1 year ago

Single node with MPI

perlmutter_1_node_gpu_cpu_mpi_spr

burlen commented 1 year ago

new vs old

perlmutter_1_node_gpu_cpu_mpi_spr_strm

burlen commented 1 year ago

steps_per_request_single_thread_1red_1wri_cfs_scratch

burlen commented 1 year ago

steps_per_request_single_thread_cfs_lfs_nocomp

steps_per_request_single_thread_cfs_lfs_comp

burlen commented 1 year ago

steps_per_request_single_thread_cfs_lfs_comp_nocomp