CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

GPU and CPU Profiling #1912

Closed hennyg888 closed 1 year ago

hennyg888 commented 3 years ago

Here are some profiling results obtained on Satori with nvprof. This is a GPU profile of the nonhydrostatic model.

==104758== NVPROF is profiling process 104758, command: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project benchmarkable_incompressible_model.jl

Oceananigans v0.60.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
  GPU: Tesla V100-SXM2-32GB

CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

2 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.432 GiB / 31.749 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)
nothing

[2021/07/30 10:27:44.108] INFO  Setting up benchmark: (GPU, Float64, 128)...
[2021/07/30 10:28:25.970] INFO  warming up
[2021/07/30 10:29:55.456] WARN  Calling CUDA.@profile only informs an external profiler to start.
The user is responsible for launching Julia under a CUDA profiler.

It is recommended to use Nsight Systems, which supports interactive profiling:
$ nsys launch julia -@-> /home/henryguo/.julia/packages/CUDA/lwSps/lib/cudadrv/profile.jl:71
[2021/07/30 10:29:58.016] INFO  done profiling (GPU, Float64, 128)
==104758== Profiling application: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project benchmarkable_incompressible_model.jl
==104758== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   12.29%  502.36us         5  100.47us  94.015us  103.42us  _Z25julia_gpu_ab2_step_field_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu_ab2_step_field_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int64S9_S8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    9.47%  386.91us         4  96.727us  88.672us  105.02us  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                    6.69%  273.47us         5  54.694us  53.503us  56.800us  _Z33julia_gpu_store_field_tendencies_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE28_gpu_store_field_tendencies_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    4.45%  181.95us         2  90.975us  89.088us  92.863us  [CUDA memcpy DtoD]
                    4.44%  181.41us         1  181.41us  181.41us  181.41us  _Z39julia_gpu__pressure_correct_velocities_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE34_gpu__pressure_correct_velocities_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE10NamedTupleI12__u___v___w_5TupleI11OffsetArrayI7Float64Li3E13CuDeviceArrayIS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEEEE22RegularRectilinearGridIS11_8PeriodicS14_7BoundedS10_IS11_Li1E12StepRangeLenIS11_14TwicePrecisionIS11_ES17_IS11_EEEE5Int64S10_IS11_Li3ES12_IS11_Li3ELi1EEE
                    4.36%  178.18us         2  89.087us  87.199us  90.976us  void vector_fft<unsigned int=128, unsigned int=8, unsigned int=2, padding_t=6, twiddle_t=0, loadstore_modifier_t=2, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
                    4.30%  175.77us         1  175.77us  175.77us  175.77us  julia_broadcast_kernel_12145(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, _real, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
                    4.20%  171.81us         1  171.81us  171.81us  171.81us  _Z23julia_gpu_calculate_Gv_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gv_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE19CenteredSecondOrdervvv8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI23__velocities___tracers_5TupleIS21_I12__u___v___w_S22_I9ZeroFieldS23_S23_EES21_I8__T___S_S22_IS23_S23_EEEES21_I12__u___v___w_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES21_I8__T___S_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS21_I20__u___v___w___T___S_S22_I12_zeroforcingS24_S24_S24_S24_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES21_I27__time___iteration___stage_S22_IS9_5Int64S25_EE
                    4.16%  170.05us         2  85.023us  84.703us  85.343us  void scal_kernel_val<double2, double>(cublasScalParamsVal<double2, double>)
                    4.11%  167.94us         1  167.94us  167.94us  167.94us  _Z23julia_gpu_calculate_Gu_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gu_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE19CenteredSecondOrdervvv8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI23__velocities___tracers_5TupleIS21_I12__u___v___w_S22_I9ZeroFieldS23_S23_EES21_I8__T___S_S22_IS23_S23_EEEES21_I12__u___v___w_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES21_I8__T___S_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS21_I20__u___v___w___T___S_S22_I12_zeroforcingS24_S24_S24_S24_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES21_I27__time___iteration___stage_S22_IS9_5Int64S25_EE
                    3.65%  149.28us         1  149.28us  149.28us  149.28us  _Z23julia_gpu_calculate_Gc_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gc_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE3ValILi2EE19CenteredSecondOrderv8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI23__velocities___tracers_5TupleIS22_I12__u___v___w_S23_I9ZeroFieldS24_S24_EES22_I8__T___S_S23_IS24_S24_EEEES22_I12__u___v___w_S23_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES22_I8__T___S_S23_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEEv12_zeroforcingS22_I27__time___iteration___stage_S23_IS9_5Int64S26_EE
                    3.65%  148.99us         1  148.99us  148.99us  148.99us  _Z23julia_gpu_calculate_Gc_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gc_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE3ValILi1EE19CenteredSecondOrderv8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI23__velocities___tracers_5TupleIS22_I12__u___v___w_S23_I9ZeroFieldS24_S24_EES22_I8__T___S_S23_IS24_S24_EEEES22_I12__u___v___w_S23_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES22_I8__T___S_S23_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEEv12_zeroforcingS22_I27__time___iteration___stage_S23_IS9_5Int64S26_EE
                    3.64%  148.93us         1  148.93us  148.93us  148.93us  _Z28julia_broadcast_kernel_1160115CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS6_S3_I12CuArrayStyleILi3EEv5_realS4_IS3_IS8_ILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS11_S11_ES4_IS6_S6_S6_EES10_IS0_IS1_IS2_ELi3ELi1EES4_IS11_S11_S11_ES4_IS6_S6_S6_EEEEEEEES6_
                    3.47%  141.73us        20  7.0860us  5.4400us  9.2480us  _Z27julia_broadcast_kernel_601215CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI9UnitRangeI5Int64E5SliceI5OneToIS5_EES6_IS7_IS5_EEELifalseEE11BroadcastedIvS3_IS7_IS5_ES7_IS5_ES7_IS5_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_ES6_IS7_IS5_EES6_IS7_IS5_EEELifalseEES3_I4BoolS11_S11_ES3_IS5_S5_S5_EEEES5_
                    3.44%  140.70us         1  140.70us  140.70us  140.70us  _Z58julia_gpu_calculate_pressure_source_term_fft_based_solver_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE53_gpu_calculate_pressure_source_term_fft_based_solver_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE5Int6410NamedTupleI12__u___v___w_5TupleIS14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEEEE
                    3.20%  130.85us         1  130.85us  130.85us  130.85us  _Z23julia_gpu_calculate_Gw_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gw_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE19CenteredSecondOrdervvv8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI23__velocities___tracers_5TupleIS21_I12__u___v___w_S22_I9ZeroFieldS23_S23_EES21_I8__T___S_S22_IS23_S23_EEEES21_I12__u___v___w_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES21_I8__T___S_S22_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS21_I20__u___v___w___T___S_S22_I12_zeroforcingS24_S24_S24_S24_EES21_I27__time___iteration___stage_S22_IS9_5Int64S25_EE
                    3.13%  127.84us         1  127.84us  127.84us  127.84us  _Z38julia_gpu_update_hydrostatic_pressure_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE33_gpu_update_hydrostatic_pressure_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE8BuoyancyI16SeawaterBuoyancyIS9_21LinearEquationOfStateIS9_EvvE10ZDirectionE10NamedTupleI8__T___S_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    2.88%  117.73us         1  117.73us  117.73us  117.73us  _Z28julia_gpu_permute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu_permute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
                    2.78%  113.44us         1  113.44us  113.44us  113.44us  _Z30julia_gpu_unpermute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE25_gpu_unpermute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
                    2.61%  106.82us         1  106.82us  106.82us  106.82us  _Z28julia_broadcast_kernel_1172915CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS3_I12CuArrayStyleILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS10_S10_ES4_IS6_S6_S6_EEEES3_IS8_ILi3EEvS7_S4_IS9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EEEEEES6_
                    2.23%  90.976us         1  90.976us  90.976us  90.976us  julia_broadcast_kernel_11876(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, __, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>, Int64<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
                    2.15%  87.840us        20  4.3920us  4.0320us  5.8240us  _Z27julia_broadcast_kernel_614315CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI5SliceI5OneToI5Int64EE9UnitRangeIS6_ES4_IS5_IS6_EEELifalseEE11BroadcastedIvS3_IS5_IS6_ES5_IS6_ES5_IS6_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_IS6_EES7_IS6_ES4_IS5_IS6_EEELifalseEES3_I4BoolS11_S11_ES3_IS6_S6_S6_EEEES6_
                    1.74%  70.912us         1  70.912us  70.912us  70.912us  _Z31julia_gpu_copy_pressure_kernel_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE26_gpu_copy_pressure_kernel_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEES10_I7ComplexIS9_ELi3ELi1EE
                    0.90%  36.800us         8  4.6000us  3.4560us  6.8480us  _Z28julia_gpu__fill_bottom_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu__fill_bottom_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
                    0.77%  31.424us         8  3.9280us  3.5200us  4.4480us  _Z25julia_gpu__fill_top_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu__fill_top_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
                    0.27%  11.136us         4  2.7840us  2.6560us  2.9440us  _Z36julia_gpu_set_top_bottom_w_velocity_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE31_gpu_set_top_bottom_w_velocity_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int6417BoundaryConditionI4OpenvE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE10NamedTupleI27__time___iteration___stage_5TupleIS9_S11_S11_EES19_I20__u___v___w___T___S_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.13%  5.1510us         2  2.5750us  2.4310us  2.7200us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.12%  5.0560us         2  2.5280us  2.4640us  2.5920us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.12%  5.0240us         2  2.5120us  2.4320us  2.5920us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionI4FluxvES18_IS19_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.08%  3.0720us         1  3.0720us  3.0720us  3.0720us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.07%  2.7520us         1  2.7520us  2.7520us  2.7520us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4OpenvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.6240us         1  2.6240us  2.6240us  2.6240us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.6240us         1  2.6240us  2.6240us  2.6240us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.6240us         1  2.6240us  2.6240us  2.6240us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.5920us         1  2.5920us  2.5920us  2.5920us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.5920us         1  2.5920us  2.5920us  2.5920us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.5600us         1  2.5600us  2.5600us  2.5600us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  2.5600us         1  2.5600us  2.5600us  2.5600us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I20__u___v___w___T___S_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.04%  1.7280us         1  1.7280us  1.7280us  1.7280us  [CUDA memcpy HtoD]
      API calls:   99.70%  486.48ms       100  4.8648ms  12.466us  484.80ms  cuLaunchKernel
                    0.09%  436.41us       177  2.4650us  1.2760us  8.2750us  cuStreamQuery
                    0.08%  389.95us       687     567ns     441ns  2.8200us  cuCtxGetCurrent
                    0.04%  185.14us       112  1.6530us  1.1680us  2.8160us  cuStreamWaitEvent
                    0.03%  139.03us         8  17.378us  15.304us  19.671us  cudaLaunchKernel
                    0.02%  117.02us        81  1.4440us  1.2440us  3.7220us  cuEventRecord
                    0.02%  95.922us        81  1.1840us     930ns  2.7580us  cuEventCreate
                    0.01%  43.715us         2  21.857us  19.755us  23.960us  cuMemcpyDtoDAsync
                    0.01%  42.747us        44     971ns     809ns  2.0130us  cuOccupancyMaxPotentialBlockSize
                    0.00%  14.712us         1  14.712us  14.712us  14.712us  cuMemcpyHtoDAsync
                    0.00%  4.1860us         1  4.1860us  4.1860us  4.1860us  cuPointerGetAttribute
                    0.00%  3.9880us         1  3.9880us  3.9880us  3.9880us  cuDeviceGetCount
                    0.00%  3.2390us         6     539ns     408ns     990ns  cudaGetErrorString
                    0.00%  2.6530us         4     663ns     483ns  1.0250us  cudaGetLastError

glwagner commented 3 years ago

Nice, thanks!

I'm pretty surprised that ab2_step_field! dominates the cost. ab2_step_field! is this simple function:

https://github.com/CliMA/Oceananigans.jl/blob/9ecddac3fe2666e05f21e51b81ec2c403094e5ea/src/TimeSteppers/quasi_adams_bashforth_2.jl#L121

which seems much cheaper than something like calculate_Gu!. What's going on?

I also notice that the function is a bit sketchy: it uses the type of χ to convert 1.5 and 0.5. This is fine if χ is a floating-point number, but not otherwise; it should probably use eltype(U).
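
To illustrate the concern, here is a minimal sketch in plain Julia. The helper names `ab2_coeffs_from_χ` and `ab2_coeffs_from_field` are hypothetical and are not the Oceananigans implementation; the coefficient form (1.5 + χ, 0.5 + χ) is assumed for illustration. The sketch only shows how converting the constants with `typeof(χ)` differs from converting them with `eltype(U)` when χ is not a floating-point number.

```julia
# Hypothetical helpers (not Oceananigans code): build the two AB2-style
# coefficients (1.5 + χ) and (0.5 + χ), converting the constants two ways.

# Converts the constants using the type of χ.
function ab2_coeffs_from_χ(χ)
    FT = typeof(χ)
    return convert(FT, 1.5) + χ, convert(FT, 0.5) + χ
end

# Converts the constants (and χ) using the element type of the field U.
function ab2_coeffs_from_field(U, χ)
    FT = eltype(U)
    return convert(FT, 1.5) + convert(FT, χ), convert(FT, 0.5) + convert(FT, χ)
end

U = zeros(Float32, 4)

# Works when χ is a floating-point number, but the result follows χ's type:
ab2_coeffs_from_χ(0.1)        # Float64 coefficients, even for a Float32 field

# Fails when χ is an integer, because 1.5 cannot be converted to Int:
# ab2_coeffs_from_χ(0)        # throws InexactError

# Using eltype(U) works for any real χ and matches the field's precision:
ab2_coeffs_from_field(U, 0)   # (1.5f0, 0.5f0)
```

With `eltype(U)`, the arithmetic stays in the precision of the field regardless of how χ was specified.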

How did you run the profiler? Does it make sense to add a new profile directory to the source code (or maybe just add something to benchmark/)?

glwagner commented 3 years ago

Might be worthwhile to profile with timestepper=:RungeKutta3 as a sanity check, considering that this benchmark suggests a simple time-stepping function is 12% (!) of the cost.

Another thought --- we should probably benchmark "fully loaded" models that at least use WENO advection (and perhaps some turbulence closure?), since that's more realistic. I think most usage of NonhydrostaticModel also has one tracer, rather than two (someday, we should change that default...)

hennyg888 commented 3 years ago

I just edited an old benchmarkable incompressible model script so that it contains only the model setup and time stepping. I did not profile from the start; I profiled only the time_step! call. I feel like the profiles depend mostly on which system has which profiler, so it might make sense to add a few simple scripts to benchmark/ that consist only of model setup and a time step; those could be called profilables/benchmarkables.

francispoulin commented 3 years ago

@hennyg888, when you have time, could you add this line to the model constructor:

timestepper = :RungeKutta3,

The model should then use a different time-stepping scheme called RungeKutta3. This method should actually be slower, but it would be interesting to see whether it takes up more or less than the 12% that the default AdamsBashforth2 scheme uses.
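
For context, a minimal sketch of where that keyword would go, assuming the Oceananigans v0.60 API used elsewhere in this thread (`RegularRectilinearGrid`, and `NonhydrostaticModel` with an `architecture` keyword). The small CPU grid is an assumption made so the sketch runs quickly; the benchmark in this thread uses `GPU` and N = 128.

```julia
using Oceananigans

# Small CPU grid for illustration only (the thread's benchmark uses GPU, N = 128).
grid = RegularRectilinearGrid(Float64, size=(16, 16, 16), extent=(1, 1, 1))

# Same constructor call as in the profiling script, plus the timestepper keyword.
model = NonhydrostaticModel(architecture = CPU(),
                            grid = grid,
                            timestepper = :RungeKutta3)

# Take one time step with Δt = 1.
time_step!(model, 1)
```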

glwagner commented 3 years ago

> I just edited an old benchmarkable incompressible model script so that it contains only the model setup and time stepping. I did not profile from the start; I profiled only the time_step! call. I feel like the profiles depend mostly on which system has which profiler, so it might make sense to add a few simple scripts to benchmark/ that consist only of model setup and a time step; those could be called profilables/benchmarkables.

Ok! I can help with that.

christophernhill commented 3 years ago

@hennyg888 et al.: this looks great. If we can get some scripts together, we can start automating some of this to see how things change and to track down anomalies.

CUDA.jl has a chart ( https://speed.juliagpu.org/changes/?exe=6&env=1&tre=50 ) that shows timing trends for different bits of the system. Not sure how they generate this!

christophernhill commented 3 years ago

https://github.com/tobami/codespeed looks to be what the CUDA.jl timing tracking is based on.

hennyg888 commented 3 years ago

Here's the code used for the profiling.

push!(LOAD_PATH, joinpath(@__DIR__, ".."))

#using BenchmarkTools
using CUDA
using Oceananigans
using Benchmarks

# Benchmark parameters

Arch = GPU
FT = Float64
N = 128

print_system_info()

# Define benchmarks

@info "Setting up benchmark: ($Arch, $FT, $N)..."

grid = RegularRectilinearGrid(FT, size=(N, N, N), extent=(1, 1, 1))
model = NonhydrostaticModel(architecture=Arch(), grid=grid)

@info "warming up"

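# Take one step first so that compilation is excluded from the profile.
time_step!(model, 1)

# Profile a single time step; the second argument is the time step Δt, not a step count.
CUDA.@profile time_step!(model, 10000)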

@info "done profiling ($Arch, $FT, $N)"

hennyg888 commented 3 years ago

CPU profile with the script shown in #1914. Scroll to the right to see the file, line number, function name, and argument types. Entries are sorted by ascending count of backtrace samples. The flat format is used because the tree format, which shows the call hierarchy, runs to more than 3000 lines. Functions with fewer than 100 samples have been removed manually. Samples are taken at regular intervals, so the more often a function appears in the samples, the higher its count and the more time it consumes.
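
A flat listing of this kind can be produced with Julia's standard `Profile` library. The following is a minimal sketch, not the script from #1914: `workload` is a hypothetical stand-in for `time_step!(model, Δt)`, and the `mincount` keyword is shown as a possible alternative to removing low-count entries by hand.

```julia
using Profile

# Hypothetical stand-in workload; in this thread it would be time_step!(model, Δt).
function workload(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

workload(10)   # warm up so that compilation is not sampled

Profile.clear()
@profile workload(50_000_000)

# Flat listing sorted by sample count, hiding entries with fewer than 100 samples.
Profile.print(format = :flat, sortedby = :count, mincount = 100)
```
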

 Count  Overhead File                                                                                                                                Line Function
 =====  ======== ====                                                                                                                                ==== ========
      101         0 @Oceananigans/src/TurbulenceClosures/abstract_isotropic_diffusivity_closure.jl                                                        37 ν_σᶠᶜᶠ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
   101         0 @Oceananigans/src/TurbulenceClosures/abstract_isotropic_diffusivity_closure.jl                                                        37 overdub
   103         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      106 left_biased_αy₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   103         0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl                                                                                 52 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
   104         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                21 ℑxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   104         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  22 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   104         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       11 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   104         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
   104         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         38 advective_momentum_flux_Wu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   104         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         38 overdub
   105         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                21 ℑxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   105         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  22 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   105         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       11 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   105         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
   105         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         29 advective_momentum_flux_Vu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   105         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         29 overdub
   105         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           200 setindex!(::Field{Center, Face, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Floa...
   106         0 @Base/subarray.jl                                                                                                                    276 getindex
   106         0 @Base/abstractarray.jl                                                                                                              1214 _getindex
   106         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      221 overdub
   106         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      112 right_biased_αx₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
   106         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      168 right_biased_weno5_weights_x(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   107       107 @Base/array.jl                                                                                                                       802 getindex
   111         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           231 fill_halo_regions!(::Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVe...
   111         0 @Oceananigans/src/BoundaryConditions/fill_halo_regions.jl                                                                             18 fill_halo_regions!(::NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Perio...
   111         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         74 advective_momentum_flux_Uw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   111         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         74 overdub
   112         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   112         0 @Oceananigans/src/Operators/products_between_fields_and_grid_metrics.jl                                                               45 Az_ηᶠᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twi...
   112         0 @Oceananigans/src/Operators/products_between_fields_and_grid_metrics.jl                                                               45 overdub
   112         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      105 left_biased_αy₁(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   113         0 @Base/abstractarray.jl                                                                                                              1170 getindex
   117         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      114 right_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
   119         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         60 overdub
   120         0 @AbstractFFTs/src/definitions.jl                                                                                                     249 *(p::AbstractFFTs.ScaledPlan{ComplexF64, FFTW.cFFTWPlan{ComplexF64, 1, true, 3, Vector{Int64}}, Float64}, x::Array{ComplexF64, 3})
   122         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      118 right_biased_αy₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
   126         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      146 overdub
   127         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                24 overdub
   127         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  25 overdub
   127         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       12 overdub
   130         0 @Base/abstractarray.jl                                                                                                               984 copyto_unaliased!(deststyle::IndexCartesian, dest::SubArray{Float64, 3, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, srcstyl...
   130         0 @Base/abstractarray.jl                                                                                                               950 copyto!
   130         0 @Base/broadcast.jl                                                                                                                   977 copyto!
   130         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           200 setindex!(::Field{Center, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Fl...
   132         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      116 right_biased_αy₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
   132         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      181 right_biased_weno5_weights_y(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   133         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         96 overdub
   134         0 @Base/broadcast.jl                                                                                                                   984 macro expansion
   134         0 @Base/broadcast.jl                                                                                                                   983 copyto!
   135         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      172 overdub
   136         0 @Oceananigans/src/BoundaryConditions/fill_halo_regions.jl                                                                             30 fill_halo_regions!
   136         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  12 ℑ³xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
   136         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  12 overdub
   136         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      102 left_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   139         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   141         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      101 left_biased_αx₁(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   150         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                20 ℑxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   150         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                20 overdub
   150         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  21 symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   150         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  21 overdub
   150         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       15 symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   150         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       15 overdub
   150         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
   150         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         20 advective_momentum_flux_Uu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
   150         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         20 overdub
   150       150 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                          ? overdub
   157         0 @Base/simdloop.jl                                                                                                                     77 macro expansion
   160         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      117 overdub
   160         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      182 overdub
   160         0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl                                                                 26 overdub
   162       162 @FFTW/src/fft.jl                                                                                                                     466 unsafe_execute!
   162         0 @FFTW/src/fft.jl                                                                                                                     727 *
   166         0 @Base/math.jl                                                                                                                        918 overdub
   167         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      185 overdub
   169         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           190 getindex(::Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float...
   172         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      133 overdub
   172         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                        127 overdub
   174         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      102 left_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   175         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      113 overdub
   175         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      169 overdub
   177         0 @Oceananigans/src/Models/NonhydrostaticModels/update_hydrostatic_pressure.jl                                                          14 macro expansion
   178         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  13 overdub
   182         0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl                                                                 33 overdub
   186         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      175 overdub
   187         0 @KernelAbstractions/src/extras/loopinfo.jl                                                                                            26 macro expansion
   197         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   11 δyᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   209         0 @Oceananigans/src/Operators/interpolation_operators.jl                                                                                21 overdub
   209         0 @Oceananigans/src/Advection/centered_fourth_order.jl                                                                                  22 overdub
   209         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                       11 overdub
   217         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      100 left_biased_αx₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
   217         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      129 left_biased_weno5_weights_x(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
   223         0 @Oceananigans/src/Solvers/discrete_transforms.jl                                                                                     104 DiscreteTransform
   223         0 none                                                                                                                                   ? #31
   223         0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl                                                                                 49 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
   223         0 @Base/array.jl                                                                                                                       702 collect_to_with_first!
   223         0 @Base/array.jl                                                                                                                       683 collect(itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU, RegularRectilinearGrid{Fl...
   226         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      104 overdub
   226         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      142 overdub
   230         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   11 overdub
   234         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         69 overdub
   238         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      106 overdub
   238         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      144 overdub
   248         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      136 overdub
   254         0 @Base/array.jl                                                                                                                       724 collect_to!(dest::Vector{Nothing}, itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU...
   263         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      105 overdub
   263         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      143 overdub
   264         0 @Base/broadcast.jl                                                                                                                   936 copyto!
   264         0 @Base/broadcast.jl                                                                                                                   894 materialize!
   264         0 @Base/broadcast.jl                                                                                                                   891 materialize!
   265       265 @Oceananigans/src/Operators/difference_operators.jl                                                                                    ? overdub
   267         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      188 overdub
   268         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         42 overdub
   268         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      112 overdub
   268         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      168 overdub
   278         0 @Oceananigans/src/Solvers/discrete_transforms.jl                                                                                     112 DiscreteTransform
   278         0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl                                                                                 66 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
   279         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      118 overdub
   279         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      183 overdub
   280         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   292         0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl                                                                 19 overdub
   292         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      134 overdub
   296         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                57 overdub
   301         0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl                                                                            121 macro expansion
   306         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      114 overdub
   306         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      170 overdub
   308         0 @Base/array.jl                                                                                                                       678 collect(itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU, RegularRectilinearGrid{Fl...
   309         0 none                                                                                                                                   ? #33
   311         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           190 getindex(::Field{Center, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Flo...
   311         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      116 overdub
   311         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      181 overdub
   316         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      147 overdub
   316       316 @FFTW/src/fft.jl                                                                                                                     496 unsafe_execute!
   316         0 @FFTW/src/fft.jl                                                                                                                     890 *
   326         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      100 overdub
   326         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      129 overdub
   334         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      173 overdub
   337         0 @Oceananigans/src/Operators/derivative_operators.jl                                                                                   95 ∂yᶜᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   339         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      149 overdub
   343         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   344         0 @Oceananigans/src/Operators/derivative_operators.jl                                                                                   95 overdub
   356         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      101 overdub
   356         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      130 overdub
   366         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   366         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                        115 overdub
   393         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   393         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   397         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   397         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         75 overdub
   423         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           200 overdub
   428         0 @Base/array.jl                                                                                                                       841 setindex!(::Array{Float64, 3}, ::Float64, ::Int64, ::Int64, ::Int64)
   428         0 @OffsetArrays/src/OffsetArrays.jl                                                                                                    430 overdub
   429         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      212 overdub
   430         0 @Base/promotion.jl                                                                                                                   324 /(::Float64, ::Int64)
   430         0 @Base/promotion.jl                                                                                                                   324 overdub
   431         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   431         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                        116 overdub
   437         0 @Oceananigans/src/Models/NonhydrostaticModels/pressure_correction.jl                                                                  40 macro expansion
   440         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      217 overdub
   446         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      102 overdub
   446         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      131 overdub
   458         0 @Base/array.jl                                                                                                                       841 overdub
   466         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      227 overdub
   479         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   494         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   501         0 @Oceananigans/src/Solvers/discrete_transforms.jl                                                                                     136 apply_transform!
   517         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   517         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         76 overdub
   538         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      186 overdub
   551         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   552         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   555         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      135 overdub
   562         0 @Base/generator.jl                                                                                                                    47 iterate
   576         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      240 left_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   576         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      240 overdub
   576         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   576         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         21 overdub
   590         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      244 right_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   590         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      244 overdub
   590         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   590         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         22 overdub
   604         0 @Oceananigans/src/Solvers/solve_for_pressure.jl                                                                                        7 solve_for_pressure!
   604         0 @Oceananigans/src/Models/NonhydrostaticModels/pressure_correction.jl                                                                  20 calculate_pressure_correction!(model::NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, ...
   605         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      232 overdub
   624         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   655         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   665         0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl                                                                             55 time_step!(model::NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.OffsetA...
   672         0 @Oceananigans/src/Fields/abstract_field.jl                                                                                           190 overdub
   677         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   681         0 @Base/array.jl                                                                                                                       802 getindex(::Array{Float64, 3}, ::Int64, ::Int64, ::Int64)
   681         0 @OffsetArrays/src/OffsetArrays.jl                                                                                                    409 overdub
   685         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   685         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                        107 overdub
   690         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      174 overdub
   698         0 @Base/array.jl                                                                                                                       802 overdub
   706         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   706         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         30 overdub
   712         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           18 _advective_momentum_flux_Ww(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
   712         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           18 overdub
   725         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   726         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   731         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      241 left_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   731         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      241 overdub
   731         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   731         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         57 overdub
   736         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   27 δzᵃᵃᶠ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   745         0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl                                                                             44 time_step!##kw
   745         0 @Oceananigans/src/Simulations/run.jl                                                                                                  68 #ab2_or_rk3_time_step!#5
   745         0 @Oceananigans/src/Simulations/run.jl                                                                                                  68 ab2_or_rk3_time_step!##kw
   745         0 @Oceananigans/src/Simulations/run.jl                                                                                                 177 run!(sim::Simulation{NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.Offs...
   749         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      187 overdub
   754         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   27 overdub
   760         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   763         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   768         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   768         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         49 overdub
   781         0 @Oceananigans/src/Simulations/run.jl                                                                                                 127 run!(sim::Simulation{NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.Offs...
   781         0 @Base/boot.jl                                                                                                                        360 eval
   781         0 @Base/loading.jl                                                                                                                    1116 include_string(mapexpr::typeof(identity), mod::Module, code::String, filename::String)
   781         0 @Base/loading.jl                                                                                                                    1170 _include(mapexpr::Function, mod::Module, _path::String)
   781         0 @Base/Base.jl                                                                                                                        386 include(mod::Module, _path::String)
   781         0 @Base/client.jl                                                                                                                      285 exec_options(opts::Base.JLOptions)
   781         0 @Base/client.jl                                                                                                                      485 _start()
   796       796 @Cassette/src/context.jl                                                                                                               ? overdub
   821         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   821         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         85 overdub
   860       860 @KernelAbstractions/src/compiler/contract.jl                                                                                          18 sub_float_contract
   860         0 @KernelAbstractions/src/compiler.jl                                                                                                   46 overdub
   873         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      148 overdub
   879         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
   903         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   903         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         31 overdub
   911         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   911         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                        106 overdub
   921         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      245 right_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   921         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      245 overdub
   921         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _right_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
   921         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         58 overdub
   926         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
   940         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
   940         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         48 overdub
   941         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         10 upwind_biased_product(::Float64, ::Float64, ::Float64)
   941         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         10 overdub
  1000         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           10 _advective_momentum_flux_Wu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1000         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           10 overdub
  1018         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
  1018         0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl                                                                         84 overdub
  1022         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           14 _advective_momentum_flux_Wv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1022         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           14 overdub
  1048         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1051         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1127         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           16 _advective_momentum_flux_Uw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1127         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           16 overdub
  1127         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1148      1148 @Cassette/src/context.jl                                                                                                             456 call
  1156         0 @Cassette/src/context.jl                                                                                                             454 fallback
  1156         0 @Cassette/src/overdub.jl                                                                                                             582 _overdub_fallback(::Any, ::Vararg{Any, N} where N)
  1156         0 @Cassette/src/overdub.jl                                                                                                             582 overdub
  1229      1229 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                 ? __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.Stati...
  1318         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1363         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                            8 _advective_momentum_flux_Uu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1363         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                            8 overdub
  1363         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   21 δxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1372         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   21 overdub
  1602         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1681      1681 @KernelAbstractions/src/compiler/contract.jl                                                                                          18 mul_float_contract
  1681         0 @KernelAbstractions/src/compiler.jl                                                                                                   47 overdub
  1714         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                            9 _advective_momentum_flux_Vu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1714         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                            9 overdub
  1714         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1781         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           13 _advective_momentum_flux_Vv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1781         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           13 overdub
  1781         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   24 δyᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1791         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   24 overdub
  1797         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           12 _advective_momentum_flux_Uv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1797         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           12 overdub
  1797         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  1970         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           17 _advective_momentum_flux_Vw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
  1970         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           17 overdub
  1970         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
  2094         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      226 overdub
  2183         0 @Base/operators.jl                                                                                                                   560 +(::Float64, ::Float64, ::Float64)
  2262         0 @Base/operators.jl                                                                                                                   560 overdub
  2381         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      216 overdub
  2395         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      211 overdub
  2471         0 @Oceananigans/src/Advection/weno_fifth_order.jl                                                                                      231 overdub
  3033      3033 @KernelAbstractions/src/compiler/contract.jl                                                                                          18 add_float_contract
  3033         0 @KernelAbstractions/src/compiler.jl                                                                                                   45 overdub
  3688         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   26 overdub
  4069         0 @Oceananigans/src/Advection/tracer_advection_operators.jl                                                                             28 div_Uc(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
  4069         0 @Oceananigans/src/Advection/tracer_advection_operators.jl                                                                             28 overdub
  4202         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                      186 overdub
  4469      4469 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                 ? overdub
  4570         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           57 div_𝐯u(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
  4570         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           57 overdub
  4651         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   20 overdub
  4791         0 @Oceananigans/src/Operators/difference_operators.jl                                                                                   23 overdub
  4979         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                       45 u_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
  4979         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                       45 overdub
  5018         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           87 div_𝐯w(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
  5018         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           87 overdub
  5247         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                      139 w_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
  5247         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                      139 overdub
  5452         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           72 div_𝐯v(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
  5452         0 @Oceananigans/src/Advection/momentum_advection_operators.jl                                                                           72 overdub
  5694         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                       94 v_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
  5694         0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl                                                       94 overdub
  9102      9102 @Base/float.jl                                                                                                                       335 /(::Float64, ::Float64)
  9102         0 @Base/float.jl                                                                                                                       335 overdub
 11777         0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl                                                                36 overdub
 21668         0 @KernelAbstractions/src/macros.jl                                                                                                    266 overdub
 21676         4 @KernelAbstractions/src/cpu.jl                                                                                                       157 __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.Stati...
 22912         0 @KernelAbstractions/src/cpu.jl                                                                                                       130 __run(obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(128, 128)}, typeof(Oceananigans.Boun...
 22912         0 @KernelAbstractions/src/cpu.jl                                                                                                        22 (::KernelAbstractions.var"#33#34"{Tuple{KernelAbstractions.NoneEvent}, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIt...
Total snapshots: 24177
francispoulin commented 3 years ago

Thanks @hennyg888 for sharing these results.

I presume this is with Profile and not ProfileView as we don't see the percentages spent on each function?

glwagner commented 3 years ago

ProfileView creates a flame graph:

https://github.com/timholy/ProfileView.jl

I haven't seen a text-based profile viewer that shows percentages like you describe, @francispoulin.

francispoulin commented 3 years ago

> ProfileView creates a flame graph:
>
> https://github.com/timholy/ProfileView.jl
>
> I haven't seen a text-based profile viewer that shows percentages like you describe @francispoulin .

Thanks for clarifying, and sorry for my misunderstanding.

francispoulin commented 3 years ago

Interesting that ab2 has only 745 counts, which is a much smaller relative share than what we saw in the GPU case.

glwagner commented 3 years ago

No worries, no need to apologize! I made the same mistake after reading Hendrik Ranocha's blog post and seeing

[image]

But this is actually the output of benchmarking individual components of the time-stepping scheme.

I think it'd be a good idea to set up similar microbenchmarks of the time-stepping components (update_state!, calculate_tendencies!, etc.). This is not quite the same as profiling, but it yields slightly more precise and more digestible information about the timing and relative cost of each component per time step.
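
As a rough illustration of the microbenchmark idea, here is a minimal sketch that times each component separately and reports its share of the total. The three `fake_*` functions are hypothetical stand-ins written for this example, not the Oceananigans functions, and the timer is Base's `@elapsed` rather than a dedicated benchmarking package.

```julia
# Hypothetical stand-in workloads; these are NOT the Oceananigans functions.
fake_update_state!(a) = (a .= a .* 0.5; nothing)
fake_calculate_tendencies!(g, a) = (g .= a .^ 2 .+ 1.0; nothing)
fake_ab2_step!(a, g, Δt) = (a .+= Δt .* g; nothing)

# Minimum wall-clock time (seconds) of `f()` over `samples` runs.
function min_time(f; samples = 5)
    f()  # warm-up call so that compilation is not included in the timing
    best = Inf
    for _ in 1:samples
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Time each component on an n^3 array and return its share of the total in percent.
function component_shares(n = 64)
    a = rand(n, n, n)
    g = zeros(n, n, n)
    timings = Dict(
        "update_state!"         => min_time(() -> fake_update_state!(a)),
        "calculate_tendencies!" => min_time(() -> fake_calculate_tendencies!(g, a)),
        "ab2_step!"             => min_time(() -> fake_ab2_step!(a, g, 0.01)),
    )
    total = sum(values(timings))
    return Dict(name => 100 * t / total for (name, t) in timings)
end

shares = component_shares()
for (name, pct) in sort(collect(shares); by = last, rev = true)
    println(rpad(name, 24), round(pct; digits = 1), " %")
end
```

On a GPU, each timed call would also need to synchronize the device before the timer stops; otherwise only the kernel launch is measured.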

glwagner commented 3 years ago

@hennyg888 I think we need line info (not just file) to precisely interpret the profiling results?

glwagner commented 3 years ago

By the way, ProfileView.jl does not play nice with multithreaded programs so we can't use it. I tried StatProfilerHTML and liked it:

https://github.com/tkluck/StatProfilerHTML.jl

hennyg888 commented 3 years ago

> @hennyg888 I think we need line info (not just file) to precisely interpret the profiling results?

If you scroll right in my big block of text, you can see columns showing the line number and function name; the file column is the one visible without scrolling. The full file is attached below and might be easier to view or reformat than the embedded code block above: nonhydrostatic_profile_flat.txt

I tried to avoid flame graphs and go for something as close to percentages as I could, so I went with the default output. I'll add StatProfilerHTML.jl outputs as well, since the flame graphs and HTML files do look very neat. The very last row gives a total snapshot count of 24177. Dividing the count in the left-most column by this number should give the fraction of time spent on that line or in any function called from that line.
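
The division described here can be sketched in a few lines. The counts below are copied from rows of the flat profile above; the sketch assumes the second column is the `Overhead` (self) count, as in Julia's flat `Profile.print` format.

```julia
# Total number of snapshots, taken from the last row of the flat profile.
total_snapshots = 24177

# (count, overhead, label) copied from selected rows of the flat profile.
rows = [
    (21668,    0, "overdub              @KernelAbstractions/src/macros.jl:266"),
    ( 9102, 9102, "/(::Float64, ::Float64)  @Base/float.jl:335"),
    ( 5694,    0, "v_velocity_tendency  velocity_and_tracer_tendencies.jl:94"),
    ( 4469, 4469, "overdub              topologically_conditional_interpolation.jl:?"),
    ( 3033, 3033, "add_float_contract   @KernelAbstractions/src/compiler/contract.jl:18"),
]

# Convert a snapshot count to a percentage of all snapshots.
percent(count) = round(100 * count / total_snapshots; digits = 1)

for (count, overhead, label) in rows
    println(lpad(string(percent(count)), 5), " % inclusive  ",
            lpad(string(percent(overhead)), 5), " % self  ", label)
end
```

The inclusive percentages do not add up to 100, because a snapshot is counted once for every frame on its call stack.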

hennyg888 commented 3 years ago

Profiling results for the nonhydrostatic model on GPU, using the script found in #1914. This was done on Satori with the WENO5 advection scheme, the AB2 timestepper, and a 128^3 grid. It now seems that timestepping takes less than 5% of the time, and the kernels that should take up the largest chunks of time (the tendency calculations) are doing so.
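
As a quick check of that reading, the sketch below sums the `Time(%)` values of the largest "GPU activities" rows in the nvprof output that follows. The short kernel labels are abbreviations of the mangled names, and treating "timestepping" as the `ab2_step_field` kernel is an assumption made for this example.

```julia
# (Time %, abbreviated kernel name) pairs read off the nvprof "GPU activities" rows.
gpu_activity = [
    (20.46, "calculate_Gv"),
    (19.93, "calculate_Gu"),
    (12.91, "calculate_Gw"),
    ( 8.89, "calculate_Gc"),
    ( 4.74, "ab2_step_field"),
    ( 4.17, "regular_fft"),
    ( 2.53, "store_field_tendencies"),
    ( 2.09, "_pressure_correct_velocities"),
]

# Sum the percentages of all kernels whose name contains `pattern`.
share(pattern) = sum(Float64[pct for (pct, name) in gpu_activity if occursin(pattern, name)])

tendencies = share("calculate_G")  # the four tendency kernels
time_step  = share("ab2_step")     # assumed to be what "timestepping" refers to

println("tendency kernels: ", round(tendencies; digits = 2), " %")
println("AB2 step kernel:  ", round(time_step; digits = 2), " %")
```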

Oceananigans v0.60.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
  GPU: Tesla V100-SXM2-32GB

CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

2 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 4.367 GiB / 31.749 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 4.805 GiB / 31.749 GiB available)
nothing

[2021/08/05 12:11:43.425] INFO  Setting up benchmark: (GPU, Float64, 128)...
[2021/08/05 12:12:45.688] INFO  warming up
[2021/08/05 12:15:06.837] INFO  Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.
[2021/08/05 12:15:07.841] INFO  Simulation is stopping. Model iteration 11 has hit or exceeded simulation stop iteration 11.
[2021/08/05 12:15:10.060] INFO  done profiling (GPU, Float64, 128)
==45925== Profiling application: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project nonhydrostatic_profiler.jl
==45925== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.46%  17.966ms        10  1.7966ms  1.7946ms  1.7987ms  _Z23julia_gpu_calculate_Gv_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gv_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
                   19.93%  17.500ms        10  1.7500ms  1.7462ms  1.7527ms  _Z23julia_gpu_calculate_Gu_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gu_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
                   12.91%  11.333ms        10  1.1333ms  1.1288ms  1.1414ms  _Z23julia_gpu_calculate_Gw_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gw_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
                    8.89%  7.8028ms        10  780.28us  778.01us  783.13us  _Z23julia_gpu_calculate_Gc_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gc_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE3ValILi1EE5WENO520IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES20_I23__velocities___tracers_S21_IS20_I12__u___v___w_S21_I9ZeroFieldS25_S25_EES20_I5__b__S21_IS25_EEEES20_I12__u___v___w_S21_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES20_I5__b__S21_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEv12_zeroforcingS20_I27__time___iteration___stage_S21_IS9_5Int64S27_EE
                    4.74%  4.1650ms        40  104.12us  97.055us  111.17us  _Z25julia_gpu_ab2_step_field_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu_ab2_step_field_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int64S9_S8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    4.17%  3.6600ms        40  91.499us  88.448us  95.808us  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                    2.53%  2.2193ms        40  55.482us  54.623us  56.192us  _Z33julia_gpu_store_field_tendencies_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE28_gpu_store_field_tendencies_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    2.09%  1.8318ms        10  183.18us  180.90us  184.51us  _Z39julia_gpu__pressure_correct_velocities_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE34_gpu__pressure_correct_velocities_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE10NamedTupleI12__u___v___w_5TupleI11OffsetArrayI7Float64Li3E13CuDeviceArrayIS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEEEE22RegularRectilinearGridIS11_8PeriodicS14_7BoundedS10_IS11_Li1E12StepRangeLenIS11_14TwicePrecisionIS11_ES17_IS11_EEEE5Int64S10_IS11_Li3ES12_IS11_Li3ELi1EEE
                    2.07%  1.8141ms        20  90.705us  88.448us  92.864us  [CUDA memcpy DtoD]
                    2.05%  1.7988ms       190  9.4670us  6.3680us  14.592us  _Z27julia_broadcast_kernel_478815CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI9UnitRangeI5Int64E5SliceI5OneToIS5_EES6_IS7_IS5_EEELifalseEE11BroadcastedIvS3_IS7_IS5_ES7_IS5_ES7_IS5_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_ES6_IS7_IS5_EES6_IS7_IS5_EEELifalseEES3_I4BoolS11_S11_ES3_IS5_S5_S5_EEEES5_
                    2.03%  1.7807ms        20  89.036us  86.687us  91.328us  void vector_fft<unsigned int=128, unsigned int=8, unsigned int=2, padding_t=6, twiddle_t=0, loadstore_modifier_t=2, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
                    1.97%  1.7324ms        10  173.24us  171.10us  174.98us  julia_broadcast_kernel_20870(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, _real, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
                    1.93%  1.6951ms        20  84.753us  83.871us  85.599us  void scal_kernel_val<double2, double>(cublasScalParamsVal<double2, double>)
                    1.66%  1.4567ms        10  145.67us  144.29us  147.58us  _Z28julia_broadcast_kernel_2031515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS6_S3_I12CuArrayStyleILi3EEv5_realS4_IS3_IS8_ILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS11_S11_ES4_IS6_S6_S6_EES10_IS0_IS1_IS2_ELi3ELi1EES4_IS11_S11_S11_ES4_IS6_S6_S6_EEEEEEEES6_
                    1.61%  1.4105ms        10  141.05us  139.39us  143.17us  _Z58julia_gpu_calculate_pressure_source_term_fft_based_solver_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE53_gpu_calculate_pressure_source_term_fft_based_solver_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE5Int6410NamedTupleI12__u___v___w_5TupleIS14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEEEE
                    1.32%  1.1596ms        10  115.96us  114.50us  117.31us  _Z28julia_gpu_permute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu_permute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
                    1.31%  1.1496ms        10  114.96us  113.86us  116.48us  _Z30julia_gpu_unpermute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE25_gpu_unpermute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
                    1.25%  1.0947ms        11  99.522us  97.696us  100.64us  _Z38julia_gpu_update_hydrostatic_pressure_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE33_gpu_update_hydrostatic_pressure_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE8BuoyancyI14BuoyancyTracer10ZDirectionE10NamedTupleI5__b__5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    1.15%  1.0115ms        10  101.15us  100.32us  101.98us  _Z28julia_broadcast_kernel_2045215CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS3_I12CuArrayStyleILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS10_S10_ES4_IS6_S6_S6_EEEES3_IS8_ILi3EEvS7_S4_IS3_IS8_ILi3EEvS7_S4_IS9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EEEES6_EEEES6_
                    1.11%  974.43us       190  5.1280us  4.6080us  6.9760us  _Z27julia_broadcast_kernel_491915CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI5SliceI5OneToI5Int64EE9UnitRangeIS6_ES4_IS5_IS6_EEELifalseEE11BroadcastedIvS3_IS5_IS6_ES5_IS6_ES5_IS6_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_IS6_EES7_IS6_ES4_IS5_IS6_EEELifalseEES3_I4BoolS11_S11_ES3_IS6_S6_S6_EEEES6_
                    1.03%  905.27us        10  90.527us  90.239us  91.007us  julia_broadcast_kernel_20610(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, __, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>, Int64<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
                    0.82%  722.97us        10  72.296us  71.968us  72.703us  _Z30julia_gpu_copy_real_component_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE25_gpu_copy_real_component_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEES10_I7ComplexIS9_ELi3ELi1EE
                    0.70%  614.46us        10  61.446us  60.800us  62.463us  _Z33julia_partial_mapreduce_grid_71539_identity2__4Bool16CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11BroadcastedI12CuArrayStyleILi3EES3_IS4_IS5_ES4_IS5_ES4_IS5_EE6_isnanS3_IS7_I7Float64Li3ELi1EEEE
                    0.62%  545.31us        74  7.3690us  4.5440us  15.904us  _Z28julia_gpu__fill_bottom_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu__fill_bottom_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
                    0.61%  535.07us        74  7.2300us  3.9040us  15.104us  _Z25julia_gpu__fill_top_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu__fill_top_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
                    0.17%  151.97us        42  3.6180us  2.4960us  7.8720us  _Z36julia_gpu_set_top_bottom_w_velocity_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE31_gpu_set_top_bottom_w_velocity_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int6417BoundaryConditionI4OpenvE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE10NamedTupleI27__time___iteration___stage_5TupleIS9_S11_S11_EES19_I16__u___v___w___b_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.12%  101.34us        10  10.134us  9.9520us  10.400us  _Z33julia_partial_mapreduce_grid_73419_identity2__4Bool16CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li5ELi1EES7_IS1_Li4ELi1EE
                    0.07%  63.776us        10  6.3770us  4.8320us  7.9040us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.07%  60.160us        10  6.0160us  4.0320us  7.2960us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.07%  60.096us        10  6.0090us  5.1200us  7.3600us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.07%  60.096us        10  6.0090us  3.8080us  8.4160us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionI4FluxvES18_IS19_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.07%  57.952us        10  5.7950us  3.1040us  7.6480us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  54.304us        10  5.4300us  3.2640us  7.6800us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  54.208us        10  5.4200us  2.6880us  7.5520us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4OpenvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  53.024us        10  5.3020us  4.0960us  7.1680us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  49.152us        10  4.9150us  2.4640us  7.1040us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  48.640us        10  4.8640us  2.4640us  6.7520us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.06%  48.448us        10  4.8440us  3.1680us  7.2960us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.05%  46.080us        10  4.6080us  3.2000us  7.7120us  _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.01%  12.031us        10  1.2030us  1.1520us  1.5360us  [CUDA memcpy DtoH]
                    0.01%  8.5120us        10     851ns     736ns  1.6960us  [CUDA memcpy HtoD]
      API calls:   94.28%  480.79ms       931  516.43us  11.854us  466.22ms  cuLaunchKernel
                    2.89%  14.713ms      8320  1.7680us  1.3220us  11.862us  cuStreamQuery
                    1.44%  7.3189ms     13013     562ns     426ns  10.621us  cuCtxGetCurrent
                    0.32%  1.6540ms       982  1.6840us  1.1600us  10.417us  cuStreamWaitEvent
                    0.23%  1.1818ms        80  14.772us  12.505us  19.825us  cudaLaunchKernel
                    0.20%  1.0343ms       727  1.4220us  1.1430us  6.1680us  cuEventRecord
                    0.17%  884.02us       727  1.2150us     871ns  8.6740us  cuEventCreate
                    0.10%  504.05us        10  50.404us  5.7870us  429.47us  cuStreamCreate
                    0.08%  433.01us       440     984ns     810ns  3.1500us  cuOccupancyMaxPotentialBlockSize
                    0.07%  372.16us        18  20.675us  17.173us  28.561us  cuMemAlloc
                    0.07%  364.98us        20  18.248us  16.190us  33.333us  cuMemcpyDtoDAsync
                    0.05%  236.55us        10  23.655us  21.724us  26.417us  cuMemcpyDtoHAsync
                    0.04%  207.19us       370     559ns     478ns  1.6810us  cuDeviceGetAttribute
                    0.02%  114.83us        10  11.483us  10.220us  15.535us  cuMemcpyHtoDAsync
                    0.01%  50.198us        20  2.5090us  2.1560us  5.2240us  cuPointerGetAttribute
                    0.01%  29.353us        60     489ns     369ns     867ns  cudaGetErrorString
                    0.00%  22.948us        40     573ns     420ns  1.1350us  cudaGetLastError
                    0.00%  14.393us        20     719ns     588ns     862ns  cuCtxSetCurrent
                    0.00%  11.328us        20     566ns     531ns     593ns  cuCtxGetDevice
                    0.00%  4.5970us         1  4.5970us  4.5970us  4.5970us  cuDeviceGetCount

hennyg888 commented 3 years ago

@glwagner I also ran into some problems using StatProfilerHTML.jl to make flame graphs for CPU profiles. This is from the same script used to obtain the results above and shown in #1914, for a 128^3 nonhydrostatic model. The flame graphs don't display the function names; all I can see is "overdub". By hovering over the slabs and going up each flame stack I can usually find a function name that makes sense somewhere, but that prevents at-a-glance analysis of the flame graph.

[image: flame graph]

I thought this might have something to do with profiling run(simulation, 10) instead of a for loop of time_step!(model, 1), but apparently the result is the same in both cases.

francispoulin commented 3 years ago

Thanks @hennyg888 for sharing these results.

On the GPU I think it's great to see that the tendencies are the top 4 items on the list and the next is the time stepping.

I would have thought that pressure might be more expensive than any of these but apparently not.

glwagner commented 3 years ago

> @glwagner I also ran into some problems using StatProfilerHTML.jl to make flame graphs for CPU profiles. This is from the same script used to obtain the results above and shown in #1914 and it's a 128^3 nonhydrostatic model. The flame graphs don't display the function names, and all I can see is "overdub". By hovering my mouse over the slabs and going up each flame stack I can usually find a function name that makes sense somewhere but that prevents us from making at-a-glance analysis of the profile flame graph. [image] I thought that this might have something to do with profiling run(simulation, 10) instead of a for loop of time_step!(model,1) but apparently the result is the same for both cases.

I believe this is inevitable, because all our kernels are compiled through Cassette.jl, which "overdubs" the Julia compiler when compiling functions tagged with @kernel (the majority of our expensive kernels). This is part of the design of KernelAbstractions.jl...

Really great work @hennyg888. Perhaps the complexity of our function calls via KernelAbstractions.jl argues for a better profiling approach? Is there a way to "filter" the profiled output to remove data?

I think the next step towards improving performance is to figure out how to optimize the tendency calculations for CPU or GPU.
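On the filtering question: one possible approach is to post-process the collected stacks before rendering them. Below is a minimal sketch in Python, assuming the samples can be exported in the "folded" text format used by flame-graph tools (one stack per line, frames separated by `;`, followed by a sample count). Whether StatProfilerHTML.jl can export this is not something the thread confirms, and the stacks and frame names in `sample` are made up for illustration; they are not real Oceananigans output.

```python
from collections import Counter

def drop_frames(folded_lines, unwanted=("overdub",)):
    """Remove frames containing an unwanted substring, then re-merge identical stacks."""
    merged = Counter()
    for line in folded_lines:
        # Each folded line is "frame;frame;...;frame <count>".
        stack, _, count = line.rpartition(" ")
        frames = [f for f in stack.split(";") if not any(u in f for u in unwanted)]
        if frames:
            merged[";".join(frames)] += int(count)
    return merged

# Hypothetical folded stacks (illustrative names only).
sample = [
    "run!;time_step!;overdub;calculate_Gu!;overdub;div_Uu 40",
    "run!;time_step!;overdub;calculate_Gu!;overdub;overdub;div_Uu 10",
    "run!;time_step!;overdub;ab2_step_field! 5",
]

filtered = drop_frames(sample)
for stack, count in filtered.items():
    print(count, stack)
```

Stacks that differ only in the number of "overdub" frames collapse into one entry, so the remaining slabs carry the function names directly.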

glwagner commented 3 years ago

@christophernhill do you think you could produce a script with non-trivial dynamics involving the HydrostaticFreeSurfaceModel and the implicit solver?

We should also come up with something that exercises the tridiagonal solver on a vertically-stretched grid.

francispoulin commented 3 years ago

@glwagner : but I remember you had a flame graph that actually had names of functions in #1919. What did you do differently there?

christophernhill commented 3 years ago

> @christophernhill do you think you could produce a script with non-trivial dynamics involving the HydrostaticFreeSurfaceModel and the implicit solver?
>
> We should also come up with something that exercises the tridiagonal solver on a vertically-stretched grid.

@glwagner @francispoulin and @hennyg888, we could start from https://github.com/CliMA/Oceananigans.jl/blob/master/validation/barotropic_gyre/barotropic_gyre.jl. I'll check that it is still healthy. We can make the number of points bigger or smaller to look at problem size. Do we also want to try RegularLatitudeLongitudeGrid, or should we do a box first? This also has an ImmersedBoundaryGrid bump in the domain; we can get rid of that for now, but could include it down the road.

We should be able to add some vertical levels to this and turn on some implicit vertical diffusion - which is another tridiagonal solve?

glwagner commented 3 years ago

> @glwagner : but I remember you had a flame graph that actually had names of functions in #1919. What did you do differently there?

I didn't do anything differently --- I think perhaps because it was a different problem, the flame graph results were different?

christophernhill commented 3 years ago

@glwagner @francispoulin and @hennyg888 I added #1928 toward being able to do a meaningful HydrostaticFreeSurface. When #1928 is fixed we should be good to add a setup for benchmarking. 🤞

glwagner commented 3 years ago

@christophernhill is it possible to come up with a benchmark that does not use ContinuousBoundaryFunction, thereby avoiding the bug in #1928?

francispoulin commented 3 years ago

@christophernhill : I see that #1928 has now been merged. Do you have an example that you would like us to try benchmarking?

hennyg888 commented 3 years ago

@christophernhill @glwagner @ali-ramadhan I obtained some interesting results from profiling the shallow water model running on GPU. This was done on Satori's login-002.

The gist is that varying the grid size does not change the GPU activities, except when the grid gets very small, e.g. 128 x 128. All other grid resolutions profiled had about the same GPU activities, so only one set (2048 x 2048) is shown below. As far as @francispoulin and I can tell, the GPU activities look correct: the kernels that should take the most time do.

For API calls, however, the results differ a lot with grid resolution. As the grid increases in size, cuStreamQuery and then cuCtxGetCurrent become the dominant API calls; see the API call tables for the different grid sizes below. cuStreamQuery appears to be what checks the status of the CUDA streams, so this may be related to larger grids spending more time running kernels than launching them.

Oceananigans v0.61.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
  GPU: Tesla V100-SXM2-32GB

CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

2 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.432 GiB / 31.749 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)
nothing

[2021/08/11 22:39:51.084] INFO  Setting up benchmark: (GPU, Float64, 2048)...
[2021/08/11 22:40:32.330] INFO  warming up
[2021/08/11 22:41:32.311] INFO  Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.
[2021/08/11 22:41:32.729] WARN  Calling CUDA.@profile only informs an external profiler to start.
The user is responsible for launching Julia under a CUDA profiler.

It is recommended to use Nsight Systems, which supports interactive profiling:
$ nsys launch julia -@-> /home/henryguo/.julia/packages/CUDA/CtvPY/lib/cudadrv/profile.jl:71
[2021/08/11 22:41:32.777] INFO  Simulation is stopping. Model iteration 11 has hit or exceeded simulation stop iteration 11.
[2021/08/11 22:41:34.842] INFO  done profiling (GPU, Float64, 2048)
==41185== Profiling application: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project shallow_water_profiler.jl
==41185== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   36.32%  15.483ms        10  1.5483ms  1.5398ms  1.5571ms  _Z24julia_gpu_calculate_Gvh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE19_gpu_calculate_Gvh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_5WENO5vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES17_I2__S18_EvS17_I14__uh___vh___h_S18_I12_zeroforcingS19_S19_EES17_I27__time___iteration___stage_S18_IS9_5Int64S20_EE
                   35.40%  15.088ms        10  1.5088ms  1.5042ms  1.5122ms  _Z24julia_gpu_calculate_Guh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE19_gpu_calculate_Guh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_5WENO5vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES17_I2__S18_EvS17_I14__uh___vh___h_S18_I12_zeroforcingS19_S19_EES17_I27__time___iteration___stage_S18_IS9_5Int64S20_EE
                   13.03%  5.5520ms        30  185.07us  178.24us  192.03us  _Z25julia_gpu_ab2_step_field_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu_ab2_step_field_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int64S9_S8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    7.44%  3.1730ms        30  105.77us  103.10us  110.40us  _Z33julia_gpu_store_field_tendencies_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE28_gpu_store_field_tendencies_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
                    3.32%  1.4150ms        10  141.50us  140.86us  142.21us  _Z23julia_gpu_calculate_Gh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES16_I2__S17_EvS16_I14__uh___vh___h_S17_I12_zeroforcingS18_S18_EES16_I27__time___iteration___stage_S17_IS9_5Int64S19_EE
                    2.27%  966.33us        10  96.633us  95.647us  99.072us  _Z33julia_partial_mapreduce_grid_60479_identity2__4Bool16CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11BroadcastedI12CuArrayStyleILi3EES3_IS4_IS5_ES4_IS5_ES4_IS5_EE6_isnanS3_IS7_I7Float64Li3ELi1EEEE
                    0.79%  337.76us        66  5.1170us  4.8000us  5.6960us  _Z27julia_broadcast_kernel_514115CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI9UnitRangeI5Int64E5SliceI5OneToIS5_EES6_IS7_IS5_EEELifalseEE11BroadcastedIvS3_IS7_IS5_ES7_IS5_ES7_IS5_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_ES6_IS7_IS5_EES6_IS7_IS5_EEELifalseEES3_I4BoolS11_S11_ES3_IS5_S5_S5_EEEES5_
                    0.68%  289.05us        66  4.3790us  3.9360us  5.0240us  _Z27julia_broadcast_kernel_530115CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI5SliceI5OneToI5Int64EE9UnitRangeIS6_ES4_IS5_IS6_EEELifalseEE11BroadcastedIvS3_IS5_IS6_ES5_IS6_ES5_IS6_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_IS6_EES7_IS6_ES4_IS5_IS6_EEELifalseEES3_I4BoolS11_S11_ES3_IS6_S6_S6_EEEES6_
                    0.20%  83.359us        10  8.3350us  7.1680us  11.008us  _Z33julia_partial_mapreduce_grid_62649_identity2__4Bool16CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li5ELi1EES7_IS1_Li4ELi1EE
                    0.10%  42.590us        10  4.2590us  3.2320us  5.2480us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.09%  39.486us        10  3.9480us  2.7840us  4.6720us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.09%  38.720us        10  3.8720us  2.5920us  5.0240us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.08%  34.656us        10  3.4650us  3.1680us  4.1920us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.08%  33.886us        10  3.3880us  2.5280us  4.9270us  _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.08%  33.696us        10  3.3690us  2.5920us  4.0320us  _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
                    0.03%  12.800us        10  1.2800us  1.2160us  1.6320us  [CUDA memcpy DtoH]

grid = 16384 x 16384
      API calls:   70.92%  702.12ms    468805  1.4970us  1.2730us  101.02us  cuStreamQuery
                   28.00%  277.25ms    470363     589ns     433ns  15.851us  cuCtxGetCurrent
                    0.85%  8.3729ms       302  27.724us  11.380us  3.7689ms  cuLaunchKernel
                    0.05%  493.73us       300  1.6450us  1.1820us  4.9350us  cuStreamWaitEvent
                    0.04%  369.46us       253  1.4600us  1.2090us  3.5480us  cuEventRecord
                    0.03%  344.38us        20  17.218us  12.297us  22.727us  cuMemAlloc
                    0.03%  326.83us       253  1.2910us     939ns  2.5510us  cuEventCreate
                    0.03%  283.23us        10  28.323us  26.575us  32.548us  cuMemcpyDtoHAsync
                    0.02%  218.41us       152  1.4360us  1.2710us  6.3380us  cuOccupancyMaxPotentialBlockSize
                    0.02%  208.20us       370     562ns     480ns     851ns  cuDeviceGetAttribute
                    0.00%  24.869us        10  2.4860us  2.3320us  2.8590us  cuPointerGetAttribute
                    0.00%  16.819us         2  8.4090us  6.1830us  10.636us  cuStreamCreate
                    0.00%  14.325us        20     716ns     567ns     920ns  cuCtxSetCurrent
                    0.00%  10.905us        20     545ns     502ns     576ns  cuCtxGetDevice
                    0.00%  2.3610us         1  2.3610us  2.3610us  2.3610us  cuDeviceGetCount

grid = 4096 x 4096
      API calls:   60.78%  39.901ms     26114  1.5270us  1.2380us  125.51us  cuStreamQuery
                   22.99%  15.091ms     27670     545ns     432ns  5.9410us  cuCtxGetCurrent
                   12.95%  8.5006ms       302  28.147us  11.910us  3.9653ms  cuLaunchKernel
                    0.74%  483.32us       300  1.6110us  1.2110us  3.1970us  cuStreamWaitEvent
                    0.56%  369.64us       253  1.4610us  1.2300us  4.6640us  cuEventRecord
                    0.49%  319.93us       253  1.2640us     951ns  3.4240us  cuEventCreate
                    0.40%  261.89us        18  14.549us  11.922us  23.596us  cuMemAlloc
                    0.37%  241.30us        10  24.129us  20.979us  34.250us  cuMemcpyDtoHAsync
                    0.33%  214.83us       152  1.4130us  1.2690us  2.7320us  cuOccupancyMaxPotentialBlockSize
                    0.31%  201.30us       370     544ns     471ns     996ns  cuDeviceGetAttribute
                    0.04%  23.055us        10  2.3050us  1.7710us  4.1930us  cuPointerGetAttribute
                    0.03%  17.034us         2  8.5170us  6.2180us  10.816us  cuStreamCreate
                    0.02%  13.902us        20     695ns     574ns  1.0230us  cuCtxSetCurrent
                    0.02%  10.967us        20     548ns     477ns     719ns  cuCtxGetDevice
                    0.00%  3.0570us         1  3.0570us  3.0570us  3.0570us  cuDeviceGetCount

grid = 2048 x 2048
      API calls:   37.92%  8.8570ms       302  29.327us  11.432us  4.4105ms  cuLaunchKernel
                   36.94%  8.6294ms      5393  1.6000us  1.2680us  8.0800us  cuStreamQuery
                   15.99%  3.7341ms      6949     537ns     432ns  5.0180us  cuCtxGetCurrent
                    2.13%  496.43us       300  1.6540us  1.2310us  3.9350us  cuStreamWaitEvent
                    1.56%  364.41us       253  1.4400us  1.2460us  3.5350us  cuEventRecord
                    1.34%  313.77us       253  1.2400us     912ns  3.3890us  cuEventCreate
                    1.08%  251.42us        18  13.967us  11.806us  23.128us  cuMemAlloc
                    0.99%  230.45us        10  23.045us  20.917us  32.999us  cuMemcpyDtoHAsync
                    0.91%  212.61us       152  1.3980us  1.2300us  2.4020us  cuOccupancyMaxPotentialBlockSize
                    0.87%  203.87us       370     551ns     484ns     924ns  cuDeviceGetAttribute
                    0.08%  19.701us        10  1.9700us  1.7380us  3.3080us  cuPointerGetAttribute
                    0.07%  17.108us         2  8.5540us  6.2570us  10.851us  cuStreamCreate
                    0.06%  14.465us        20     723ns     560ns  1.2330us  cuCtxSetCurrent
                    0.05%  11.167us        20     558ns     459ns     785ns  cuCtxGetDevice
                    0.01%  2.2130us         1  2.2130us  2.2130us  2.2130us  cuDeviceGetCount

grid = 512 x 512
      API calls:   67.86%  8.3255ms       302  27.567us  11.810us  3.8990ms  cuLaunchKernel
                    7.98%  979.53us      1731     565ns     443ns  2.9160us  cuCtxGetCurrent
                    6.89%  845.51us       173  4.8870us  1.4420us  7.5840us  cuStreamQuery
                    3.82%  468.14us       300  1.5600us  1.1470us  2.6330us  cuStreamWaitEvent
                    2.94%  360.57us       253  1.4250us  1.2050us  9.9840us  cuEventRecord
                    2.59%  317.60us       253  1.2550us     932ns  3.1190us  cuEventCreate
                    2.19%  268.74us        20  13.436us  11.420us  23.667us  cuMemAlloc
                    1.87%  229.49us        10  22.948us  21.019us  31.754us  cuMemcpyDtoHAsync
                    1.72%  211.30us       152  1.3900us  1.2580us  2.3280us  cuOccupancyMaxPotentialBlockSize
                    1.63%  199.48us       370     539ns     469ns     756ns  cuDeviceGetAttribute
                    0.16%  19.342us        10  1.9340us  1.7360us  2.9230us  cuPointerGetAttribute
                    0.14%  17.131us         2  8.5650us  6.6240us  10.507us  cuStreamCreate
                    0.11%  13.659us        20     682ns     613ns     853ns  cuCtxSetCurrent
                    0.09%  11.188us        20     559ns     516ns     846ns  cuCtxGetDevice
                    0.02%  2.3790us         1  2.3790us  2.3790us  2.3790us  cuDeviceGetCount

grid = 128 x 128
      API calls:   66.93%  8.2732ms       302  27.394us  11.588us  3.8998ms  cuLaunchKernel
                    7.77%  959.95us      1731     554ns     433ns  2.5960us  cuCtxGetCurrent
                    6.96%  860.47us       173  4.9730us  4.4450us  7.9010us  cuStreamQuery
                    3.79%  468.98us       300  1.5630us  1.1700us  3.6250us  cuStreamWaitEvent
                    2.96%  365.37us       253  1.4440us  1.2160us  3.8400us  cuEventRecord
                    2.90%  358.58us       152  2.3590us  1.2750us  16.503us  cuOccupancyMaxPotentialBlockSize
                    2.57%  317.68us       253  1.2550us     920ns  3.3410us  cuEventCreate
                    2.21%  272.61us        20  13.630us  11.594us  23.538us  cuMemAlloc
                    1.84%  227.46us        10  22.745us  20.907us  32.177us  cuMemcpyDtoHAsync
                    1.55%  191.40us       350     546ns     485ns  1.0060us  cuDeviceGetAttribute
                    0.17%  21.476us        10  2.1470us  1.9050us  3.5970us  cuPointerGetAttribute
                    0.14%  17.065us         2  8.5320us  6.3880us  10.677us  cuStreamCreate
                    0.11%  13.557us        20     677ns     590ns     802ns  cuCtxSetCurrent
                    0.09%  10.935us        20     546ns     494ns     590ns  cuCtxGetDevice
                    0.02%  2.3300us         1  2.3300us  2.3300us  2.3300us  cuDeviceGetCount
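To make the trend in these tables easier to see, here is a small Python sketch that tabulates cuStreamQuery against grid size. The call counts and total times are copied from the API-call tables above; the script itself is only an illustration and was not part of the profiling run.

```python
# (grid points per side, cuStreamQuery calls, cuStreamQuery total time in ms),
# copied from the nvprof API-call tables above.
stream_query = [
    (128, 173, 0.86047),
    (512, 173, 0.84551),
    (2048, 5393, 8.6294),
    (4096, 26114, 39.901),
    (16384, 468805, 702.12),
]
LAUNCHES = 302  # cuLaunchKernel calls, the same in every table

rows = []
for n, calls, total_ms in stream_query:
    per_launch = calls / LAUNCHES        # stream queries per kernel launch
    avg_us = total_ms * 1000 / calls     # average cost of a single query
    rows.append((n, per_launch, avg_us))
    print(f"{n:>6} x {n:<6} queries/launch = {per_launch:8.1f}   avg per query = {avg_us:.2f} us")
```

The average cost of one query stays between roughly 1.5 and 5 us, while the number of queries per kernel launch grows from under 1 to about 1550. Between the 4096 and 16384 cases the grid has 16 times more points and the query count grows by roughly 18 times.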
hennyg888 commented 3 years ago

@christophernhill I also took a look at the GFlops.jl package. As said on its homepage: "GFlops.jl does not see what happens outside the realm of Julia code. It especially does not see operations performed in external libraries such as BLAS calls." It works similarly to the profile macro: it can either count the basic math operations performed by whatever follows the macro (@count_ops) or benchmark the code to report a Flops metric (@gflops). This doesn't seem to work with simulations, but it works fine for time_step!(model, 1), due to the benchmarking process performing many evaluations of the code. For the nonhydrostatic model running on CPU, @count_ops did not produce any results for either the simulation run or time_step!, and @gflops produced the results below for time_step!:

  0.02 GFlops,  0.04% peak  (1.89e+07 flop, 1.01e+00 s)
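As a sanity check, the reported rate can be reproduced from the two raw numbers in that line. The sketch below is Python arithmetic only. The flop-per-grid-point figure rests on two assumptions that this comment does not state: that the run used the 128^3 grid mentioned earlier in the thread, and that the flop count covers a single time_step! call.

```python
flop = 1.89e7     # floating-point operations reported by @gflops
seconds = 1.01    # run time reported by @gflops

gflops = flop / seconds / 1e9
print(f"{gflops:.4f} GFlops")

# ASSUMPTION: 128^3 grid and a single time step, neither confirmed for this run.
grid_points = 128 ** 3
flop_per_point = flop / grid_points
print(f"{flop_per_point:.2f} flop per grid point")
```

Under those assumptions the count works out to only about 9 operations per grid point, which would be consistent with the homepage caveat that GFlops.jl does not see everything that is executed.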
hennyg888 commented 3 years ago

According to @maleadt on the Julia slack's GPU channel and in regards to the shallow water model profiles:

Don't focus on time spent in API calls to much. since GPU execution is asynchronous, you'll have to synchronize at some point, and that API call will then 'soak up' time until the stream has finished executing. and here that's literally the synchronize function, which is implemented using cuStreamQuery: https://github.com/JuliaGPU/CUDA.jl/blob/2b3ec03ff9774b65541fc88dd6b0f1f7aea5d9e0/lib/cudadrv/stream.jl#L115-L144

use a timeline profiler (i.e. Nsight Systems) to profile your app, or nvvp if you really want to use the old profiler toolchain. plain nvprof results are too simple once your application hits some level of complexity

now, it is possible that our CPU-side implementation of synchronize does too many API calls and could be optimized a little, but in the end the call serves to wait until the GPU has finished so it probably doesn't matter much. if it does, e.g. because you want to perform other useful work on another CPU task concurrently, you could try to profile that in isolation and file an issue.

Essentially, Tim explains that cuStreamQuery takes up more time as the grid size increases because it's called in the synchronize function. The synchronize function, as shown in the link above, tends to be called more often and soaks up more waiting time the bigger the problem, which is why its time grows with grid size. Taking a closer look at the shallow water GPU profiling results above, it seems that cuStreamQuery takes up a lot of time in the finer resolution runs because it is called many times, not because each call takes a lot of time. For example, in the 16k case, cuStreamQuery is called three orders of magnitude more times than cuLaunchKernel, while both calls are measured in microseconds. I'm not sure if cuStreamQuery being called 400,000 times is an error with our code, an error with CUDA.jl, not an error at all, or an error with my profiling.
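
To illustrate why a polling-style wait racks up more query calls the longer the asynchronous work runs, here is a CPU-only analogy in plain Julia. It is a sketch under stated assumptions, not CUDA.jl's actual synchronize implementation: a Task stands in for GPU work, istaskdone stands in for a stream query, and poll_until_done is a hypothetical helper written for this example.

```julia
# CPU-only analogy: a Julia Task stands in for asynchronous GPU work and
# istaskdone stands in for a stream query. No CUDA.jl calls are used here.

# Start "work" lasting `duration` seconds, then poll until it finishes.
# Returns how many times the completion query ran.
function poll_until_done(duration)
    work = @async sleep(duration)   # returns immediately, like a kernel launch
    nqueries = 0
    while !istaskdone(work)         # non-blocking check, like a stream query
        nqueries += 1
        yield()                     # let the scheduler make progress
    end
    return nqueries
end

n_short = poll_until_done(0.01)
n_long  = poll_until_done(0.5)

println("queries while waiting 0.01 s: ", n_short)
println("queries while waiting 0.5 s:  ", n_long)
```

Each individual query is cheap, but the number of queries grows with how long the work takes, so the total time attributed to the query call grows too. That matches the pattern described above, where the call count rather than the per-call time dominates.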

maleadt commented 3 years ago

I'm not sure if cuStreamQuery being called 400,000 times is an error with our code, an error with CUDA.jl, not an error at all, or an error with my profiling.

I didn't know this was a KA.jl-based GPU workload when commenting on Slack. The dependency/event model of KernelAbstractions.jl also uses stream queries (i.e. cuStreamQuery) when selecting a new stream. Maybe that's the source of these calls. It'd be good to figure out where they come from: if it's from CUDA.jl, and thus presumably because of calling the synchronize function, (1) why are you synchronizing that much [1], and if it's for good reasons (2) does it hurt performance and should we tweak our synchronize implementation to perform fewer stream queries?

[1]: some synchronization happens implicitly, e.g. when copying memory to or from the CPU (https://github.com/JuliaGPU/CUDA.jl/blob/6758fcab7ae0d72659a1ca0d56ad2c86d3b451f1/src/array.jl#L385-L399). One way to avoid some of those synchronizations is to use pinned memory, but that's up to the application.

hennyg888 commented 3 years ago

@maleadt I used Nsight Systems' nsys to profile the exact same shallow water model setup shown above, with a grid size of 16384 x 16384, and got the following results:

[image]

From what I can see, the CUDA API row only starts getting filled with activity towards the end of the run, and most of it is cuStreamWaitEvent and some memcpy's. Another thing to note is that while viewing the CUDA API row's info in events view, as shown in the table below, I could not find a single call to cuStreamQuery, much less 400,000. As seen in the table, I sorted the events by name, and cuStreamQuery is nowhere to be found between cuStreamDestroy and cuStreamWaitEvent.