Closed hennyg888 closed 1 year ago
Nice, thanks!
I'm pretty surprised that ab2_step_field! dominates the cost. ab2_step_field! is a simple function that seems much cheaper than something like calculate_Gu!. What's going on?
I'm also noticing that the function is a bit sketchy because it uses the type of χ to convert 1.5 and 0.5. This is fine if χ is a floating point number, but not otherwise... it should probably use eltype(U).
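To make the eltype(U) suggestion concrete, here is a minimal sketch. The function name ab2_step_sketch! and its argument names are made up for illustration, and the update formula is an assumed stand-in for an AB2-style step, not the actual Oceananigans kernel. It takes the constants' type from the array being updated, so an integer χ no longer matters.

```julia
# Hypothetical stand-in for the AB2 update discussed above; not the Oceananigans kernel.
function ab2_step_sketch!(U, Δt, χ, G_now, G_prev)
    FT = eltype(U)                       # type of the field being updated, not typeof(χ)
    one_point_five = convert(FT, 1.5)
    oh_point_five  = convert(FT, 0.5)
    # In-place broadcast update using the current and previous tendencies
    @. U += Δt * ((one_point_five + χ) * G_now - (oh_point_five + χ) * G_prev)
    return U
end

U      = zeros(Float32, 4)
G_now  = ones(Float32, 4)
G_prev = ones(Float32, 4)

# χ is an Int here; this works because the constants come from eltype(U).
ab2_step_sketch!(U, 1f0, 0, G_now, G_prev)
```

With the typeof(χ) approach, the same call would try convert(Int, 1.5), which throws an InexactError in Julia.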
How did you run the profiler? Does it make sense to add a new profile directory to the source code (or maybe just add something to benchmark/)?
Might be worthwhile to profile with timestepper=:RungeKutta3 as a sanity check, considering that this benchmark suggests a simple time-stepping function is 12% (!) of the cost.
Another thought --- we should probably benchmark "fully loaded" models that at least use WENO advection (and perhaps some turbulence closure?), since that's more realistic. I think most usage of NonhydrostaticModel also has one tracer, rather than two (someday, we should change that default...).
I just edited an old benchmarkable incompressible model script so that it only has the model setup and time stepping. I did not profile from the start; I only profiled the time_step! function line.
I feel like the profiles depend more on which system has which profiler, so it might make sense to just add a few simple scripts in benchmark that consist only of model setup and time stepping. Those could be called profilables/benchmarkables.
@hennyg888, when you have time, could you add this line to the model constructor:
timestepper = :RungeKutta3,
The model should then use a different time-stepping scheme called RungeKutta3. This method should actually be slower, but it would be interesting to see whether it takes up more or less than the 12% that the default AdamsBashforth2 scheme uses.
Ok! I can help with that.
@hennyg888 etc.. - this looks great. If we can get some scripts together then we can start automating some of this so we can see how things change, as well as tracking down anomalies.
CUDA.jl has a chart ( https://speed.juliagpu.org/changes/?exe=6&env=1&tre=50 ) that shows timing trends for different bits of the system. Not sure how they generate this!
This https://github.com/tobami/codespeed looks to be what CUDA.jl timings tracking is based on.
Here's the code used for the profiling.
push!(LOAD_PATH, joinpath(@__DIR__, ".."))
#using BenchmarkTools
using CUDA
using Oceananigans
using Benchmarks
# Benchmark parameters
Arch = GPU
FT = Float64
N = 128
print_system_info()
# Define benchmarks
@info "Setting up benchmark: ($Arch, $FT, $N)..."
grid = RegularRectilinearGrid(FT, size=(N, N, N), extent=(1, 1, 1))
model = NonhydrostaticModel(architecture=Arch(), grid=grid)
@info "warming up"
time_step!(model, 1)
CUDA.@profile time_step!(model, 10000)
@info "done profiling ($Arch, $FT, $N)"
CPU profile with the script shown in #1914. Scroll to the right to see the specific line in the source file, the function name, and its parameters. Entries are sorted by ascending count of backtrace samples. The flat format is used because the tree format, which shows the call hierarchy, runs to more than 3000 lines. Functions with fewer than 100 samples have been removed manually. Samples are taken at regular intervals: the more often a function appears in the sampled backtraces, the higher its count and the more time it consumes.
Count Overhead File Line Function
===== ======== ==== ==== ========
101 0 @Oceananigans/src/TurbulenceClosures/abstract_isotropic_diffusivity_closure.jl 37 ν_σᶠᶜᶠ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
101 0 @Oceananigans/src/TurbulenceClosures/abstract_isotropic_diffusivity_closure.jl 37 overdub
103 0 @Oceananigans/src/Advection/weno_fifth_order.jl 106 left_biased_αy₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
103 0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl 52 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
104 0 @Oceananigans/src/Operators/interpolation_operators.jl 21 ℑxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
104 0 @Oceananigans/src/Advection/centered_fourth_order.jl 22 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
104 0 @Oceananigans/src/Advection/weno_fifth_order.jl 11 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
104 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
104 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 38 advective_momentum_flux_Wu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
104 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 38 overdub
105 0 @Oceananigans/src/Operators/interpolation_operators.jl 21 ℑxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
105 0 @Oceananigans/src/Advection/centered_fourth_order.jl 22 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
105 0 @Oceananigans/src/Advection/weno_fifth_order.jl 11 symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
105 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _symmetric_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
105 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 29 advective_momentum_flux_Vu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
105 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 29 overdub
105 0 @Oceananigans/src/Fields/abstract_field.jl 200 setindex!(::Field{Center, Face, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Floa...
106 0 @Base/subarray.jl 276 getindex
106 0 @Base/abstractarray.jl 1214 _getindex
106 0 @Oceananigans/src/Advection/weno_fifth_order.jl 221 overdub
106 0 @Oceananigans/src/Advection/weno_fifth_order.jl 112 right_biased_αx₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
106 0 @Oceananigans/src/Advection/weno_fifth_order.jl 168 right_biased_weno5_weights_x(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
107 107 @Base/array.jl 802 getindex
111 0 @Oceananigans/src/Fields/abstract_field.jl 231 fill_halo_regions!(::Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVe...
111 0 @Oceananigans/src/BoundaryConditions/fill_halo_regions.jl 18 fill_halo_regions!(::NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Perio...
111 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 74 advective_momentum_flux_Uw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
111 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 74 overdub
112 0 @Oceananigans/src/Operators/difference_operators.jl 23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
112 0 @Oceananigans/src/Operators/products_between_fields_and_grid_metrics.jl 45 Az_ηᶠᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twi...
112 0 @Oceananigans/src/Operators/products_between_fields_and_grid_metrics.jl 45 overdub
112 0 @Oceananigans/src/Advection/weno_fifth_order.jl 105 left_biased_αy₁(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
113 0 @Base/abstractarray.jl 1170 getindex
117 0 @Oceananigans/src/Advection/weno_fifth_order.jl 114 right_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
119 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 60 overdub
120 0 @AbstractFFTs/src/definitions.jl 249 *(p::AbstractFFTs.ScaledPlan{ComplexF64, FFTW.cFFTWPlan{ComplexF64, 1, true, 3, Vector{Int64}}, Float64}, x::Array{ComplexF64, 3})
122 0 @Oceananigans/src/Advection/weno_fifth_order.jl 118 right_biased_αy₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
126 0 @Oceananigans/src/Advection/weno_fifth_order.jl 146 overdub
127 0 @Oceananigans/src/Operators/interpolation_operators.jl 24 overdub
127 0 @Oceananigans/src/Advection/centered_fourth_order.jl 25 overdub
127 0 @Oceananigans/src/Advection/weno_fifth_order.jl 12 overdub
130 0 @Base/abstractarray.jl 984 copyto_unaliased!(deststyle::IndexCartesian, dest::SubArray{Float64, 3, Array{Float64, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, srcstyl...
130 0 @Base/abstractarray.jl 950 copyto!
130 0 @Base/broadcast.jl 977 copyto!
130 0 @Oceananigans/src/Fields/abstract_field.jl 200 setindex!(::Field{Center, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Fl...
132 0 @Oceananigans/src/Advection/weno_fifth_order.jl 116 right_biased_αy₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64},...
132 0 @Oceananigans/src/Advection/weno_fifth_order.jl 181 right_biased_weno5_weights_y(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
133 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 96 overdub
134 0 @Base/broadcast.jl 984 macro expansion
134 0 @Base/broadcast.jl 983 copyto!
135 0 @Oceananigans/src/Advection/weno_fifth_order.jl 172 overdub
136 0 @Oceananigans/src/BoundaryConditions/fill_halo_regions.jl 30 fill_halo_regions!
136 0 @Oceananigans/src/Advection/centered_fourth_order.jl 12 ℑ³xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
136 0 @Oceananigans/src/Advection/centered_fourth_order.jl 12 overdub
136 0 @Oceananigans/src/Advection/weno_fifth_order.jl 102 left_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
139 0 @Oceananigans/src/Operators/difference_operators.jl 26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
141 0 @Oceananigans/src/Advection/weno_fifth_order.jl 101 left_biased_αx₁(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
150 0 @Oceananigans/src/Operators/interpolation_operators.jl 20 ℑxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
150 0 @Oceananigans/src/Operators/interpolation_operators.jl 20 overdub
150 0 @Oceananigans/src/Advection/centered_fourth_order.jl 21 symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
150 0 @Oceananigans/src/Advection/centered_fourth_order.jl 21 overdub
150 0 @Oceananigans/src/Advection/weno_fifth_order.jl 15 symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
150 0 @Oceananigans/src/Advection/weno_fifth_order.jl 15 overdub
150 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _symmetric_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
150 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 20 advective_momentum_flux_Uu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision...
150 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 20 overdub
150 150 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl ? overdub
157 0 @Base/simdloop.jl 77 macro expansion
160 0 @Oceananigans/src/Advection/weno_fifth_order.jl 117 overdub
160 0 @Oceananigans/src/Advection/weno_fifth_order.jl 182 overdub
160 0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl 26 overdub
162 162 @FFTW/src/fft.jl 466 unsafe_execute!
162 0 @FFTW/src/fft.jl 727 *
166 0 @Base/math.jl 918 overdub
167 0 @Oceananigans/src/Advection/weno_fifth_order.jl 185 overdub
169 0 @Oceananigans/src/Fields/abstract_field.jl 190 getindex(::Field{Face, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float...
172 0 @Oceananigans/src/Advection/weno_fifth_order.jl 133 overdub
172 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 127 overdub
174 0 @Oceananigans/src/Advection/weno_fifth_order.jl 102 left_biased_αx₂(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
175 0 @Oceananigans/src/Advection/weno_fifth_order.jl 113 overdub
175 0 @Oceananigans/src/Advection/weno_fifth_order.jl 169 overdub
177 0 @Oceananigans/src/Models/NonhydrostaticModels/update_hydrostatic_pressure.jl 14 macro expansion
178 0 @Oceananigans/src/Advection/centered_fourth_order.jl 13 overdub
182 0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl 33 overdub
186 0 @Oceananigans/src/Advection/weno_fifth_order.jl 175 overdub
187 0 @KernelAbstractions/src/extras/loopinfo.jl 26 macro expansion
197 0 @Oceananigans/src/Operators/difference_operators.jl 11 δyᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
209 0 @Oceananigans/src/Operators/interpolation_operators.jl 21 overdub
209 0 @Oceananigans/src/Advection/centered_fourth_order.jl 22 overdub
209 0 @Oceananigans/src/Advection/weno_fifth_order.jl 11 overdub
217 0 @Oceananigans/src/Advection/weno_fifth_order.jl 100 left_biased_αx₀(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, ...
217 0 @Oceananigans/src/Advection/weno_fifth_order.jl 129 left_biased_weno5_weights_x(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
223 0 @Oceananigans/src/Solvers/discrete_transforms.jl 104 DiscreteTransform
223 0 none ? #31
223 0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl 49 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
223 0 @Base/array.jl 702 collect_to_with_first!
223 0 @Base/array.jl 683 collect(itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU, RegularRectilinearGrid{Fl...
226 0 @Oceananigans/src/Advection/weno_fifth_order.jl 104 overdub
226 0 @Oceananigans/src/Advection/weno_fifth_order.jl 142 overdub
230 0 @Oceananigans/src/Operators/difference_operators.jl 11 overdub
234 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 69 overdub
238 0 @Oceananigans/src/Advection/weno_fifth_order.jl 106 overdub
238 0 @Oceananigans/src/Advection/weno_fifth_order.jl 144 overdub
248 0 @Oceananigans/src/Advection/weno_fifth_order.jl 136 overdub
254 0 @Base/array.jl 724 collect_to!(dest::Vector{Nothing}, itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU...
263 0 @Oceananigans/src/Advection/weno_fifth_order.jl 105 overdub
263 0 @Oceananigans/src/Advection/weno_fifth_order.jl 143 overdub
264 0 @Base/broadcast.jl 936 copyto!
264 0 @Base/broadcast.jl 894 materialize!
264 0 @Base/broadcast.jl 891 materialize!
265 265 @Oceananigans/src/Operators/difference_operators.jl ? overdub
267 0 @Oceananigans/src/Advection/weno_fifth_order.jl 188 overdub
268 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 42 overdub
268 0 @Oceananigans/src/Advection/weno_fifth_order.jl 112 overdub
268 0 @Oceananigans/src/Advection/weno_fifth_order.jl 168 overdub
278 0 @Oceananigans/src/Solvers/discrete_transforms.jl 112 DiscreteTransform
278 0 @Oceananigans/src/Solvers/fft_based_poisson_solver.jl 66 solve_poisson_equation!(solver::Oceananigans.Solvers.FFTBasedPoissonSolver{CPU, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Floa...
279 0 @Oceananigans/src/Advection/weno_fifth_order.jl 118 overdub
279 0 @Oceananigans/src/Advection/weno_fifth_order.jl 183 overdub
280 0 @Oceananigans/src/Advection/weno_fifth_order.jl 216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
292 0 @Oceananigans/src/TurbulenceClosures/viscous_dissipation_operators.jl 19 overdub
292 0 @Oceananigans/src/Advection/weno_fifth_order.jl 134 overdub
296 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 57 overdub
301 0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl 121 macro expansion
306 0 @Oceananigans/src/Advection/weno_fifth_order.jl 114 overdub
306 0 @Oceananigans/src/Advection/weno_fifth_order.jl 170 overdub
308 0 @Base/array.jl 678 collect(itr::Base.Generator{Tuple{Oceananigans.Solvers.DiscreteTransform{FFTW.r2rFFTWPlan{ComplexF64, (5,), true, 3, Vector{Int64}}, Oceananigans.Solvers.Forward, CPU, RegularRectilinearGrid{Fl...
309 0 none ? #33
311 0 @Oceananigans/src/Fields/abstract_field.jl 190 getindex(::Field{Center, Center, Center, CPU, OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Flo...
311 0 @Oceananigans/src/Advection/weno_fifth_order.jl 116 overdub
311 0 @Oceananigans/src/Advection/weno_fifth_order.jl 181 overdub
316 0 @Oceananigans/src/Advection/weno_fifth_order.jl 147 overdub
316 316 @FFTW/src/fft.jl 496 unsafe_execute!
316 0 @FFTW/src/fft.jl 890 *
326 0 @Oceananigans/src/Advection/weno_fifth_order.jl 100 overdub
326 0 @Oceananigans/src/Advection/weno_fifth_order.jl 129 overdub
334 0 @Oceananigans/src/Advection/weno_fifth_order.jl 173 overdub
337 0 @Oceananigans/src/Operators/derivative_operators.jl 95 ∂yᶜᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
339 0 @Oceananigans/src/Advection/weno_fifth_order.jl 149 overdub
343 0 @Oceananigans/src/Advection/weno_fifth_order.jl 231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
344 0 @Oceananigans/src/Operators/derivative_operators.jl 95 overdub
356 0 @Oceananigans/src/Advection/weno_fifth_order.jl 101 overdub
356 0 @Oceananigans/src/Advection/weno_fifth_order.jl 130 overdub
366 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
366 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 115 overdub
393 0 @Oceananigans/src/Advection/weno_fifth_order.jl 226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
393 0 @Oceananigans/src/Advection/weno_fifth_order.jl 211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
397 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
397 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 75 overdub
423 0 @Oceananigans/src/Fields/abstract_field.jl 200 overdub
428 0 @Base/array.jl 841 setindex!(::Array{Float64, 3}, ::Float64, ::Int64, ::Int64, ::Int64)
428 0 @OffsetArrays/src/OffsetArrays.jl 430 overdub
429 0 @Oceananigans/src/Advection/weno_fifth_order.jl 212 overdub
430 0 @Base/promotion.jl 324 /(::Float64, ::Int64)
430 0 @Base/promotion.jl 324 overdub
431 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
431 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 116 overdub
437 0 @Oceananigans/src/Models/NonhydrostaticModels/pressure_correction.jl 40 macro expansion
440 0 @Oceananigans/src/Advection/weno_fifth_order.jl 217 overdub
446 0 @Oceananigans/src/Advection/weno_fifth_order.jl 102 overdub
446 0 @Oceananigans/src/Advection/weno_fifth_order.jl 131 overdub
458 0 @Base/array.jl 841 overdub
466 0 @Oceananigans/src/Advection/weno_fifth_order.jl 227 overdub
479 0 @Oceananigans/src/Advection/weno_fifth_order.jl 211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
494 0 @Oceananigans/src/Advection/weno_fifth_order.jl 226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
501 0 @Oceananigans/src/Solvers/discrete_transforms.jl 136 apply_transform!
517 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
517 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 76 overdub
538 0 @Oceananigans/src/Advection/weno_fifth_order.jl 186 overdub
551 0 @Oceananigans/src/Advection/weno_fifth_order.jl 216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
552 0 @Oceananigans/src/Advection/weno_fifth_order.jl 226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
555 0 @Oceananigans/src/Advection/weno_fifth_order.jl 135 overdub
562 0 @Base/generator.jl 47 iterate
576 0 @Oceananigans/src/Advection/weno_fifth_order.jl 240 left_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
576 0 @Oceananigans/src/Advection/weno_fifth_order.jl 240 overdub
576 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
576 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 21 overdub
590 0 @Oceananigans/src/Advection/weno_fifth_order.jl 244 right_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
590 0 @Oceananigans/src/Advection/weno_fifth_order.jl 244 overdub
590 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_xᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
590 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 22 overdub
604 0 @Oceananigans/src/Solvers/solve_for_pressure.jl 7 solve_for_pressure!
604 0 @Oceananigans/src/Models/NonhydrostaticModels/pressure_correction.jl 20 calculate_pressure_correction!(model::NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, ...
605 0 @Oceananigans/src/Advection/weno_fifth_order.jl 232 overdub
624 0 @Oceananigans/src/Advection/weno_fifth_order.jl 216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
655 0 @Oceananigans/src/Advection/weno_fifth_order.jl 226 right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
665 0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl 55 time_step!(model::NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.OffsetA...
672 0 @Oceananigans/src/Fields/abstract_field.jl 190 overdub
677 0 @Oceananigans/src/Advection/weno_fifth_order.jl 231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
681 0 @Base/array.jl 802 getindex(::Array{Float64, 3}, ::Int64, ::Int64, ::Int64)
681 0 @OffsetArrays/src/OffsetArrays.jl 409 overdub
685 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
685 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 107 overdub
690 0 @Oceananigans/src/Advection/weno_fifth_order.jl 174 overdub
698 0 @Base/array.jl 802 overdub
706 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
706 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 30 overdub
712 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 18 _advective_momentum_flux_Ww(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
712 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 18 overdub
725 0 @Oceananigans/src/Advection/weno_fifth_order.jl 231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
726 0 @Oceananigans/src/Advection/weno_fifth_order.jl 231 right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
731 0 @Oceananigans/src/Advection/weno_fifth_order.jl 241 left_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
731 0 @Oceananigans/src/Advection/weno_fifth_order.jl 241 overdub
731 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
731 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 57 overdub
736 0 @Oceananigans/src/Operators/difference_operators.jl 27 δzᵃᵃᶠ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
745 0 @Oceananigans/src/TimeSteppers/quasi_adams_bashforth_2.jl 44 time_step!##kw
745 0 @Oceananigans/src/Simulations/run.jl 68 #ab2_or_rk3_time_step!#5
745 0 @Oceananigans/src/Simulations/run.jl 68 ab2_or_rk3_time_step!##kw
745 0 @Oceananigans/src/Simulations/run.jl 177 run!(sim::Simulation{NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.Offs...
749 0 @Oceananigans/src/Advection/weno_fifth_order.jl 187 overdub
754 0 @Oceananigans/src/Operators/difference_operators.jl 27 overdub
760 0 @Oceananigans/src/Advection/weno_fifth_order.jl 211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
763 0 @Oceananigans/src/Advection/weno_fifth_order.jl 211 left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
768 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
768 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 49 overdub
781 0 @Oceananigans/src/Simulations/run.jl 127 run!(sim::Simulation{NonhydrostaticModel{Oceananigans.TimeSteppers.QuasiAdamsBashforth2TimeStepper{Float64, NamedTuple{(:u, :v, :w, :b), Tuple{Field{Face, Center, Center, CPU, OffsetArrays.Offs...
781 0 @Base/boot.jl 360 eval
781 0 @Base/loading.jl 1116 include_string(mapexpr::typeof(identity), mod::Module, code::String, filename::String)
781 0 @Base/loading.jl 1170 _include(mapexpr::Function, mod::Module, _path::String)
781 0 @Base/Base.jl 386 include(mod::Module, _path::String)
781 0 @Base/client.jl 285 exec_options(opts::Base.JLOptions)
781 0 @Base/client.jl 485 _start()
796 796 @Cassette/src/context.jl ? overdub
821 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
821 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 85 overdub
860 860 @KernelAbstractions/src/compiler/contract.jl 18 sub_float_contract
860 0 @KernelAbstractions/src/compiler.jl 46 overdub
873 0 @Oceananigans/src/Advection/weno_fifth_order.jl 148 overdub
879 0 @Oceananigans/src/Operators/difference_operators.jl 23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
903 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
903 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 31 overdub
911 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
911 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 106 overdub
921 0 @Oceananigans/src/Advection/weno_fifth_order.jl 245 right_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
921 0 @Oceananigans/src/Advection/weno_fifth_order.jl 245 overdub
921 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _right_biased_interpolate_yᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePreci...
921 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 58 overdub
926 0 @Oceananigans/src/Advection/weno_fifth_order.jl 216 left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisi...
940 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_xᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
940 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 48 overdub
941 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 10 upwind_biased_product(::Float64, ::Float64, ::Float64)
941 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 10 overdub
1000 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 10 _advective_momentum_flux_Wu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1000 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 10 overdub
1018 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 _left_biased_interpolate_yᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecis...
1018 0 @Oceananigans/src/Advection/upwind_biased_advective_fluxes.jl 84 overdub
1022 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 14 _advective_momentum_flux_Wv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1022 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 14 overdub
1048 0 @Oceananigans/src/Operators/difference_operators.jl 26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1051 0 @Oceananigans/src/Operators/difference_operators.jl 26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1127 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 16 _advective_momentum_flux_Uw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1127 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 16 overdub
1127 0 @Oceananigans/src/Operators/difference_operators.jl 20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1148 1148 @Cassette/src/context.jl 456 call
1156 0 @Cassette/src/context.jl 454 fallback
1156 0 @Cassette/src/overdub.jl 582 _overdub_fallback(::Any, ::Vararg{Any, N} where N)
1156 0 @Cassette/src/overdub.jl 582 overdub
1229 1229 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl ? __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.Stati...
1318 0 @Oceananigans/src/Operators/difference_operators.jl 26 δzᵃᵃᶜ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1363 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 8 _advective_momentum_flux_Uu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1363 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 8 overdub
1363 0 @Oceananigans/src/Operators/difference_operators.jl 21 δxᶠᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1372 0 @Oceananigans/src/Operators/difference_operators.jl 21 overdub
1602 0 @Oceananigans/src/Operators/difference_operators.jl 20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1681 1681 @KernelAbstractions/src/compiler/contract.jl 18 mul_float_contract
1681 0 @KernelAbstractions/src/compiler.jl 47 overdub
1714 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 9 _advective_momentum_flux_Vu(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1714 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 9 overdub
1714 0 @Oceananigans/src/Operators/difference_operators.jl 23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1781 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 13 _advective_momentum_flux_Vv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1781 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 13 overdub
1781 0 @Oceananigans/src/Operators/difference_operators.jl 24 δyᵃᶠᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1791 0 @Oceananigans/src/Operators/difference_operators.jl 24 overdub
1797 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 12 _advective_momentum_flux_Uv(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1797 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 12 overdub
1797 0 @Oceananigans/src/Operators/difference_operators.jl 20 δxᶜᵃᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
1970 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 17 _advective_momentum_flux_Vw(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecisio...
1970 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 17 overdub
1970 0 @Oceananigans/src/Operators/difference_operators.jl 23 δyᵃᶜᵃ(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twice...
2094 0 @Oceananigans/src/Advection/weno_fifth_order.jl 226 overdub
2183 0 @Base/operators.jl 560 +(::Float64, ::Float64, ::Float64)
2262 0 @Base/operators.jl 560 overdub
2381 0 @Oceananigans/src/Advection/weno_fifth_order.jl 216 overdub
2395 0 @Oceananigans/src/Advection/weno_fifth_order.jl 211 overdub
2471 0 @Oceananigans/src/Advection/weno_fifth_order.jl 231 overdub
3033 3033 @KernelAbstractions/src/compiler/contract.jl 18 add_float_contract
3033 0 @KernelAbstractions/src/compiler.jl 45 overdub
3688 0 @Oceananigans/src/Operators/difference_operators.jl 26 overdub
4069 0 @Oceananigans/src/Advection/tracer_advection_operators.jl 28 div_Uc(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
4069 0 @Oceananigans/src/Advection/tracer_advection_operators.jl 28 overdub
4202 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 186 overdub
4469 4469 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl ? overdub
4570 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 57 div_𝐯u(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
4570 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 57 overdub
4651 0 @Oceananigans/src/Operators/difference_operators.jl 20 overdub
4791 0 @Oceananigans/src/Operators/difference_operators.jl 23 overdub
4979 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 45 u_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
4979 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 45 overdub
5018 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 87 div_𝐯w(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
5018 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 87 overdub
5247 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 139 w_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
5247 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 139 overdub
5452 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 72 div_𝐯v(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.Twic...
5452 0 @Oceananigans/src/Advection/momentum_advection_operators.jl 72 overdub
5694 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 94 v_velocity_tendency(::Int64, ::Int64, ::Int64, ::RegularRectilinearGrid{Float64, Periodic, Periodic, Bounded, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float6...
5694 0 @Oceananigans/src/Models/NonhydrostaticModels/velocity_and_tracer_tendencies.jl 94 overdub
9102 9102 @Base/float.jl 335 /(::Float64, ::Float64)
9102 0 @Base/float.jl 335 overdub
11777 0 @Oceananigans/src/Advection/topologically_conditional_interpolation.jl 36 overdub
21668 0 @KernelAbstractions/src/macros.jl 266 overdub
21676 4 @KernelAbstractions/src/cpu.jl 157 __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.Stati...
22912 0 @KernelAbstractions/src/cpu.jl 130 __run(obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(128, 128)}, typeof(Oceananigans.Boun...
22912 0 @KernelAbstractions/src/cpu.jl 22 (::KernelAbstractions.var"#33#34"{Tuple{KernelAbstractions.NoneEvent}, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIt...
Total snapshots: 24177
Thanks @hennyg888 for sharing these results.
I presume this is with `Profile` and not `ProfileView`, as we don't see the percentages spent on each function?
ProfileView creates a flame graph:
https://github.com/timholy/ProfileView.jl
I haven't seen a text-based profile viewer that shows percentages like you describe @francispoulin .
Thanks for clarifying, and sorry for my misunderstanding.
Interesting that `ab2` has 745 counts, which is relatively much lower than what we saw in the GPU case.
No worries, don't apologize! I made the same mistake after reading Hendrik Ranocha's blog post and seeing
But this is actually the output of benchmarking on individual components of the time-stepping scheme.
I think it'd be a good idea to set up similar microbenchmarks of the time-stepping components (`update_state!`, `calculate_tendencies!`, etc.). This is not quite the same as profiling, but it yields slightly more precise and more digestible information about the timing and relative cost of each component per time step.
@hennyg888 I think we need line info (not just file) to precisely interpret the profiling results?
By the way, `ProfileView.jl` does not play nice with multithreaded programs so we can't use it. I tried `StatProfilerHTML` and liked it:
@hennyg888 I think we need line info (not just file) to precisely interpret the profiling results?
If you scroll right in my big block of text, you can see columns showing the line number and function name within the file listed in the file column (which is visible without scrolling). The full file is attached below; it might be easier to view or reformat than the embedded code block above: nonhydrostatic_profile_flat.txt
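For reference, a flat listing with these columns can be produced with Julia's standard `Profile` library roughly as follows. This is only a sketch: `workload` is a hypothetical stand-in for the model's time-stepping call, and the `sortedby = :count` option is an assumption based on the ordering of the output above, not something confirmed in this thread.

```julia
using Profile

# Hypothetical stand-in workload; in the thread this would be the time-stepping call.
function workload(n)
    s = 0.0
    for i in 1:n
        s += sin(i) / (1 + i)
    end
    return s
end

workload(10)       # warm-up so that compilation is not profiled
Profile.clear()    # discard samples left over from earlier runs
@profile workload(50_000_000)

# Flat listing sorted by sample count.
# Columns: Count, Overhead, File, Line, Function.
Profile.print(format = :flat, sortedby = :count)
```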
I tried to avoid flame graphs and get something as close to percentages as I could, so I went with the default output. I'll add StatProfilerHTML.jl outputs as well, since the flame graphs and HTML files do look very neat. The very last row gives a total snapshot count of 24177. Dividing the count in the left-most column by this number should give the fraction of time spent on that line or in any functions called from that line.
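As a worked example of that division, here is a small sketch using a few counts copied from the flat profile above. The `percent` helper is a name chosen for illustration. Because each count includes time spent in called functions, the percentages are inclusive and are not expected to sum to 100.

```julia
# Sample counts copied from the flat profile above; the total is its last row.
total_snapshots = 24177
counts = [
    "time_step! (AB2)"    => 745,
    "u_velocity_tendency" => 4979,
    "w_velocity_tendency" => 5247,
    "v_velocity_tendency" => 5694,
]

# Inclusive share of all snapshots, in percent, rounded to one decimal place.
percent(count) = round(100 * count / total_snapshots, digits = 1)

for (name, count) in counts
    println(rpad(name, 24), lpad(percent(count), 5), " %")
end
```

For instance, the 745 counts attributed to `time_step!` correspond to about 3.1% of the 24177 snapshots.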
Here are profiling results for the nonhydrostatic model on GPU, using the script found in #1914. This was run on Satori with the WENO5 advection scheme, the AB2 time stepper, and a 128^3 grid. Time stepping now takes less than 5% of the time, and the kernels that should take the largest chunks of time (the tendency calculations) are doing so.
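To make the comparison concrete, the following sketch adds up the `Time(%)` values of the four tendency kernels and compares them with `ab2_step_field`. The numbers are copied by hand from the "GPU activities" rows of the profiler output below; the variable names are chosen for illustration only.

```julia
# Time(%) values copied from the "GPU activities" rows of the output below.
tendency_kernels = Dict(
    "calculate_Gv" => 20.46,
    "calculate_Gu" => 19.93,
    "calculate_Gw" => 12.91,
    "calculate_Gc" => 8.89,
)
ab2_step_field = 4.74

# Combined share of the tendency kernels, and how it compares to the AB2 step kernel.
tendency_share = sum(values(tendency_kernels))
println("tendency kernels: ", round(tendency_share, digits = 2), "% of GPU time")
println("ab2_step_field:   ", ab2_step_field, "% of GPU time")
println("ratio:            ", round(tendency_share / ab2_step_field, digits = 1), "x")
```

By this count the tendency kernels account for roughly 62% of GPU time, about 13 times the share of the AB2 step kernel.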
Oceananigans v0.60.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: Linux (powerpc64le-unknown-linux-gnu)
CPU: unknown
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
GPU: Tesla V100-SXM2-32GB
CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0
Libraries:
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
2 devices:
0: Tesla V100-SXM2-32GB (sm_70, 4.367 GiB / 31.749 GiB available)
1: Tesla V100-SXM2-32GB (sm_70, 4.805 GiB / 31.749 GiB available)
nothing
[2021/08/05 12:11:43.425] INFO Setting up benchmark: (GPU, Float64, 128)...
[2021/08/05 12:12:45.688] INFO warming up
[2021/08/05 12:15:06.837] INFO Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.
[2021/08/05 12:15:07.841] INFO Simulation is stopping. Model iteration 11 has hit or exceeded simulation stop iteration 11.
[2021/08/05 12:15:10.060] INFO done profiling (GPU, Float64, 128)
==45925== Profiling application: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project nonhydrostatic_profiler.jl
==45925== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 20.46% 17.966ms 10 1.7966ms 1.7946ms 1.7987ms _Z23julia_gpu_calculate_Gv_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gv_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
19.93% 17.500ms 10 1.7500ms 1.7462ms 1.7527ms _Z23julia_gpu_calculate_Gu_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gu_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES8_IS9_Li3ES10_IS9_Li3ELi1EEES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
12.91% 11.333ms 10 1.1333ms 1.1288ms 1.1414ms _Z23julia_gpu_calculate_Gw_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gw_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE5WENO5vv20IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES19_I23__velocities___tracers_S20_IS19_I12__u___v___w_S20_I9ZeroFieldS24_S24_EES19_I5__b__S20_IS24_EEEES19_I12__u___v___w_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES19_I5__b__S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEvS19_I16__u___v___w___b_S20_I12_zeroforcingS25_S25_S25_EES19_I27__time___iteration___stage_S20_IS9_5Int64S26_EE
8.89% 7.8028ms 10 780.28us 778.01us 783.13us _Z23julia_gpu_calculate_Gc_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gc_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE3ValILi1EE5WENO520IsotropicDiffusivityI26ExplicitTimeDiscretizationS9_10NamedTupleI5__b__5TupleIS9_EEE8BuoyancyI14BuoyancyTracer10ZDirectionES20_I23__velocities___tracers_S21_IS20_I12__u___v___w_S21_I9ZeroFieldS25_S25_EES20_I5__b__S21_IS25_EEEES20_I12__u___v___w_S21_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES20_I5__b__S21_IS8_IS9_Li3ES10_IS9_Li3ELi1EEEEEv12_zeroforcingS20_I27__time___iteration___stage_S21_IS9_5Int64S27_EE
4.74% 4.1650ms 40 104.12us 97.055us 111.17us _Z25julia_gpu_ab2_step_field_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu_ab2_step_field_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int64S9_S8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
4.17% 3.6600ms 40 91.499us 88.448us 95.808us void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
2.53% 2.2193ms 40 55.482us 54.623us 56.192us _Z33julia_gpu_store_field_tendencies_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE28_gpu_store_field_tendencies_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
2.09% 1.8318ms 10 183.18us 180.90us 184.51us _Z39julia_gpu__pressure_correct_velocities_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE34_gpu__pressure_correct_velocities_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE10NamedTupleI12__u___v___w_5TupleI11OffsetArrayI7Float64Li3E13CuDeviceArrayIS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEES10_IS11_Li3ES12_IS11_Li3ELi1EEEEE22RegularRectilinearGridIS11_8PeriodicS14_7BoundedS10_IS11_Li1E12StepRangeLenIS11_14TwicePrecisionIS11_ES17_IS11_EEEE5Int64S10_IS11_Li3ES12_IS11_Li3ELi1EEE
2.07% 1.8141ms 20 90.705us 88.448us 92.864us [CUDA memcpy DtoD]
2.05% 1.7988ms 190 9.4670us 6.3680us 14.592us _Z27julia_broadcast_kernel_478815CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI9UnitRangeI5Int64E5SliceI5OneToIS5_EES6_IS7_IS5_EEELifalseEE11BroadcastedIvS3_IS7_IS5_ES7_IS5_ES7_IS5_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_ES6_IS7_IS5_EES6_IS7_IS5_EEELifalseEES3_I4BoolS11_S11_ES3_IS5_S5_S5_EEEES5_
2.03% 1.7807ms 20 89.036us 86.687us 91.328us void vector_fft<unsigned int=128, unsigned int=8, unsigned int=2, padding_t=6, twiddle_t=0, loadstore_modifier_t=2, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>)
1.97% 1.7324ms 10 173.24us 171.10us 174.98us julia_broadcast_kernel_20870(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, _real, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
1.93% 1.6951ms 20 84.753us 83.871us 85.599us void scal_kernel_val<double2, double>(cublasScalParamsVal<double2, double>)
1.66% 1.4567ms 10 145.67us 144.29us 147.58us _Z28julia_broadcast_kernel_2031515CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS6_S3_I12CuArrayStyleILi3EEv5_realS4_IS3_IS8_ILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS11_S11_ES4_IS6_S6_S6_EES10_IS0_IS1_IS2_ELi3ELi1EES4_IS11_S11_S11_ES4_IS6_S6_S6_EEEEEEEES6_
1.61% 1.4105ms 10 141.05us 139.39us 143.17us _Z58julia_gpu_calculate_pressure_source_term_fft_based_solver_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE53_gpu_calculate_pressure_source_term_fft_based_solver_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE5Int6410NamedTupleI12__u___v___w_5TupleIS14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEES14_IS10_Li3ES8_IS10_Li3ELi1EEEEE
1.32% 1.1596ms 10 115.96us 114.50us 117.31us _Z28julia_gpu_permute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu_permute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
1.31% 1.1496ms 10 114.96us 113.86us 116.48us _Z30julia_gpu_unpermute_z_indices_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE25_gpu_unpermute_z_indices_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EES8_IS9_IS10_ELi3ELi1EE22RegularRectilinearGridIS10_8PeriodicS12_7Bounded11OffsetArrayIS10_Li1E12StepRangeLenIS10_14TwicePrecisionIS10_ES16_IS10_EEEE
1.25% 1.0947ms 11 99.522us 97.696us 100.64us _Z38julia_gpu_update_hydrostatic_pressure_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE33_gpu_update_hydrostatic_pressure_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEE8BuoyancyI14BuoyancyTracer10ZDirectionE10NamedTupleI5__b__5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
1.15% 1.0115ms 10 101.15us 100.32us 101.98us _Z28julia_broadcast_kernel_2045215CuKernelContext13CuDeviceArrayI7ComplexI7Float64ELi3ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES5_IS6_ES5_IS6_EE2__S4_IS3_I12CuArrayStyleILi3EEvS7_S4_I8ExtrudedIS0_IS1_IS2_ELi3ELi1EES4_I4BoolS10_S10_ES4_IS6_S6_S6_EEEES3_IS8_ILi3EEvS7_S4_IS3_IS8_ILi3EEvS7_S4_IS9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EES9_IS0_IS2_Li3ELi1EES4_IS10_S10_S10_ES4_IS6_S6_S6_EEEES6_EEEES6_
1.11% 974.43us 190 5.1280us 4.6080us 6.9760us _Z27julia_broadcast_kernel_491915CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI5SliceI5OneToI5Int64EE9UnitRangeIS6_ES4_IS5_IS6_EEELifalseEE11BroadcastedIvS3_IS5_IS6_ES5_IS6_ES5_IS6_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_IS6_EES7_IS6_ES4_IS5_IS6_EEELifalseEES3_I4BoolS11_S11_ES3_IS6_S6_S6_EEEES6_
1.03% 905.27us 10 90.527us 90.239us 91.007us julia_broadcast_kernel_20610(CuKernelContext, CuDeviceArray<Complex<Float64>, int=3, int=1>, Broadcasted<void, Tuple<OneTo<Int64>, Broadcasted<Tuple>, Broadcasted<Tuple>>, __, CuDeviceArray<Complex<Float64>, int=3, int=1, Extruded<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, Bool, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>, Int64<CuDeviceArray<Complex<Float64>, int=3, int=1>, CuDeviceArray<Complex<Float64>, int=3, int=1, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>, CuDeviceArray<Complex<Float64>, int=3, int=1, Tuple, Tuple, Tuple>>>>, Tuple)
0.82% 722.97us 10 72.296us 71.968us 72.703us _Z30julia_gpu_copy_real_component_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE25_gpu_copy_real_component_16CompilerMetadataI10StaticSizeI15_128__128__128_E12DynamicCheckvv7NDRangeILi3ES5_I11_8__8__128_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEES10_I7ComplexIS9_ELi3ELi1EE
0.70% 614.46us 10 61.446us 60.800us 62.463us _Z33julia_partial_mapreduce_grid_71539_identity2__4Bool16CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11BroadcastedI12CuArrayStyleILi3EES3_IS4_IS5_ES4_IS5_ES4_IS5_EE6_isnanS3_IS7_I7Float64Li3ELi1EEEE
0.62% 545.31us 74 7.3690us 4.5440us 15.904us _Z28julia_gpu__fill_bottom_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE23_gpu__fill_bottom_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
0.61% 535.07us 74 7.2300us 3.9040us 15.104us _Z25julia_gpu__fill_top_halo_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu__fill_top_halo_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE17BoundaryConditionI4FluxvE5Int64S13_
0.17% 151.97us 42 3.6180us 2.4960us 7.8720us _Z36julia_gpu_set_top_bottom_w_velocity_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE31_gpu_set_top_bottom_w_velocity_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int6417BoundaryConditionI4OpenvE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE10NamedTupleI27__time___iteration___stage_5TupleIS9_S11_S11_EES19_I16__u___v___w___b_S20_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.12% 101.34us 10 10.134us 9.9520us 10.400us _Z33julia_partial_mapreduce_grid_73419_identity2__4Bool16CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li5ELi1EES7_IS1_Li4ELi1EE
0.07% 63.776us 10 6.3770us 4.8320us 7.9040us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.07% 60.160us 10 6.0160us 4.0320us 7.2960us _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.07% 60.096us 10 6.0090us 5.1200us 7.3600us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.07% 60.096us 10 6.0090us 3.8080us 8.4160us _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionI4FluxvES18_IS19_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.07% 57.952us 10 5.7950us 3.1040us 7.6480us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 54.304us 10 5.4300us 3.2640us 7.6800us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 54.208us 10 5.4200us 2.6880us 7.5520us _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4OpenvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 53.024us 10 5.3020us 4.0960us 7.1680us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 49.152us 10 4.9150us 2.4640us 7.1040us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 48.640us 10 4.8640us 2.4640us 6.7520us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.06% 48.448us 10 4.8440us 3.1680us 7.2960us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_4FaceE22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.05% 46.080us 10 4.6080us 3.2000us 7.7120us _Z23julia_gpu__apply_z_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_z_bcs_16CompilerMetadataI10StaticSizeI10_128__128_E12DynamicCheckvv7NDRangeILi2ES5_I6_8__8_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_7BoundedS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionI4FluxvES19_IS20_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S22_EES21_I16__u___v___w___b_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.01% 12.031us 10 1.2030us 1.1520us 1.5360us [CUDA memcpy DtoH]
0.01% 8.5120us 10 851ns 736ns 1.6960us [CUDA memcpy HtoD]
API calls: 94.28% 480.79ms 931 516.43us 11.854us 466.22ms cuLaunchKernel
2.89% 14.713ms 8320 1.7680us 1.3220us 11.862us cuStreamQuery
1.44% 7.3189ms 13013 562ns 426ns 10.621us cuCtxGetCurrent
0.32% 1.6540ms 982 1.6840us 1.1600us 10.417us cuStreamWaitEvent
0.23% 1.1818ms 80 14.772us 12.505us 19.825us cudaLaunchKernel
0.20% 1.0343ms 727 1.4220us 1.1430us 6.1680us cuEventRecord
0.17% 884.02us 727 1.2150us 871ns 8.6740us cuEventCreate
0.10% 504.05us 10 50.404us 5.7870us 429.47us cuStreamCreate
0.08% 433.01us 440 984ns 810ns 3.1500us cuOccupancyMaxPotentialBlockSize
0.07% 372.16us 18 20.675us 17.173us 28.561us cuMemAlloc
0.07% 364.98us 20 18.248us 16.190us 33.333us cuMemcpyDtoDAsync
0.05% 236.55us 10 23.655us 21.724us 26.417us cuMemcpyDtoHAsync
0.04% 207.19us 370 559ns 478ns 1.6810us cuDeviceGetAttribute
0.02% 114.83us 10 11.483us 10.220us 15.535us cuMemcpyHtoDAsync
0.01% 50.198us 20 2.5090us 2.1560us 5.2240us cuPointerGetAttribute
0.01% 29.353us 60 489ns 369ns 867ns cudaGetErrorString
0.00% 22.948us 40 573ns 420ns 1.1350us cudaGetLastError
0.00% 14.393us 20 719ns 588ns 862ns cuCtxSetCurrent
0.00% 11.328us 20 566ns 531ns 593ns cuCtxGetDevice
0.00% 4.5970us 1 4.5970us 4.5970us 4.5970us cuDeviceGetCount
@glwagner I also ran into some problems using StatProfilerHTML.jl to make flame graphs for CPU profiles. This is from the same script used to obtain the results above and shown in #1914, and it's a 128^3 nonhydrostatic model. The flame graphs don't display the function names, and all I can see is "overdub". By hovering my mouse over the slabs and going up each flame stack I can usually find a function name that makes sense somewhere, but that prevents us from making at-a-glance analysis of the profile flame graph.
I thought that this might have something to do with profiling run(simulation, 10) instead of a for loop of time_step!(model,1) but apparently the result is the same for both cases.
Thanks @hennyg888 for sharing these results.
On the GPU I think it's great to see that the tendencies are the top 4 items on the list and the next is the time stepping.
I would have thought that pressure might be more expensive than any of these but apparently not.
@glwagner I also ran into some problems using StatProfilerHTML.jl to make flame graphs for CPU profiles. This is from the same script used to obtain the results above and shown in #1914, and it's a 128^3 nonhydrostatic model. The flame graphs don't display the function names, and all I can see is "overdub". By hovering my mouse over the slabs and going up each flame stack I can usually find a function name that makes sense somewhere, but that prevents us from making at-a-glance analysis of the profile flame graph. I thought that this might have something to do with profiling run(simulation, 10) instead of a for loop of time_step!(model,1) but apparently the result is the same for both cases.
I believe this is inevitable, because all our kernels are compiled through Cassette.jl, which "overdubs" the Julia compiler when compiling functions tagged with @kernel (the majority of our expensive kernels). This is part of the design of KernelAbstractions.jl...
Really great work @hennyg888. Perhaps the complexity of our function calls via KernelAbstractions.jl argues for a better profiling approach? Is there a way to "filter" the profiled output to remove data?
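One possible starting point for filtering, sketched here with Julia's built-in Profile standard library rather than StatProfilerHTML.jl: print a flat report sorted by sample count and hide lines below a sample threshold. The function `fake_time_step!` is a hypothetical stand-in for a real time step, not an Oceananigans function, and whether this hides enough of the "overdub" frames for a KernelAbstractions.jl kernel is untested here.

```julia
using Profile

# Hypothetical stand-in for a model time step (NOT an Oceananigans function).
function fake_time_step!(u)
    for i in eachindex(u)
        u[i] = sin(u[i]) + 1.5 * u[i]
    end
    return nothing
end

u = rand(10^6)
fake_time_step!(u)   # warm up so that compilation is not profiled

Profile.clear()
@profile for _ in 1:20
    fake_time_step!(u)
end

# Flat report, sorted by sample count, hiding lines with fewer than 10 samples.
io = IOBuffer()
Profile.print(io; format = :flat, sortedby = :count, mincount = 10)
report = String(take!(io))
println(report)
```

Raising `mincount` trims more of the low-count wrapper lines; `maxdepth` plays a similar role for the default tree format.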
I think the next step towards improving performance is to figure out how to optimize the tendency calculations for CPU or GPU.
@christophernhill do you think you could produce a script with non-trivial dynamics involving the HydrostaticFreeSurfaceModel and the implicit solver?
We should also come up with something that exercises the tridiagonal solver on a vertically-stretched grid.
@glwagner : but I remember you had a flame graph that actually had names of functions in #1919. What did you do differently there?
@christophernhill do you think you could produce a script with non-trivial dynamics involving the HydrostaticFreeSurfaceModel and the implicit solver? We should also come up with something that exercises the tridiagonal solver on a vertically-stretched grid.
@glwagner @francispoulin and @hennyg888, we could start from https://github.com/CliMA/Oceananigans.jl/blob/master/validation/barotropic_gyre/barotropic_gyre.jl ? I'll check that it is still healthy. We can make the number of points bigger or smaller to look at problem size. Do we want to also try RegularLatitudeLongitudeGrid, or should we do a box first? This also has an ImmersedBoundaryGrid bump in the domain. We can get rid of that for now, but could include it down the road.
We should be able to add some vertical levels to this and turn on some implicit vertical diffusion, which is another tridiagonal solve?
@glwagner : but I remember you had a flame graph that actually had names of functions in #1919. What did you do differently there?
I didn't do anything differently --- I think perhaps because it was a different problem, the flame graph results were different?
@glwagner @francispoulin and @hennyg888 I added #1928 toward being able to do a meaningful HydrostaticFreeSurface. When #1928 is fixed we should be good to add a setup for benchmarking. 🤞
@christophernhill is it possible to come up with a benchmark that does not use ContinuousBoundaryFunction, thereby avoiding the bug in #1928?
@christophernhill : I see that #1928 has now been merged. Do you have an example that you would like us to try benchmarking?
@christophernhill @glwagner @ali-ramadhan
I obtained some interesting results from profiling the shallow water model running on GPU. This was done on Satori's login-002.
The gist of it is that varying the grid size does not change the GPU activities except when the grid gets very small, e.g. 128 x 128. All other grid resolutions profiled had about the same GPU activities result as shown below, so only one set is shown. As far as @francispoulin and I know, the GPU activities seem to be correct, with what should be taking up the most time doing so.
However, for API calls, results differ a lot based on grid resolution. As the grid increases in size, cuStreamQuery and eventually cuCtxGetCurrent become the dominant API calls. See below the API call profile result tables for the different grid sizes. It seems that cuStreamQuery is what checks on the status of the CUDA streams, so larger grids spending more time running the kernels than launching them may have something to do with it.
Oceananigans v0.61.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
OS: Linux (powerpc64le-unknown-linux-gnu)
CPU: unknown
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
GPU: Tesla V100-SXM2-32GB
CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0
Libraries:
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
2 devices:
0: Tesla V100-SXM2-32GB (sm_70, 31.432 GiB / 31.749 GiB available)
1: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)
nothing
[2021/08/11 22:39:51.084] INFO Setting up benchmark: (GPU, Float64, 2048)...
[2021/08/11 22:40:32.330] INFO warming up
[2021/08/11 22:41:32.311] INFO Simulation is stopping. Model iteration 1 has hit or exceeded simulation stop iteration 1.
[2021/08/11 22:41:32.729] WARN Calling CUDA.@profile only informs an external profiler to start.
The user is responsible for launching Julia under a CUDA profiler.
It is recommended to use Nsight Systems, which supports interactive profiling:
$ nsys launch julia -@-> /home/henryguo/.julia/packages/CUDA/CtvPY/lib/cudadrv/profile.jl:71
[2021/08/11 22:41:32.777] INFO Simulation is stopping. Model iteration 11 has hit or exceeded simulation stop iteration 11.
[2021/08/11 22:41:34.842] INFO done profiling (GPU, Float64, 2048)
==41185== Profiling application: /nobackup/users/henryguo/projects/henry-test/julia-1.6.2/bin/julia --project shallow_water_profiler.jl
==41185== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 36.32% 15.483ms 10 1.5483ms 1.5398ms 1.5571ms _Z24julia_gpu_calculate_Gvh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE19_gpu_calculate_Gvh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_5WENO5vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES17_I2__S18_EvS17_I14__uh___vh___h_S18_I12_zeroforcingS19_S19_EES17_I27__time___iteration___stage_S18_IS9_5Int64S20_EE
35.40% 15.088ms 10 1.5088ms 1.5042ms 1.5122ms _Z24julia_gpu_calculate_Guh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE19_gpu_calculate_Guh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_5WENO5vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES17_I2__S18_EvS17_I14__uh___vh___h_S18_I12_zeroforcingS19_S19_EES17_I27__time___iteration___stage_S18_IS9_5Int64S20_EE
13.03% 5.5520ms 30 185.07us 178.24us 192.03us _Z25julia_gpu_ab2_step_field_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE20_gpu_ab2_step_field_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5Int64S9_S8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
7.44% 3.1730ms 30 105.77us 103.10us 110.40us _Z33julia_gpu_store_field_tendencies_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE28_gpu_store_field_tendencies_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES8_IS9_Li3ES10_IS9_Li3ELi1EEE
3.32% 1.4150ms 10 141.50us 140.86us 142.21us _Z23julia_gpu_calculate_Gh_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu_calculate_Gh_16CompilerMetadataI10StaticSizeI15_2048__2048__1_E12DynamicCheckvv7NDRangeILi3ES5_I13_128__128__1_ES5_I11_16__16__1_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE22RegularRectilinearGridIS9_8PeriodicS12_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES15_IS9_EEEES9_vvv10NamedTupleI14__uh___vh___h_5TupleIS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEES16_I2__S17_EvS16_I14__uh___vh___h_S17_I12_zeroforcingS18_S18_EES16_I27__time___iteration___stage_S17_IS9_5Int64S19_EE
2.27% 966.33us 10 96.633us 95.647us 99.072us _Z33julia_partial_mapreduce_grid_60479_identity2__4Bool16CartesianIndicesILi3E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_EEES2_ILi3ES3_IS4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li4ELi1EE11BroadcastedI12CuArrayStyleILi3EES3_IS4_IS5_ES4_IS5_ES4_IS5_EE6_isnanS3_IS7_I7Float64Li3ELi1EEEE
0.79% 337.76us 66 5.1170us 4.8000us 5.6960us _Z27julia_broadcast_kernel_514115CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI9UnitRangeI5Int64E5SliceI5OneToIS5_EES6_IS7_IS5_EEELifalseEE11BroadcastedIvS3_IS7_IS5_ES7_IS5_ES7_IS5_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_ES6_IS7_IS5_EES6_IS7_IS5_EEELifalseEES3_I4BoolS11_S11_ES3_IS5_S5_S5_EEEES5_
0.68% 289.05us 66 4.3790us 3.9360us 5.0240us _Z27julia_broadcast_kernel_530115CuKernelContext8SubArrayI7Float64Li3E13CuDeviceArrayIS1_Li3ELi1EE5TupleI5SliceI5OneToI5Int64EE9UnitRangeIS6_ES4_IS5_IS6_EEELifalseEE11BroadcastedIvS3_IS5_IS6_ES5_IS6_ES5_IS6_EE9_identityS3_I8ExtrudedIS0_IS1_Li3ES2_IS1_Li3ELi1EES3_IS4_IS5_IS6_EES7_IS6_ES4_IS5_IS6_EEELifalseEES3_I4BoolS11_S11_ES3_IS6_S6_S6_EEEES6_
0.20% 83.359us 10 8.3350us 7.1680us 11.008us _Z33julia_partial_mapreduce_grid_62649_identity2__4Bool16CartesianIndicesILi4E5TupleI5OneToI5Int64ES4_IS5_ES4_IS5_ES4_IS5_EEES2_ILi4ES3_IS4_IS5_ES4_IS5_ES4_IS5_ES4_IS5_EEE3ValILitrueEE13CuDeviceArrayIS1_Li5ELi1EES7_IS1_Li4ELi1EE
0.10% 42.590us 10 4.2590us 3.2320us 5.2480us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.09% 39.486us 10 3.9480us 2.7840us 4.6720us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.09% 38.720us 10 3.8720us 2.5920us 5.0240us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.08% 34.656us 10 3.4650us 3.1680us 4.1920us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6Center4FaceS12_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.08% 33.886us 10 3.3880us 2.5280us 4.9270us _Z23julia_gpu__apply_x_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_x_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI4Face6CenterS13_E22RegularRectilinearGridIS9_8PeriodicS15_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES18_IS9_EEEE17BoundaryConditionIS15_vES19_IS15_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S21_EES20_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.08% 33.696us 10 3.3690us 2.5920us 4.0320us _Z23julia_gpu__apply_y_bcs_7ContextI14__CUDACtx_Namevv14__PassType_257v12DisableHooksE18_gpu__apply_y_bcs_16CompilerMetadataI10StaticSizeI9_2048__1_E12DynamicCheckvv7NDRangeILi2ES5_I8_128__1_ES5_I8_16__16_EvvEE11OffsetArrayI7Float64Li3E13CuDeviceArrayIS9_Li3ELi1EEE5TupleI6CenterS12_S12_E22RegularRectilinearGridIS9_8PeriodicS14_4FlatS8_IS9_Li1E12StepRangeLenIS9_14TwicePrecisionIS9_ES17_IS9_EEEE17BoundaryConditionIS14_vES18_IS14_vE10NamedTupleI27__time___iteration___stage_S11_IS9_5Int64S20_EES19_I14__uh___vh___h_S11_IS8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEES8_IS9_Li3ES10_IS9_Li3ELi1EEEEE
0.03% 12.800us 10 1.2800us 1.2160us 1.6320us [CUDA memcpy DtoH]
grid = 16384 x 16384
API calls: 70.92% 702.12ms 468805 1.4970us 1.2730us 101.02us cuStreamQuery
28.00% 277.25ms 470363 589ns 433ns 15.851us cuCtxGetCurrent
0.85% 8.3729ms 302 27.724us 11.380us 3.7689ms cuLaunchKernel
0.05% 493.73us 300 1.6450us 1.1820us 4.9350us cuStreamWaitEvent
0.04% 369.46us 253 1.4600us 1.2090us 3.5480us cuEventRecord
0.03% 344.38us 20 17.218us 12.297us 22.727us cuMemAlloc
0.03% 326.83us 253 1.2910us 939ns 2.5510us cuEventCreate
0.03% 283.23us 10 28.323us 26.575us 32.548us cuMemcpyDtoHAsync
0.02% 218.41us 152 1.4360us 1.2710us 6.3380us cuOccupancyMaxPotentialBlockSize
0.02% 208.20us 370 562ns 480ns 851ns cuDeviceGetAttribute
0.00% 24.869us 10 2.4860us 2.3320us 2.8590us cuPointerGetAttribute
0.00% 16.819us 2 8.4090us 6.1830us 10.636us cuStreamCreate
0.00% 14.325us 20 716ns 567ns 920ns cuCtxSetCurrent
0.00% 10.905us 20 545ns 502ns 576ns cuCtxGetDevice
0.00% 2.3610us 1 2.3610us 2.3610us 2.3610us cuDeviceGetCount
grid = 4096 x 4096
API calls: 60.78% 39.901ms 26114 1.5270us 1.2380us 125.51us cuStreamQuery
22.99% 15.091ms 27670 545ns 432ns 5.9410us cuCtxGetCurrent
12.95% 8.5006ms 302 28.147us 11.910us 3.9653ms cuLaunchKernel
0.74% 483.32us 300 1.6110us 1.2110us 3.1970us cuStreamWaitEvent
0.56% 369.64us 253 1.4610us 1.2300us 4.6640us cuEventRecord
0.49% 319.93us 253 1.2640us 951ns 3.4240us cuEventCreate
0.40% 261.89us 18 14.549us 11.922us 23.596us cuMemAlloc
0.37% 241.30us 10 24.129us 20.979us 34.250us cuMemcpyDtoHAsync
0.33% 214.83us 152 1.4130us 1.2690us 2.7320us cuOccupancyMaxPotentialBlockSize
0.31% 201.30us 370 544ns 471ns 996ns cuDeviceGetAttribute
0.04% 23.055us 10 2.3050us 1.7710us 4.1930us cuPointerGetAttribute
0.03% 17.034us 2 8.5170us 6.2180us 10.816us cuStreamCreate
0.02% 13.902us 20 695ns 574ns 1.0230us cuCtxSetCurrent
0.02% 10.967us 20 548ns 477ns 719ns cuCtxGetDevice
0.00% 3.0570us 1 3.0570us 3.0570us 3.0570us cuDeviceGetCount
grid = 2048 x 2048
API calls: 37.92% 8.8570ms 302 29.327us 11.432us 4.4105ms cuLaunchKernel
36.94% 8.6294ms 5393 1.6000us 1.2680us 8.0800us cuStreamQuery
15.99% 3.7341ms 6949 537ns 432ns 5.0180us cuCtxGetCurrent
2.13% 496.43us 300 1.6540us 1.2310us 3.9350us cuStreamWaitEvent
1.56% 364.41us 253 1.4400us 1.2460us 3.5350us cuEventRecord
1.34% 313.77us 253 1.2400us 912ns 3.3890us cuEventCreate
1.08% 251.42us 18 13.967us 11.806us 23.128us cuMemAlloc
0.99% 230.45us 10 23.045us 20.917us 32.999us cuMemcpyDtoHAsync
0.91% 212.61us 152 1.3980us 1.2300us 2.4020us cuOccupancyMaxPotentialBlockSize
0.87% 203.87us 370 551ns 484ns 924ns cuDeviceGetAttribute
0.08% 19.701us 10 1.9700us 1.7380us 3.3080us cuPointerGetAttribute
0.07% 17.108us 2 8.5540us 6.2570us 10.851us cuStreamCreate
0.06% 14.465us 20 723ns 560ns 1.2330us cuCtxSetCurrent
0.05% 11.167us 20 558ns 459ns 785ns cuCtxGetDevice
0.01% 2.2130us 1 2.2130us 2.2130us 2.2130us cuDeviceGetCount
grid = 512 x 512
API calls: 67.86% 8.3255ms 302 27.567us 11.810us 3.8990ms cuLaunchKernel
7.98% 979.53us 1731 565ns 443ns 2.9160us cuCtxGetCurrent
6.89% 845.51us 173 4.8870us 1.4420us 7.5840us cuStreamQuery
3.82% 468.14us 300 1.5600us 1.1470us 2.6330us cuStreamWaitEvent
2.94% 360.57us 253 1.4250us 1.2050us 9.9840us cuEventRecord
2.59% 317.60us 253 1.2550us 932ns 3.1190us cuEventCreate
2.19% 268.74us 20 13.436us 11.420us 23.667us cuMemAlloc
1.87% 229.49us 10 22.948us 21.019us 31.754us cuMemcpyDtoHAsync
1.72% 211.30us 152 1.3900us 1.2580us 2.3280us cuOccupancyMaxPotentialBlockSize
1.63% 199.48us 370 539ns 469ns 756ns cuDeviceGetAttribute
0.16% 19.342us 10 1.9340us 1.7360us 2.9230us cuPointerGetAttribute
0.14% 17.131us 2 8.5650us 6.6240us 10.507us cuStreamCreate
0.11% 13.659us 20 682ns 613ns 853ns cuCtxSetCurrent
0.09% 11.188us 20 559ns 516ns 846ns cuCtxGetDevice
0.02% 2.3790us 1 2.3790us 2.3790us 2.3790us cuDeviceGetCount
grid = 128 x 128
API calls: 66.93% 8.2732ms 302 27.394us 11.588us 3.8998ms cuLaunchKernel
7.77% 959.95us 1731 554ns 433ns 2.5960us cuCtxGetCurrent
6.96% 860.47us 173 4.9730us 4.4450us 7.9010us cuStreamQuery
3.79% 468.98us 300 1.5630us 1.1700us 3.6250us cuStreamWaitEvent
2.96% 365.37us 253 1.4440us 1.2160us 3.8400us cuEventRecord
2.90% 358.58us 152 2.3590us 1.2750us 16.503us cuOccupancyMaxPotentialBlockSize
2.57% 317.68us 253 1.2550us 920ns 3.3410us cuEventCreate
2.21% 272.61us 20 13.630us 11.594us 23.538us cuMemAlloc
1.84% 227.46us 10 22.745us 20.907us 32.177us cuMemcpyDtoHAsync
1.55% 191.40us 350 546ns 485ns 1.0060us cuDeviceGetAttribute
0.17% 21.476us 10 2.1470us 1.9050us 3.5970us cuPointerGetAttribute
0.14% 17.065us 2 8.5320us 6.3880us 10.677us cuStreamCreate
0.11% 13.557us 20 677ns 590ns 802ns cuCtxSetCurrent
0.09% 10.935us 20 546ns 494ns 590ns cuCtxGetDevice
0.02% 2.3300us 1 2.3300us 2.3300us 2.3300us cuDeviceGetCount
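The API-call tables above can be condensed into two numbers per grid size: how many cuStreamQuery calls were made per cuLaunchKernel call, and the mean time per cuStreamQuery call. The following is a small sketch that only re-computes figures copied from those tables; the variable and field names are made up for illustration.

```julia
# Rows copied from the nvprof "API calls" tables above (shallow water model on GPU).
# query_calls / total_ms refer to cuStreamQuery; launches refers to cuLaunchKernel.
api_profiles = [
    (N = 128,   query_calls = 173,    total_ms = 0.86047, launches = 302),
    (N = 512,   query_calls = 173,    total_ms = 0.84551, launches = 302),
    (N = 2048,  query_calls = 5393,   total_ms = 8.6294,  launches = 302),
    (N = 4096,  query_calls = 26114,  total_ms = 39.901,  launches = 302),
    (N = 16384, query_calls = 468805, total_ms = 702.12,  launches = 302),
]

for p in api_profiles
    per_launch = p.query_calls / p.launches        # queries per kernel launch
    mean_us    = 1000 * p.total_ms / p.query_calls # mean time per query in microseconds
    println("N = ", p.N, ": ",
            round(per_launch, digits = 2), " cuStreamQuery calls per cuLaunchKernel, ",
            round(mean_us, digits = 2), " us per call")
end
```

The number of kernel launches is the same for every grid size, while the number of cuStreamQuery calls grows by more than three orders of magnitude between the smallest and largest grids, and the mean time per call does not grow.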
@christophernhill I also took a look at the GFlops.jl
package. As said on its homepage: "GFlops.jl does not see what happens outside the realm of Julia code. It especially does not see operations performed in external libraries such as BLAS calls."
It works similarly to the profile macro and it can count basic math operations performed by whatever follows the macro or benchmark it for its Flops metric. These doesn't seem to work with simulations but works fine for time_step!(model, 1)
due to the benchmarking process performing many evaluations of the code.
For the nonhydrostatic model running on CPU, @count_ops did not produce any results for either the simulation run or time_step!, and @gflops produced the results below for time_step!:
0.02 GFlops, 0.04% peak (1.89e+07 flop, 1.01e+00 s)
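As a quick sanity check on the line above, the reported rate should simply be the flop count divided by the elapsed time. The sketch below redoes that arithmetic in plain Julia; the variable names are illustrative and only the two numbers come from the output.

```julia
# Numbers copied from the @gflops output above.
flop = 1.89e7      # floating point operations counted
seconds = 1.01     # elapsed time reported

# Rate in GFlops: operations per second, divided by 1e9.
gflops = flop / seconds / 1e9

println(round(gflops, digits = 2))
```

The rounded value matches the 0.02 GFlops reported, so the low figure follows directly from the small operation count over a roughly one-second evaluation.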
According to @maleadt on the Julia Slack's GPU channel, regarding the shallow water model profiles:
Don't focus on time spent in API calls too much. since GPU execution is asynchronous, you'll have to synchronize at some point, and that API call will then 'soak up' time until the stream has finished executing. and here that's literally the synchronize function, which is implemented using cuStreamQuery: https://github.com/JuliaGPU/CUDA.jl/blob/2b3ec03ff9774b65541fc88dd6b0f1f7aea5d9e0/lib/cudadrv/stream.jl#L115-L144
use a timeline profiler (i.e. Nsight Systems) to profile your app, or nvvp if you really want to use the old profiler toolchain. plain nvprof results are too simple once your application hits some level of complexity
now, it is possible that our CPU-side implementation of synchronize does too many API calls and could be optimized a little, but in the end the call serves to wait until the GPU has finished so it probably doesn't matter much. if it does, e.g. because you want to perform other useful work on another CPU task concurrently, you could try to profile that in isolation and file an issue.
Essentially, Tim explains that cuStreamQuery takes up more time as the grid size increases because it is called inside the synchronize function. As shown in the link above, synchronize is what waits for the GPU to finish, so the bigger the problem, the more often it is called and the more waiting time it soaks up, which is why its cost grows with grid size.
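To make the "soaking up" effect concrete, here is a toy model in plain Julia. It is not CUDA.jl's implementation: `poll_until_done`, the fixed polling interval, and the work durations are all hypothetical, chosen only to show that a wait loop built on repeated status queries issues more queries the longer the GPU work takes.

```julia
# Toy model only: `poll_until_done` is a hypothetical stand-in for a
# synchronize loop that repeatedly asks "is the stream finished yet?".
# It does not call CUDA and does not mirror CUDA.jl's actual code.
function poll_until_done(work_ns::Int; poll_ns::Int = 500)
    elapsed = 0
    queries = 0
    while elapsed < work_ns   # the simulated GPU work is still running
        queries += 1          # one simulated status query
        elapsed += poll_ns    # assumed fixed time between queries
    end
    return queries
end

for work_ns in (10_000, 100_000, 1_000_000)
    println(work_ns, " ns of work -> ", poll_until_done(work_ns), " queries")
end
```

In this model the query count grows linearly with the amount of outstanding work, which is the qualitative behaviour described above. The real synchronize may back off or yield between queries.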
Taking a closer look at the shallow water GPU profiling results above, it seems that cuStreamQuery takes up a lot of time in the finer-resolution runs because it is called many times, not because each call takes a long time. For example, in the 16k case, cuStreamQuery is called three orders of magnitude more often than cuLaunchKernel, while individual calls to both are measured in microseconds.
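The same distinction between total time and per-call time can be checked against any nvprof row, since the average column is just total time divided by call count. The sketch below uses three rows from the 128 x 128 output listed earlier (not the 16k run); the helper `avg_us` is an illustrative name.

```julia
# (API call, total time in microseconds, number of calls),
# copied from the "grid = 128 x 128" nvprof rows above.
rows = [
    ("cuLaunchKernel", 8273.2, 302),
    ("cuCtxGetCurrent", 959.95, 1731),
    ("cuStreamQuery", 860.47, 173),
]

# Average time per call in microseconds.
avg_us(total_us, calls) = total_us / calls

for (name, total_us, calls) in rows
    println(rpad(name, 18), round(avg_us(total_us, calls), digits = 3), " us per call")
end
```

A call such as cuCtxGetCurrent costs only about half a microsecond each time, yet it still ranks second in total time here purely because of how often it is called.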
I'm not sure if cuStreamQuery
being called 400,000 times is an error with our code, an error with CUDA.jl, not an error at all, or an error with my profiling.
I'm not sure if
cuStreamQuery
being called 400,000 times is an error with our code, an error with CUDA.jl, not an error at all, or an error with my profiling.
I didn't know this was a KA.jl-based GPU workload when commenting on Slack. The dependency/event model of KernelAbstractions.jl also uses stream queries (i.e. cuStreamQuery
) when selecting a new stream. Maybe that's the source of these calls. It'd be good to figure out where they come from: if it's from CUDA.jl, and thus presumably because of calling the synchronize
function, (1) why are you synchronizing that much [1], and if it's for good reasons (2) does it hurt performance and should we tweak our synchronize
implementation to perform fewer stream queries?
[1]: some synchronization happens implicitly, e.g. when copying memory to or from the CPU (https://github.com/JuliaGPU/CUDA.jl/blob/6758fcab7ae0d72659a1ca0d56ad2c86d3b451f1/src/array.jl#L385-L399). One way to avoid some of those synchronizations is by using pinned memory, but that's up to the application.
@maleadt I used Nsight Systems' nsys to profile the exact same shallow water model setup shown above, with a grid size of 16384 x 16384, and got the following results:
From what I can see, the CUDA API row only starts filling with activity towards the end of the run, and most of it is cuStreamWaitEvent and some memcpy's. Another thing to note is that, while viewing the CUDA API row's info in the events view as shown in the table below, I could not find a single call to cuStreamQuery, much less 400,000. As seen in the table, I sorted the events by name, and cuStreamQuery is nowhere to be found between cuStreamDestroy and cuStreamWaitEvent.
Here are some profiling results that were done on Satori with nvprof. This is a GPU profile of the nonhydrostatic model.