Closed maximilian-gelbrecht closed 5 months ago
With this PR it seems that the roadmap is as follows:

* `LowerTriangularArray` so that the current `LowerTriangularMatrix` is the 2D special case of it (this PR)
* `OctahedralGaussianArray` so that the current `OctahedralGaussianGrid` is the 2D special case of it (another PR)
* `LowerTriangularArray` and 2D/3D `OctahedralGaussianArray` (or other ring grids) in the transforms, and adapt the kernels accordingly (another PR)

I don't really see a way to split up the last point into several PRs; that'll just be a monster PR then again I guess, but that's probably fine.
Yes this sounds reasonable.
For the third point, in theory we could split it into multiple parts by defining temporary aliases for these structures that allow accessing the n-dimensional arrays like the vectors of vectors.
It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.
It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.
I think I'd like to go step by step, simply because changing the indexing structure may introduce errors that we could spot more easily if we don't also do GPU kernels with KernelAbstractions in one go. However, one could already come up with a set of allowed operations / a recommended structure for how to write the kernels so that, while everything is still CPU-based and KernelAbstractions-independent, we have a kernel that is super close to what it's supposed to be.
So for example, if you have a look at the kernels and say what we shouldn't do for the GPU, then I'd be happy to write the kernels with that in mind already.
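As a sketch of what such a "GPU-ready but still CPU-based" kernel could look like (a hypothetical example under my own assumptions, not code from this PR): one flat index per iteration, which would become the thread index in a KernelAbstractions `@kernel`, with no branching on the upper triangle and no scalar reductions.

```julia
# Hypothetical sketch: a CPU loop written so it maps 1:1 onto a GPU kernel
# later. The loop body only touches one flat index k, which would become
# the thread index in a KernelAbstractions @kernel.
function axpy_kernel_style!(out::AbstractVector, a::Number,
                            x::AbstractVector, y::AbstractVector)
    @inbounds for k in eachindex(out, x, y)   # k -> thread index on GPU
        out[k] = a * x[k] + y[k]
    end
    return out
end
```

Written this way, porting to KernelAbstractions later would mostly mean replacing the `for` loop with `@index(Global)`.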
The PR is coming along nicely, but the big bit that's missing is the broadcasting on GPU. I've had a quick look into it, and I am not quite certain how to proceed there right now. To recap:
Our `LowerTriangularArray` stores only the lower triangle of an array/matrix by flattening the first two dimensions. It wraps around an array `data::AbstractArray{T,N-1}`, but is a subtype of `AbstractArray{T,N}`. By defining a custom `getindex` we can either index it with `N-1` indices by directly indexing the `data` array, or with `N` indices that are converted appropriately, so that it can actually behave like an `AbstractArray{T,N}` in most situations.
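To make the index conversion concrete, here is a minimal self-contained sketch of the idea for the 2D case (the names `MiniLTA` and `ij2k` and the exact formula are illustrative; the package's actual implementation may differ):

```julia
# Column-major flat index of entry (i, j) in the lower triangle of an m×n matrix
ij2k(i, j, m) = i + (j - 1) * m - j * (j - 1) ÷ 2

struct MiniLTA{T} <: AbstractMatrix{T}   # 2D case only, for illustration
    data::Vector{T}   # holds just the lower triangle, column by column
    m::Int
    n::Int
end

Base.size(L::MiniLTA) = (L.m, L.n)

# 2-index getindex: convert (i, j) to the flat index, zero above the diagonal
Base.getindex(L::MiniLTA, i::Int, j::Int) =
    j > i ? zero(eltype(L)) : L.data[ij2k(i, j, L.m)]
```

For a 3×3 matrix the 6 stored entries are `data[1:6]` in column-major order, so `MiniLTA(collect(1:6), 3, 3)[3, 2]` hits `data[5]` while `[1, 3]` returns zero.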
For the broadcast that's a problem, as it wants to access the upper triangle that doesn't exist, and therefore quickly causes either a `BoundsError` or incorrect results. On CPU I was able to fix this by defining my own broadcast style and extending the `copyto!(dest::LowerTriangularArray, bc::Broadcasted)` function, so that it correctly copies over only the lower triangle. On GPU, I looked into what's implemented over at GPUArrays.jl, but I am not quite certain whether we have to implement a `copyto!`/`_copyto!` with a new kernel here, or whether it's better to extend one of the other functions related to the broadcasting and the broadcast style defined by GPUArrays.jl. Is it possible to change the axes of the object just for the broadcasting? There's the `Broadcast.preprocess` function. Maybe that's something that could work?
What we want is that any broadcasting is directly applied to all elements of the `data::AbstractArray{T,N-1}` object that `LowerTriangularArray` wraps around, while ideally `LowerTriangularArray <: AbstractArray{T,N}`.
@vchuravy do you have some insights on how to define array broadcasting for the GPU? ☝🏼
I made an initial draft for GPU broadcasting that's working, but it uses the `k2ij` function, so it is quite costly. I more or less just adjusted the code from GPUArrays.jl for it. We could either:

* Keep it like that, and not use broadcasting for performance critical stuff
* Pre-compute all `k2ij` and keep them as a field of all `LowerTriangularArray`
* Intermediate way: at every broadcast compute the full `k2ij` matrix at once before the kernel is launched. I suspect there's a faster way to compute all indices at once than the formula involving the root to compute them individually. The only reason I didn't implement this yet is that I couldn't easily think of a way that's GPU compatible as well. On CPU it's just a nested loop counting up. That's not easily doable on GPU, but maybe it's also fine to always do it on CPU...

I think the GPU broadcasting needs to be tested a little bit more though.
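The "nested loop counting up" on the CPU could look like this minimal sketch (`triangle_length` and `precompute_k2ij` are hypothetical helper names, not the package's API):

```julia
# number of stored elements in the lower triangle of an m×n matrix (m >= n assumed)
triangle_length(m, n) = m * n - n * (n - 1) ÷ 2

# precompute the (i, j) pair for every flat index k by counting up,
# column by column -- the nested loop mentioned above
function precompute_k2ij(m, n)
    ij = Vector{Tuple{Int, Int}}(undef, triangle_length(m, n))
    k = 0
    for j in 1:n, i in j:m      # column-major over the lower triangle
        ij[k += 1] = (i, j)
    end
    return ij
end
```

A GPU-friendly variant would instead invert `ij2k` independently per `k` (the root formula), trading the sequential counting for one standalone computation per thread.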
Otherwise, I can slowly begin the cleanup of the PR and update the docs.
@milankl what's your opinion on the last open points from the code review, e.g. what `eachharmonic` should do?
I updated the documentation of the `LowerTriangularMatrices` submodule, but I didn't update all uses of `LowerTriangularMatrix` throughout the documentation. As `LowerTriangularMatrix` still works as before, as an alias of the 2D `LowerTriangularArray`, we could also leave it like that.
Awesome this is fantastic! Thank you so much 👏 I'll try to mirror your efforts here for the RingGrids then as the next step.
@milankl what's your opinion on the last open points from the code review, e.g. what `eachharmonic` should do?
I'm for deprecating both `eachharmonic` and `eachgridpoint` in favour of extending `eachindex`. I introduced them only because I thought they'd add more clarity, but I actually find them confusing, because lower triangular matrices don't have to be harmonics and there's no win in using `eachgridpoint(grid)` compared to `eachindex(grid)`. Have we tested whether lower triangular matrices being an `AbstractMatrix` but `eachindex` only hitting the lower triangle is a problem?
I made an initial draft for GPU broadcasting that's working, but it uses the `k2ij` function, so it is quite costly. I more or less just adjusted the code from GPUArrays.jl for it. We could either:

* Keep it like that, and not use broadcasting for performance critical stuff
* Pre-compute all `k2ij` and keep them as a field of all `LowerTriangularArray`
* Intermediate way: at every broadcast compute the full `k2ij` matrix at once before the kernel is launched. I suspect there's a faster way to compute all indices at once than the formula involving the root to compute them individually. The only reason I didn't implement this yet is that I couldn't easily think of a way that's GPU compatible as well. On CPU it's just a nested loop counting up. That's not easily doable on GPU, but maybe it's also fine to always do it on CPU...
I believe most of the broadcasting is for user interface purposes, but there are a few places in the dynamical core where we do `@. c*A + B` or similar, where `c::Number` and `A, B` are both lower triangular matrices. Maybe we can use that as a performance benchmark that shouldn't be much slower than it is now.
I'm tempted to go for "keep it like that" but then manually define the broadcasting operations that are performance critical so that we're not dealing with a performance regression inside the dynamical core. I'll look through the code now and put a list of these in here
Where we use broadcasting in the dynamical core:
I think that's it.
Have we tested whether lower triangular matrices being an `AbstractMatrix` but `eachindex` only hitting the lower triangle is a problem?
`eachindex` is defined as

```julia
eachindex(L::LowerTriangularArray) = eachindex(L.data)
```

so it just returns the valid indices of the underlying data. This overrides any defaults from the `AbstractArray` interface.
Where we use broadcasting in the dynamical core:
....
I think that's it.
So, basically that's:
The first two we have already covered. The last one we can easily define as

```julia
elementwise_mul(L1::LowerTriangularArray, L2::LowerTriangularArray) =
    LowerTriangularArray(L1.data .* L2.data, L1.m, L1.n)
```

(with some boundscheck added).
* Elementwise multiplication with another LTA (not sure?)
Yes, technically we don't need this, as it is mathematically not a well-defined operation for two sets of spherical harmonic coefficients, but it might come in handy if we want to broadcast operations like $\nabla^2 A$, where the $\nabla^2$ can be thought of as a matrix. So I'd add this while we're at it, if it doesn't cause complications.
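For illustration, the $\nabla^2$ "matrix" is diagonal in spectral space with eigenvalues $-l(l+1)$ (on the unit sphere, up to a radius factor), so applying it is exactly such an elementwise product over the flattened lower triangle. A hypothetical sketch, not the package's actual Laplacian implementation:

```julia
# Eigenvalues -l(l+1) of the Laplacian on the unit sphere, laid out in the
# same column-major flattened lower-triangle order as the spectral coefficients
function laplacian_eigenvalues(lmax::Int, mmax::Int)
    eig = Float64[]
    for m in 0:mmax, l in m:lmax    # order m (columns), degree l (rows)
        push!(eig, -l * (l + 1))
    end
    return eig
end

# applying the Laplacian then is just an elementwise product on the flat data
apply_laplacian(coeffs, lmax, mmax) = coeffs .* laplacian_eigenvalues(lmax, mmax)
```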
Where we use broadcasting in the dynamical core:
...
I think that's it.
Could you quickly write down for all of them whether it is a number, an LTA or something else? For most I know, but for some I am not sure. Then I'd add all of these kinds of operations to the tests for the broadcast. Then we have that covered as well.
@milankl There's something wrong with the documentation build, and I don't know what. It literally takes hours. Do you have any idea? This was already the case before I made any changes to the documentation; I am just noticing it now. The report shows several errors from `analysis.md`, but I don't know if that's really the culprit.
Huh, now the documentation build is just failing relatively quickly. The only important change that I made was to move `JLArrays` from a direct dependency to `[extras]` and `[test]`. Strange...
This is why:
┌ Error: failed to run `@example` block in src/analysis.md:69-74
│ ```@example analysis
│ using SpeedyWeather
│ spectral_grid = SpectralGrid(trunc=31, nlev=1)
│ model = ShallowWaterModel(;spectral_grid)
│ simulation = initialize!(model)
│ ```
│ exception =
│ MethodError: no method matching (LowerTriangularMatrix{ComplexF64})(::UndefInitializer, ::Int64, ::Int64)
│ Stacktrace:
│ [1] spectral(map::FullGaussianGrid{Float64}, S::SpectralTransform{Float64})
│ @ SpeedyWeather.SpeedyTransforms ~/work/SpeedyWeather.jl/SpeedyWeather.jl/src/SpeedyTransforms/spectral_transform.jl:564
I don't think this is the reason why the CI jobs are stuck for 6 hours though. It's a bit strange.
Next week I'll adjust it, so that really all tests of SpeedyWeather pass.
There's one initial change I already had to make:
```julia
const LowerTriangularMatrix{T} = LowerTriangularArray{T, 2, Vector{T}}
```

so it's always a CPU array. Now it should be really backwards compatible.
Another thing I noticed: the tests for `spectral_gradient` want to index the `LowerTriangularMatrix` with `:`. Previously this worked just via the fallback/default routines from `AbstractArray`. As we have a bit more complex indexing now with the n-dimensional case, this doesn't work anymore, but we could define it.
My suggestion:

* I could implement the `getindex` version with the `:`, but I also don't think we really need it
* We don't implement the `setindex` versions, and instead change the tests. Doing a `setindex!` is dangerous because it really only works along the first column and last row; otherwise some elements don't exist. So we also shouldn't implement it.

At the same time though, indexing trailing dimensions with `:` works. And indexing the lower triangle with ranges works as well. So something like `vor[1:10,1,:,:]` is always possible.
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Just dropping in here to say nice work :partying_face:
Just dropping in here to say nice work 🥳
You're always welcome to join the party!! It's fun, I promise! Have you seen the new cool Makie extension? And also I've documented convection and condensation, they're much more stable and mature now.
I don't think this is the reason why the CI jobs are stuck for 6 hours though. It's a bit strange.
Stuck sounds like #483 ?
`const LowerTriangularMatrix{T} = LowerTriangularArray{T, 2, Vector{T}}` so it's always a CPU array. Now it should be really backwards compatible.
That's great yeah, I think a lot of the interactive stuff would happen on the CPU anyway, so it's fine to fix it to a vector.
Another thing I noticed: the tests for `spectral_gradient` want to index the `LowerTriangularMatrix` with `:`. Previously this worked just via the fallback/default routines from `AbstractArray`. As we have a bit more complex indexing now with the n-dimensional case, this doesn't work anymore, but we could define it.
My suggestion:
* I could implement the `getindex` version with the `:`, but I also don't think we really need it
* We don't implement the `setindex` versions, and instead change the tests. Doing a `setindex!` is dangerous because it really only works along the first column and last row; otherwise some elements don't exist. So we also shouldn't implement it.
Yes, I'm happy for you to change these tests. If it's required that one can only set the lower triangle, that's fine; there should just be some interface to set individual indices and ranges of indices manually, as their use is outlined in that test.
At the same time though, indexing trailing dimensions with `:` works. And indexing the lower triangle with ranges works as well. So something like `vor[1:10,1,:,:]` is always possible.
This is great, will that return a 10xNxM `Array`? I believe the way more common use case will be `[:,1]` for a `LowerTriangularArray{T,3}` and `[:,1,1]` for 4D, meaning slicing out a lower triangular matrix from some 3D/4D array. That should return a `LowerTriangularMatrix` again, which a lot of our functionality is currently built around. Sure, the dynamical core will then see a move to deal with 3D/4D directly, but in the same way as `A[:,1]` of a matrix `A` returns (allocating) a `Vector`, I'd love to have an interface where I can just say

```julia
k, t = 1, 1
vor = simulation.prognostic_variables.vor[:,k,t]
heatmap(gridded(vor))
```

to obtain the spectral vorticity `vor` on layer `k` at timestep `t`, so that we have a convenient interface to inspect 3D/4D variables.
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Ouh, why is that? @jmert any insights? But technically the Legendre polynomials don't have to be a `LowerTriangularArray`; they are only precomputed when creating the `SpectralTransform`, only read/recomputed in `spectral!` and `gridded!`, and never exposed to the user or to other operations that should be convenient.
This is great, will that return a 10xNxM `Array`? I believe the way more common use case will be `[:,1]` for a `LowerTriangularArray{T,3}` and `[:,1,1]` for 4D, meaning slicing out a lower triangular matrix from some 3D/4D array. That should return a `LowerTriangularMatrix` again, which a lot of our functionality is currently built around. Sure, the dynamical core will then see a move to deal with 3D/4D directly, but in the same way as `A[:,1]` of a matrix `A` returns (allocating) a `Vector`, I'd love to have an interface where I can just say `k, t = 1, 1; vor = simulation.prognostic_variables.vor[:,k,t]; heatmap(gridded(vor))` to obtain the spectral vorticity `vor` on layer `k` at timestep `t`, so that we have a convenient interface to inspect 3D/4D variables.
Yes, that's a good point. Currently all `getindex` return an `Array` (or `CuArray`, or whatever the `LowerTriangularArray` wraps around). But it would be easy to implement the behaviour to return a `LowerTriangularMatrix` when the leading index is `:`. The "easy" way to implement this will almost surely be allocating though.
I guess we could also directly implement an appropriate `view` though, because this already works as well:

```julia
L = rand(LowerTriangularArray{Float32}, 10, 10, 5)
v = view(L.data, :, 1)
L2 = LowerTriangularArray(v, 10, 10)  # no allocations caused
L2[1,3]  # can be indexed as a regular LowerTriangularMatrix
```
Yes, that's a good point. Currently all `getindex` return an `Array` (or `CuArray`, or whatever the `LowerTriangularArray` wraps around). But it would be easy to implement the behaviour to return a `LowerTriangularMatrix` when the leading index is `:`. The "easy" way to implement this will almost surely be allocating though.
Allocating is fine, and probably even preferred, because that's what a user (incl. us) working with arrays would expect, see:

```julia
julia> A = rand(3,3)
3×3 Matrix{Float64}:
 0.737496  0.497421  0.7867
 0.115028  0.231549  0.465543
 0.585832  0.309741  0.569258

julia> v = A[:,1]  # allocate vector from first column
3-element Vector{Float64}:
 0.7374955706774863
 0.11502825073154921
 0.5858324735292674

julia> v[1] = 0  # changing first index
0

julia> A  # doesn't change the original array
3×3 Matrix{Float64}:
 0.737496  0.497421  0.7867
 0.115028  0.231549  0.465543
 0.585832  0.309741  0.569258
```
I guess we could also directly implement an appropriate `view` though, because this already works as well: `L = rand(LowerTriangularArray{Float32}, 10, 10, 5); v = view(L.data, :, 1); L2 = LowerTriangularArray(v, 10, 10); L2[1,3]`

Okay, that's pretty cool in case someone wants to do heavy data analysis avoiding allocations, but I see views more as nice-to-have, because I believe in the dynamical core we'll eventually just operate on the whole arrays, and for the user interface allocations are in most cases just fine!
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Ouh, why is that? @jmert any insights? But technically the Legendre polynomials don't have to be a `LowerTriangularArray`; they are only precomputed when creating the `SpectralTransform`, only read/recomputed in `spectral!` and `gridded!`, and never exposed to the user or to other operations that should be convenient.
The reason for the `CartesianIndex()` is that there's only a single implementation of the Legendre polynomial calculation, which is used for all combinations of scalar or vector inputs, leading to scalar (`(ell, m)`), vector (`(0:lmax, m)`), and matrix (`(0:lmax, 0:mmax)`) outputs. The 0-dimensional `CartesianIndex()` allows for (/ is a consequence of, here) using generic indexing that doesn't care about the input shape:

```julia
julia> let x = 0.5
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex()

julia> let x = zeros(1)
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex(1,)

julia> let x = zeros(1, 1)
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex(1, 1)
```
I added the indexing with a leading `:`, but there's a problem: currently the code gets stuck on initialising `PrimitiveDryModel` and `PrimitiveWetModel`. I tried initialising every component individually and this works without problems, but somehow calling `PrimitiveWetModel()` just never finishes. I don't know why. `BarotropicModel` and `ShallowWaterModel` work fine.
So this works fine:

```julia
# copied from PrimitiveWetModel (without type information)
using SpeedyWeather
spectral_grid = SpectralGrid()
geometry = Geometry(spectral_grid)

# DYNAMICS
dynamics::Bool = true
planet = Earth(spectral_grid)
atmosphere = EarthAtmosphere(spectral_grid)
coriolis = Coriolis(spectral_grid)
geopotential = Geopotential(spectral_grid)
adiabatic_conversion = AdiabaticConversion(spectral_grid)
particle_advection = NoParticleAdvection()
initial_conditions = InitialConditions(PrimitiveWet)

# BOUNDARY CONDITIONS
orography = EarthOrography(spectral_grid)
land_sea_mask = LandSeaMask(spectral_grid)
ocean = SeasonalOceanClimatology(spectral_grid)
land = SeasonalLandTemperature(spectral_grid)
solar_zenith = SpeedyWeather.WhichZenith(spectral_grid, planet)
albedo = AlbedoClimatology(spectral_grid)
soil = SeasonalSoilMoisture(spectral_grid)
vegetation = VegetationClimatology(spectral_grid)

# PHYSICS/PARAMETERIZATIONS
physics::Bool = true
clausius_clapeyron = ClausiusClapeyron(spectral_grid, atmosphere)
boundary_layer_drag = BulkRichardsonDrag(spectral_grid)
temperature_relaxation = NoTemperatureRelaxation(spectral_grid)
vertical_diffusion = SpeedyWeather.BulkRichardsonDiffusion(spectral_grid)
surface_thermodynamics = SurfaceThermodynamicsConstant(spectral_grid)
surface_wind = SurfaceWind(spectral_grid)
surface_heat_flux = SurfaceSensibleHeat(spectral_grid)
evaporation = SurfaceEvaporation(spectral_grid)
large_scale_condensation = ImplicitCondensation(spectral_grid)
convection = SimplifiedBettsMiller(spectral_grid)
shortwave_radiation = TransparentShortwave(spectral_grid)
longwave_radiation = JeevanjeeRadiation(spectral_grid)

# NUMERICS
device_setup = DeviceSetup(SpeedyWeather.CPUDevice())
time_stepping = Leapfrog(spectral_grid)
spectral_transform = SpectralTransform(spectral_grid)
implicit = ImplicitPrimitiveEquation(spectral_grid)
horizontal_diffusion = HyperDiffusion(spectral_grid)
vertical_advection = CenteredVerticalAdvection(spectral_grid)
hole_filling = ClipNegatives(spectral_grid)

# OUTPUT
output = OutputWriter(spectral_grid, PrimitiveWet)
callbacks = Dict{Symbol, SpeedyWeather.AbstractCallback}()
feedback = Feedback()
```
But just doing `PrimitiveWetModel()` doesn't. (This is on Julia 1.9, macOS x86_64.)
Same as #508?
Looks very similar, but not on Apple Silicon. I got an i5 in my laptop.
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
All tests pass on 1.10.
I'd still like to test it on a proper GPU, before merging, I'll probably do that later today.
Also, still should do a benchmark, now that that's available
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
It must be somehow related to the many parameters these structs have. I can check tomorrow, but I'd go about defining a shorter PrimitiveDryModel with fewer fields. Or there's one field that causes some kind of dispatch loop when used in model creation. After all, we are doing `PrimitiveDryModel(; a, b, c)` and so Julia figures out the parameters of `PrimitiveDryModel` somehow...
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
It must be somehow related to the many parameters these structs have. I can check tomorrow, but I'd go about defining a shorter PrimitiveDryModel with fewer fields. Or there's one field that causes some kind of dispatch loop when used in model creation. After all, we are doing `PrimitiveDryModel(; a, b, c)` and so Julia figures out the parameters of `PrimitiveDryModel` somehow...
I did some trial and error just now, and I think it might just be too many parameters. For a moment I thought I had found the culprit, but then commenting other fields back in got the thing stuck again. So my guess right now is that Julia 1.9 struggles with determining so many types/fields.
Benchmark results are pretty much identical and it runs with CuArray as well.
README.md.ltm_branch.md README.md.main_branch.md
From my side, I think this PR is pretty much finished.
After reviewing the RingGrids PR, I noticed that we may need to add another `similar` function for the GPU broadcasting. But right now I am not even sure when exactly this function matters, as all "regular" broadcasting tests already pass without it.
This wasn't possible:

```julia
julia> A = randn(LowerTriangularMatrix, 33, 32, 5)
ERROR: MethodError: no method matching randn(::Random.TaskLocalRNG, ::Type{LowerTriangularMatrix})
```
I've included `rand`, `randn` into the loop where also `zeros`, `ones` are defined, and removed other methods.

Edit: Now `f(LowerTriangularMatrix, l, m)` as well as `f(LowerTriangularArray, l, m)` (2D) are possible given that `Matrix <: Array`, but for 3D+ only `f(LowerTriangularArray, l, m, k...)` is possible, with `f` any of the constructors `ones, zeros, rand, randn`.
I still get this with a 5x5x5 matrix:

```julia
julia> L[6,5,1]  # should throw
ERROR: BoundsError: attempt to access 5×5×5 LowerTriangularArray{Float64, 3, Matrix{Float64}} at index [6, 5]

julia> L[5,5,6]  # should also throw!
3.6338242925240716e174
```

don't know why the tests passed for you?

EDIT: This is resolved below.
I'm replacing your

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    # to get a zero element in the correct shape, we just take the zero element of some
    # valid element; there are probably faster ways to do this (especially when ":" is
    # used), but this is just a fallback anyway
    j > i && return zero(getindex(L.data, 1, I[3:end]...))
    k = ij2k(i, j, L.m)
    @inbounds r = getindex(L.data, k, I[3:end]...)
    return r
end
```
with

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    @boundscheck j > i && return zero(T)
    k = ij2k(i, j, L.m)
    return getindex(L.data, k, I[3:end]...)
end
```
because

* with `@inline` the `@inbounds` from the caller will propagate to the `getindex(L.data, ...)`
* with `@inbounds` there's no check that the `I` indices are in bounds!
* the `@boundscheck` around the `j > i` case is then compiled away when using `@inbounds`. Zero should be returned for `[1,2]` for example, but I think we can expect people to only access the lower triangle in an `@inbounds` case; that also saves one `>` check for performance
* `zero(T)` works directly for `T`, e.g. `Float32`, and there is no need to request the data element at that index?

Yes, you are right! Thanks!
Okay, I think this is almost ready to go!! I still want to check that `@inbounds` is propagated correctly. Also, the module is still called `LowerTriangularMatrices`, but I think we should keep that to highlight it's for lower triangular matrices even if we stack them up in n-dimensional arrays, not that someone thinks the lower triangle becomes a quarter of a pyramid in 3D!
I still want to check that @inbounds is propagated correctly
No, they weren't; now all is corrected, I believe. It's a bit weird: I needed some `Base.@propagate_inbounds` (which also inlines) on `getindex` but not on `setindex!`. I'm now always testing with `[1,2]`, which is an invalid index, hence returning 0 for `getindex` or throwing a `BoundsError` for `setindex!`; but with `@inbounds` the `ij2k` function actually just returns `[m,1]`, which is wrong but a valid index. So it's a neat way to test whether the `@boundscheck` is compiled away or not (although actually reaching outside of the memory for tests isn't a good idea...).
I also just learned that GitHub Actions always checks bounds regardless of `@inbounds` annotations, so I've commented out those tests, but they pass locally.
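The mechanics being tested can be shown with a self-contained toy (hypothetical function names; note the CI caveat: with `--check-bounds=yes`, as on GitHub Actions, the `@boundscheck` block is never elided):

```julia
# Toy version of the j > i check: @boundscheck blocks are elided when the
# *caller* uses @inbounds and the check is reachable through inlining,
# which Base.@propagate_inbounds guarantees.
Base.@propagate_inbounds function upper_triangle_zero(i, j)
    @boundscheck j > i && return 0.0   # elided under @inbounds
    return 1.0                         # stand-in for the actual data access
end

checked(i, j) = upper_triangle_zero(i, j)
elided(i, j)  = @inbounds upper_triangle_zero(i, j)

checked(1, 2)   # 0.0: the boundscheck runs, [1,2] is in the upper triangle
elided(1, 2)    # with default flags the check is compiled away
```

This is exactly why indexing `[1,2]` makes a neat test: it only returns zero while the `@boundscheck` survives.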
I also just learned that GitHub Actions always checks bounds regardless of `@inbounds` annotations, so I've commented out those tests, but they pass locally.
interesting!
I don't think this is our intention (with `Array` or `JLArray`):

```julia
julia> JL = adapt(JLArray, L)
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 0.533073  0.0        0.0       0.0       0.0
 0.879798  0.921815   0.0       0.0       0.0
 0.313754  0.224968   0.229314  0.0       0.0
 0.811464  0.603529   0.325419  0.481392  0.0
 0.64585   0.0579084  0.509436  0.163703  0.540856

julia> JL + JL  # looks good
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 1.06615   0.0       0.0       0.0       0.0
 1.7596    1.84363   0.0       0.0       0.0
 0.627507  0.449935  0.458628  0.0       0.0
 1.62293   1.20706   0.650837  0.962785  0.0
 1.2917    0.115817  1.01887   0.327406  1.08171

julia> JL .+ JL  # escapes, and also where does the upper triangle come from?
5×5 JLArray{Float64, 2}:
 1.06615   1.2917    1.20706   0.458628  0.650837
 1.7596    1.84363   0.115817  0.650837  1.01887
 0.627507  0.449935  0.458628  1.01887   0.962785
 1.62293   1.20706   0.650837  0.962785  0.327406
 1.2917    0.115817  1.01887   0.327406  1.08171
```
I think the upper triangle comes from evaluating `JL[i,j]` for `j > i` with `@inbounds`, which maps (incorrectly, because the indices aren't in bounds) back to indices in the lower triangle.
The corresponding operation for ring grids is fine though!
I've removed the directly defined arithmetic for `LowerTriangularMatrix`, as they were obscuring incorrectly working broadcasting. Now it's done via broadcasting and works largely, except for:
Similar to #520 I've defined two broadcasting styles, one for CPU (`<: Broadcast.AbstractArrayStyle{N}`) and one for GPU (`<: GPUArrays.AbstractGPUArrayStyle{N}`), which solves the problem from above. There is, however, still a problem with non-zero-preserving operations like `.==`:

```julia
julia> L .== L
ERROR: BoundsError: attempt to access 5×5 LowerTriangularMatrix{Bool} at index [1, 2]
```

because this currently returns a `LowerTriangularMatrix` although it should be a `BitMatrix` (where a `1` can be written into the upper triangle). The corresponding operation with `JLArrays` works though (because the indexing is done differently, only hitting the lower triangle):

```julia
julia> JL .== JL
5×5 LowerTriangularArray{Bool, 2, JLArray{Bool, 1}}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
```
which is probably not what one would expect, but not too bad in the worst case. I've just changed all tests to use `L == L` or `JL == JL` instead, which works fine.
Next issue:

```julia
julia> L
5×5 LowerTriangularMatrix{Float64}:
 0.296704  0.0       0.0       0.0       0.0
 0.398559  0.926045  0.0       0.0       0.0
 0.958507  0.130742  0.300892  0.0       0.0
 0.867598  0.842258  0.588318  0.977474  0.0
 0.869255  0.989118  0.742295  0.500998  0.221757

julia> L2
5×5 LowerTriangularMatrix{ComplexF64}:
 0.750678+0.558921im  0.0+0.0im              …  0.0+0.0im           0.0+0.0im
 0.825657+0.14191im   0.0197592+0.542655im      0.0+0.0im           0.0+0.0im
 0.866794+0.687417im  0.55294+0.572049im        0.0+0.0im           0.0+0.0im
 0.219007+0.697757im  0.127504+0.590993im       0.454295+0.691951im 0.0+0.0im
 0.135136+0.433922im  0.668148+0.740522im       0.259046+0.285174im 0.21516+0.0390588im
```
```julia
julia> L2 .= L
ERROR: conflicting broadcast rules defined
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}()
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}()
One of these should be undefined (and thus return Broadcast.Unknown).
```

This can probably be resolved by making sure the type `T` in the style isn't parametric...
Resolved now by making the `ArrayType` in the broadcasting style non-parametric:

```julia
julia> L = ones(LowerTriangularMatrix{Float32}, 3, 3)
3×3 LowerTriangularMatrix{Float32}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  1.0

julia> L2 = rand(LowerTriangularMatrix{ComplexF64}, 3, 3)
3×3 LowerTriangularMatrix{ComplexF64}:
 0.406483+0.885732im  0.0+0.0im            0.0+0.0im
 0.85852+0.0421888im  0.811017+0.694692im  0.0+0.0im
 0.785031+0.28899im   0.123636+0.516475im  0.482183+0.623109im

julia> L + L2
3×3 LowerTriangularMatrix{ComplexF64}:
 1.40648+0.885732im   0.0+0.0im           0.0+0.0im
 1.85852+0.0421888im  1.81102+0.694692im  0.0+0.0im
 1.78503+0.28899im    1.12364+0.516475im  1.48218+0.623109im
```
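As a self-contained illustration of this design choice (all names here are toy stand-ins, not the package's types): a wrapper whose broadcast style carries no element or array type unifies across eltypes, so mixed `Float32`/`ComplexF64` broadcasts resolve to a single style instead of two conflicting ones.

```julia
# Toy wrapper with a single, non-parametric broadcast style
struct Wrap{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::Wrap) = size(w.data)
Base.getindex(w::Wrap, i::Int) = w.data[i]
Base.setindex!(w::Wrap, v, i::Int) = (w.data[i] = v)
Base.IndexStyle(::Type{<:Wrap}) = IndexLinear()

struct WrapStyle <: Broadcast.AbstractArrayStyle{1} end
WrapStyle(::Val{1}) = WrapStyle()                   # required constructor
Base.BroadcastStyle(::Type{<:Wrap}) = WrapStyle()   # one style for all eltypes

# allocate the output: the eltype T is decided by broadcasting, not by the style
Base.similar(bc::Broadcast.Broadcasted{WrapStyle}, ::Type{T}) where {T} =
    Wrap(Vector{T}(undef, length(axes(bc)[1])))

w1 = Wrap(ones(Float32, 3))
w2 = Wrap([1.0im, 2.0im, 3.0im])
w3 = w1 .+ w2   # no "conflicting broadcast rules": both sides are WrapStyle
```

With the eltype (or wrapped array type) baked into the style, `WrapStyle{Float32}` and `WrapStyle{ComplexF64}` would be distinct styles, and defining a combination rule for both orderings triggers exactly the "conflicting broadcast rules defined" error above.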
Great that the PR is merged!

Just one thing: as we discussed here (or somewhere else?) before, broadcasting is in this case often slower than the direct arithmetic functions that we had defined (`+`, `*`, `scale!`, etc.). If we intend to use this in the dynamical core, it might be worth adding those functions back.
Yes, good call. I guess you really mean writing down the loops, not just propagating the broadcasting to `.data`?
I mean as it was done before and removed in https://github.com/SpeedyWeather/SpeedyWeather.jl/pull/503/commits/20b093816da62b73fdfb7c6ccf0a1b1e9bb82d74
This is the old test I did (copied from our Zulip conversation) that used those functions:

```julia
using SpeedyWeather, BenchmarkTools
NF = Float32
L_cpu = randn(LowerTriangularArray{NF}, 100, 100, 10)

julia> @benchmark L_cpu .* 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.897 μs … 17.866 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     268.755 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   305.518 μs ± 557.126 μs ┊ GC (mean ± σ):  11.72% ±  6.48%

  ▆           ▃▆███▇▅▄▃▃▂▂▃▂▂▂▂▃▃▂▂▂▁▁ ▁                     ▃
  ███▇▇▇▆▃▆▅▆▅▅▃▄▅▄▅▃▅▃▄▄▁▅██████████████████████████████▇▆▇▅▄▅ █
  104 μs        Histogram: log(frequency) by time        451 μs <

 Memory estimate: 390.78 KiB, allocs estimate: 4.

julia> @benchmark L_cpu * 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.140 μs … 18.110 ms  ┊ GC (min … max):  0.00% … 99.06%
 Time  (median):     88.011 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   112.584 μs ± 483.033 μs ┊ GC (mean ± σ): 19.44% ±  4.72%

  ▄▃    ▄▇██▇▆▆▆▅▄▃▂▁▁ ▁▁▁ ▁ ▁                               ▃
  ██▆▄▆▆▁▄▅▅▃▁▁▃▄▄▁▃▁▁▁▄▅████████████████████████▇▇▇▇██▇█▇█▇▇▇▆ █
  14.1 μs       Histogram: log(frequency) by time        182 μs <

 Memory estimate: 197.39 KiB, allocs estimate: 3.
```
Of course yes, they should be added in, I forgot that there's such a difference!
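The kind of hand-written loop those old methods used could be sketched like this, operating directly on the flat data so no index conversion or broadcast machinery is involved (`scale_data!` is a hypothetical helper name, not the package's actual implementation):

```julia
# scale the flattened lower-triangle data in place with a plain loop;
# only stored (lower-triangle) elements are touched
function scale_data!(data::AbstractVector, s::Number)
    @inbounds @simd for k in eachindex(data)
        data[k] *= s
    end
    return data
end
```

For a `LowerTriangularArray` this could then be wrapped as something like `scale!(L, s) = (scale_data!(vec(L.data), s); L)`, keeping the fast path separate from the generic broadcast.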
So, here's a first draft for the rework of the LowerTriangularMatrices submodule to be GPU compatible and N-dimensional.

I implemented it to be agnostic of the actual array type and also avoided any form of custom kernels. All `for`-loops are gone and replaced with operations on the `data` object, or some fancy indexing in some parts. In the current version I kept some of the old functions and restricted them so that they are only used for the CPU `LowerTriangularMatrix`, because they might be faster (not checked).

I kept all old tests in order to make sure that everything still works exactly as before with the old syntax, as long as one uses `LowerTriangularMatrix` as a type. I did add a lot of new tests, and in the end we might also merge a few of these tests to keep the test suite a bit cleaner. GPU tests are done with JLArrays.

There's still some things left to do:

* `dotview` or maybe also some additional `getindex`/`view`-like functions to reduce/avoid any form of allocations when the model calls individual layers (this will probably be done later)
* `copyto!` function (doesn't work with JLArrays, but should with `Array` and `CuArray`)