Closed maximilian-gelbrecht closed 5 months ago
With this PR it seems that the roadmap is as follows:

* `LowerTriangularArray` so that the current `LowerTriangularMatrix` is the 2D special case of it (this PR)
* `OctahedralGaussianArray` so that the current `OctahedralGaussianGrid` is the 2D special case of it (another PR)
* `LowerTriangularArray` and 2D/3D `OctahedralGaussianArray` (or other ring grids) in the transforms, and adapt the kernels accordingly (another PR)

I don't really see a way to split up the last point into several PRs; that'll just be a monster PR then again I guess, but that's probably fine.
Yes this sounds reasonable.
For the third point, in theory we could split it into multiple parts by defining temporary aliases for these structures that allow accessing the n-dimensional arrays like the vectors of vectors.
It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.
It's also a question whether the third point should already include proper GPU/KernelAbstractions kernels or not.
I think I'd like to go step by step, simply because changing the indexing structure may introduce errors that we could spot more easily if we don't also do GPU kernels with KernelAbstractions in one go. However, one could already come up with a set of allowed operations / a recommended structure for how to write the kernels so that, while everything is still CPU-based and KernelAbstractions-independent, we have a kernel that is super close to what it's supposed to be.
So for example, if you have a look at the kernels and say what we shouldn't do for the GPU, then I'd be happy to write the kernels with that in mind already.
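As a sketch of what such a "GPU-ready but still CPU-based" kernel could look like (a hypothetical example under my own assumptions, not code from this PR): one flat index per iteration, which would become the thread index in a KernelAbstractions `@kernel`, with no branching on the upper triangle and no scalar reductions.

```julia
# Hypothetical sketch: a CPU loop written so it maps 1:1 onto a GPU kernel
# later. The loop body only touches one flat index k, which would become
# the thread index in a KernelAbstractions @kernel.
function axpy_kernel_style!(out::AbstractVector, a::Number,
                            x::AbstractVector, y::AbstractVector)
    @inbounds for k in eachindex(out, x, y)   # k -> thread index on GPU
        out[k] = a * x[k] + y[k]
    end
    return out
end
```

Written this way, porting to KernelAbstractions later would mostly mean replacing the `for` loop with `@index(Global)`.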
The PR is coming along nicely, but the big bit that's missing is the broadcasting on GPU. I've had a quick look into it, and I am not quite certain how to proceed there right now. To recap:
Our `LowerTriangularArray` stores only the lower triangle of an array/matrix by flattening the first two dimensions. It wraps around an array `data::AbstractArray{T,N-1}`, but is a subtype of `AbstractArray{T,N}`. By defining a custom `getindex` we can either index it with `N-1` indices by directly indexing the `data` array, or with `N` indices that are converted appropriately, so that it can actually behave like an `AbstractArray{T,N}` in most situations.
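To make the index conversion concrete, here is a minimal self-contained sketch of the idea for the 2D case (the names `MiniLTA` and `ij2k` and the exact formula are illustrative; the package's actual implementation may differ):

```julia
# Column-major flat index of entry (i, j) in the lower triangle of an m×n matrix
ij2k(i, j, m) = i + (j - 1) * m - j * (j - 1) ÷ 2

struct MiniLTA{T} <: AbstractMatrix{T}   # 2D case only, for illustration
    data::Vector{T}   # holds just the lower triangle, column by column
    m::Int
    n::Int
end

Base.size(L::MiniLTA) = (L.m, L.n)

# 2-index getindex: convert (i, j) to the flat index, zero above the diagonal
Base.getindex(L::MiniLTA, i::Int, j::Int) =
    j > i ? zero(eltype(L)) : L.data[ij2k(i, j, L.m)]
```

For a 3×3 matrix the 6 stored entries are `data[1:6]` in column-major order, so `MiniLTA(collect(1:6), 3, 3)[3, 2]` hits `data[5]` while `[1, 3]` returns zero.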
For the broadcast that's a problem, as it wants to access the upper triangle that doesn't exist, and therefore quickly causes either a `BoundsError` or incorrect results. On CPU I was able to fix this by defining my own broadcast style and extending the `copyto!(dest::LowerTriangularArray, bc::Broadcasted)` function, so that it correctly copies over only the lower triangle. On GPU, I looked into what's implemented over at GPUArrays.jl, but I am not quite certain whether we have to implement a `copyto!`/`_copyto!` with a new kernel here, or whether it's better to extend one of the other functions related to the broadcasting and the broadcast style defined by GPUArrays.jl. Is it possible to change the axes of the object just for the broadcasting? There's the `Broadcast.preprocess` function. Maybe that's something that could work?
What we want is that any broadcasting is directly applied to all elements of the `data::AbstractArray{T,N-1}` object that `LowerTriangularArray` wraps around, while ideally `LowerTriangularArray <: AbstractArray{T,N}`.
@vchuravy do you have some insights on how to define array broadcasting for the GPU? ☝🏼
I made an initial draft for GPU broadcasting that's working, but it uses the `k2ij` function, so it is quite costly. I more or less just adjusted the code from GPUArrays.jl for it. We could either:

* Keep it like that, and not use broadcasting for performance critical stuff
* Pre-compute all `k2ij` and keep them as a field of all `LowerTriangularArray`
* Intermediate way: at every broadcast compute the full `k2ij` matrix at once before the kernel is launched. I suspect there's a faster way to compute all indices at once than the formula involving the root to compute them individually. The only reason I didn't implement this yet is that I couldn't easily think of a way that's GPU compatible as well. On CPU it's just a nested loop counting up. That's not easily doable on GPU, but maybe it's also fine to always do it on CPU...

I think the GPU broadcasting needs to be tested a little bit more though.
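The "nested loop counting up" on the CPU could look like this minimal sketch (`triangle_length` and `precompute_k2ij` are hypothetical helper names, not the package's API):

```julia
# number of stored elements in the lower triangle of an m×n matrix (m >= n assumed)
triangle_length(m, n) = m * n - n * (n - 1) ÷ 2

# precompute the (i, j) pair for every flat index k by counting up,
# column by column -- the nested loop mentioned above
function precompute_k2ij(m, n)
    ij = Vector{Tuple{Int, Int}}(undef, triangle_length(m, n))
    k = 0
    for j in 1:n, i in j:m      # column-major over the lower triangle
        ij[k += 1] = (i, j)
    end
    return ij
end
```

A GPU-friendly variant would instead invert `ij2k` independently per `k` (the root formula), trading the sequential counting for one standalone computation per thread.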
Otherwise, I can slowly begin the cleanup of the PR and update the docs.
@milankl what's your opinion on the last open points from the code review, e.g. what `eachharmonic` should do?
I updated the documentation of the `LowerTriangularMatrices` submodule, but I didn't update all uses of `LowerTriangularMatrix` throughout the documentation. As `LowerTriangularMatrix` still works as before, as an alias of the 2D `LowerTriangularArray`, we could also leave it like that.
Awesome this is fantastic! Thank you so much 👏 I'll try to mirror your efforts here for the RingGrids then as the next step.
@milankl what's your opinion on the last open points from the code review, e.g. what `eachharmonic` should do?
I'm for deprecating both `eachharmonic` and `eachgridpoint` in favour of extending `eachindex`. I introduced them only because I thought they'd add more clarity, but I actually find them confusing, because lower triangular matrices don't have to be harmonics and there's no win in using `eachgridpoint(grid)` compared to `eachindex(grid)`. Have we tested whether lower triangular matrices being an `AbstractMatrix` but `eachindex` only hitting the lower triangle is a problem?
I made an initial draft for GPU broadcasting that's working, but it uses the `k2ij` function, so it is quite costly. I more or less just adjusted the code from GPUArrays.jl for it. We could either:

* Keep it like that, and not use broadcasting for performance critical stuff
* Pre-compute all `k2ij` and keep them as a field of all `LowerTriangularArray`
* Intermediate way: at every broadcast compute the full `k2ij` matrix at once before the kernel is launched. I suspect there's a faster way to compute all indices at once than the formula involving the root to compute them individually. The only reason I didn't implement this yet is that I couldn't easily think of a way that's GPU compatible as well. On CPU it's just a nested loop counting up. That's not easily doable on GPU, but maybe it's also fine to always do it on CPU...
I believe most of the broadcasting is for user interface purposes, but there are a few places in the dynamical core where we do `@. c*A + B` or similar, where `c::Number` and `A, B` are both lower triangular matrices. Maybe we can use that as a performance benchmark that shouldn't be much slower than it is now.
I'm tempted to go for "keep it like that" but then manually define the broadcasting operations that are performance critical so that we're not dealing with a performance regression inside the dynamical core. I'll look through the code now and put a list of these in here
Where we use broadcasting in the dynamical core:
I think that's it.
Have we tested whether lower triangular matrices being an `AbstractMatrix` but `eachindex` only hitting the lower triangle is a problem?
`eachindex` is defined as

```julia
eachindex(L::LowerTriangularArray) = eachindex(L.data)
```

so it just returns the valid indices of the underlying data. This overrides any defaults from the `AbstractArray` interface.
Where we use broadcasting in the dynamical core:
....
I think that's it.
So, basically that's:
The first two we have already covered. The last one we can easily define as

```julia
elementwise_mul(L1::LowerTriangularArray, L2::LowerTriangularArray) =
    LowerTriangularArray(L1.data .* L2.data, L1.m, L1.n)
```

(with some boundscheck added).
* Elementwise multiplication with another LTA (not sure?)
Yes, technically we don't need this, as it is mathematically not a well-defined operation for two sets of spherical harmonic coefficients, but it might come in handy if we want to broadcast operations like $\nabla^2 A$, where the $\nabla^2$ can be thought of as a matrix. So I'd add this while we're at it, if it doesn't cause complications.
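For illustration, the $\nabla^2$ "matrix" is diagonal in spectral space with eigenvalues $-l(l+1)$ (on the unit sphere, up to a radius factor), so applying it is exactly such an elementwise product over the flattened lower triangle. A hypothetical sketch, not the package's actual Laplacian implementation:

```julia
# Eigenvalues -l(l+1) of the Laplacian on the unit sphere, laid out in the
# same column-major flattened lower-triangle order as the spectral coefficients
function laplacian_eigenvalues(lmax::Int, mmax::Int)
    eig = Float64[]
    for m in 0:mmax, l in m:lmax    # order m (columns), degree l (rows)
        push!(eig, -l * (l + 1))
    end
    return eig
end

# applying the Laplacian then is just an elementwise product on the flat data
apply_laplacian(coeffs, lmax, mmax) = coeffs .* laplacian_eigenvalues(lmax, mmax)
```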
Where we use broadcasting in the dynamical core:
...
I think that's it.
Could you quickly write down for all of them whether it is a number, an LTA or something else? For most I know, but for some I am not sure. Then I'd add all of these kinds of operations to the tests for the broadcast. Then we have that covered as well.
@milankl There's something wrong with the documentation build, and I don't know what. It literally takes hours. Do you have any idea? This was already the case before I made any changes to the documentation; I am just noticing it now. The report shows several errors from `analysis.md`, but I don't know if that's really the culprit.
Huh, now the documentation build is just failing relatively quickly. The only important change that I made was to move `JLArrays` from a direct dependency to `[extras]` and `[test]`. Strange...
This is why:
┌ Error: failed to run `@example` block in src/analysis.md:69-74
│ ```@example analysis
│ using SpeedyWeather
│ spectral_grid = SpectralGrid(trunc=31, nlev=1)
│ model = ShallowWaterModel(;spectral_grid)
│ simulation = initialize!(model)
│ ```
│ exception =
│ MethodError: no method matching (LowerTriangularMatrix{ComplexF64})(::UndefInitializer, ::Int64, ::Int64)
│ Stacktrace:
│ [1] spectral(map::FullGaussianGrid{Float64}, S::SpectralTransform{Float64})
│ @ SpeedyWeather.SpeedyTransforms ~/work/SpeedyWeather.jl/SpeedyWeather.jl/src/SpeedyTransforms/spectral_transform.jl:564
I don't think this is the reason why the CI jobs are stuck for 6 hours though. It's a bit strange.
Next week I'll adjust it, so that really all tests of SpeedyWeather pass.
There's one initial change I already had to make:
```julia
const LowerTriangularMatrix{T} = LowerTriangularArray{T, 2, Vector{T}}
```

so it's always a CPU array. Now it should be really backwards compatible.
Another thing I noticed: the tests for `spectral_gradient` want to index the `LowerTriangularMatrix` with `:`. Previously this worked just via the fallback/default routines from `AbstractArray`. As we have a bit more complex indexing now with the n-dimensional case, this doesn't work anymore, but we could define it.
My suggestion:

* I could implement the `getindex` version with the `:`, but I also don't think we really need it
* We don't implement the `setindex` versions, and instead change the tests. Doing a `setindex!` is dangerous because it really only works along the first column and last row; otherwise some elements don't exist. So we also shouldn't implement it.

At the same time though, indexing trailing dimensions with `:` works. And indexing the lower triangle with ranges works as well. So something like `vor[1:10,1,:,:]` is always possible.
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Just dropping in here to say nice work :partying_face:
Just dropping in here to say nice work 🥳
You're always welcome to join the party!! It's fun, I promise! Have you seen the new cool Makie extension? And also I've documented convection and condensation, they're much more stable and mature now.
I don't think this is the reason why the CI jobs are stuck for 6 hours though. It's a bit strange.
Stuck sounds like #483 ?
`const LowerTriangularMatrix{T} = LowerTriangularArray{T, 2, Vector{T}}` so it's always a CPU array. Now it should be really backwards compatible.
That's great yeah, I think a lot of the interactive stuff would happen on the CPU anyway, so it's fine to fix it to a vector.
Another thing I noticed: the tests for `spectral_gradient` want to index the `LowerTriangularMatrix` with `:`. Previously this worked just via the fallback/default routines from `AbstractArray`. As we have a bit more complex indexing now with the n-dimensional case, this doesn't work anymore, but we could define it.
My suggestion:
* I could implement the `getindex` version with the `:`, but I also don't think we really need it
* We don't implement the `setindex` versions, and instead change the tests. Doing a `setindex!` is dangerous because it really only works along the first column and last row; otherwise some elements don't exist. So we also shouldn't implement it.
Yes, I'm happy for you to change these tests. If it's required that one can only set the lower triangle, that's fine; there should just be some interface to set individual indices and ranges of indices manually, as their use is outlined in that test.
At the same time though, indexing trailing dimensions with `:` works. And indexing the lower triangle with ranges works as well. So something like `vor[1:10,1,:,:]` is always possible.
This is great, will that return a 10xNxM `Array`? I believe the way more common use case will be `[:,1]` for a `LowerTriangularArray{T,3}` and `[:,1,1]` for 4D, meaning slicing out a lower triangular matrix from some 3D/4D array. That should return a `LowerTriangularMatrix` again, which a lot of our functionality is currently built around. Sure, the dynamical core will then see a move to deal with 3D/4D directly, but in the same way as `A[:,1]` of a matrix `A` returns (allocating) a `Vector`, I'd love to have an interface where I can just say

```julia
k, t = 1, 1
vor = simulation.prognostic_variables.vor[:,k,t]
heatmap(gridded(vor))
```

to obtain the spectral vorticity `vor` on layer `k` at timestep `t`, so that we have a convenient interface to inspect 3D/4D variables.
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Ouh, why is that? @jmert any insights? But technically the Legendre polynomials don't have to be a `LowerTriangularArray`; they are only precomputed when creating the `SpectralTransform`, only read/recomputed in `spectral!` and `gridded!`, and never exposed to the user or to other operations that should be convenient.
This is great, will that return a 10xNxM `Array`? I believe the way more common use case will be `[:,1]` for a `LowerTriangularArray{T,3}` and `[:,1,1]` for 4D, meaning slicing out a lower triangular matrix from some 3D/4D array. That should return a `LowerTriangularMatrix` again, which a lot of our functionality is currently built around. Sure, the dynamical core will then see a move to deal with 3D/4D directly, but in the same way as `A[:,1]` of a matrix `A` returns (allocating) a `Vector`, I'd love to have an interface where I can just say `k, t = 1, 1; vor = simulation.prognostic_variables.vor[:,k,t]; heatmap(gridded(vor))` to obtain the spectral vorticity `vor` on layer `k` at timestep `t`, so that we have a convenient interface to inspect 3D/4D variables.
Yes, that's a good point. Currently all `getindex` return an `Array` (or `CuArray`, or whatever the `LowerTriangularArray` wraps around). But it would be easy to implement the behaviour to return a `LowerTriangularMatrix` when the leading index is `:`. The "easy" way to implement this will almost surely be allocating though.
I guess we could also directly implement an appropriate `view` though, because this already works as well:

```julia
L = rand(LowerTriangularArray{Float32}, 10, 10, 5)
v = view(L.data, :, 1)
L2 = LowerTriangularArray(v, 10, 10)  # no allocations caused
L2[1,3]  # can be indexed as a regular LowerTriangularMatrix
```
Yes, that's a good point. Currently all `getindex` return an `Array` (or `CuArray`, or whatever the `LowerTriangularArray` wraps around). But it would be easy to implement the behaviour to return a `LowerTriangularMatrix` when the leading index is `:`. The "easy" way to implement this will almost surely be allocating though.
Allocating is fine, and probably even preferred, because that's what a user (incl. us) working with arrays would expect, see:

```julia
julia> A = rand(3,3)
3×3 Matrix{Float64}:
 0.737496  0.497421  0.7867
 0.115028  0.231549  0.465543
 0.585832  0.309741  0.569258

julia> v = A[:,1]  # allocate vector from first column
3-element Vector{Float64}:
 0.7374955706774863
 0.11502825073154921
 0.5858324735292674

julia> v[1] = 0  # changing first index
0

julia> A  # doesn't change the original array
3×3 Matrix{Float64}:
 0.737496  0.497421  0.7867
 0.115028  0.231549  0.465543
 0.585832  0.309741  0.569258
```
I guess we could also directly implement an appropriate `view` though, because this already works as well: `L = rand(LowerTriangularArray{Float32}, 10, 10, 5); v = view(L.data, :, 1); L2 = LowerTriangularArray(v, 10, 10); L2[1,3]`

Okay, that's pretty cool in case someone wants to do heavy data analysis avoiding allocations, but I see views more as nice-to-have, because I believe in the dynamical core we'll eventually just operate on the whole arrays, and for the user interface allocations are in most cases just fine!
Also, AssociatedLegendrePolynomials.jl indexes 2D arrays with `[CartesianIndex(), i, j]`, so with a leading empty index. Apparently `AbstractArray`s allow that, so I had to add a function that allows that as well. Just something to watch out for when adjusting the transforms to work with the N-dimensional arrays. Maybe there will be another problem there.
Ouh, why is that? @jmert any insights? But technically the Legendre polynomials don't have to be a `LowerTriangularArray`; they are only precomputed when creating the `SpectralTransform`, only read/recomputed in `spectral!` and `gridded!`, and never exposed to the user or to other operations that should be convenient.
The reason for the `CartesianIndex()` is that there's only a single implementation of the Legendre polynomial calculation, which is used for all combinations of scalar or vector inputs, leading to scalar (`(ell, m)`), vector (`(0:lmax, m)`), and matrix (`(0:lmax, 0:mmax)`) outputs. The 0-dimensional `CartesianIndex()` allows for (/ is a consequence of, here) using generic indexing that doesn't care about the input shape:

```julia
julia> let x = 0.5
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex()

julia> let x = zeros(1)
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex(1,)

julia> let x = zeros(1, 1)
           first(CartesianIndices(map(Base.Slice, axes(x))))
       end
CartesianIndex(1, 1)
```
I added the indexing with a leading `:`, but there's a problem: currently the code gets stuck on initialising `PrimitiveDryModel` and `PrimitiveWetModel`. I tried initialising every component individually and this works without problems, but somehow calling `PrimitiveWetModel()` just never finishes. I don't know why. `BarotropicModel` and `ShallowWaterModel` work fine.
So this works fine:

```julia
# copied from PrimitiveWetModel (without type information)
using SpeedyWeather
spectral_grid = SpectralGrid()
geometry = Geometry(spectral_grid)

# DYNAMICS
dynamics::Bool = true
planet = Earth(spectral_grid)
atmosphere = EarthAtmosphere(spectral_grid)
coriolis = Coriolis(spectral_grid)
geopotential = Geopotential(spectral_grid)
adiabatic_conversion = AdiabaticConversion(spectral_grid)
particle_advection = NoParticleAdvection()
initial_conditions = InitialConditions(PrimitiveWet)

# BOUNDARY CONDITIONS
orography = EarthOrography(spectral_grid)
land_sea_mask = LandSeaMask(spectral_grid)
ocean = SeasonalOceanClimatology(spectral_grid)
land = SeasonalLandTemperature(spectral_grid)
solar_zenith = SpeedyWeather.WhichZenith(spectral_grid, planet)
albedo = AlbedoClimatology(spectral_grid)
soil = SeasonalSoilMoisture(spectral_grid)
vegetation = VegetationClimatology(spectral_grid)

# PHYSICS/PARAMETERIZATIONS
physics::Bool = true
clausius_clapeyron = ClausiusClapeyron(spectral_grid, atmosphere)
boundary_layer_drag = BulkRichardsonDrag(spectral_grid)
temperature_relaxation = NoTemperatureRelaxation(spectral_grid)
vertical_diffusion = SpeedyWeather.BulkRichardsonDiffusion(spectral_grid)
surface_thermodynamics = SurfaceThermodynamicsConstant(spectral_grid)
surface_wind = SurfaceWind(spectral_grid)
surface_heat_flux = SurfaceSensibleHeat(spectral_grid)
evaporation = SurfaceEvaporation(spectral_grid)
large_scale_condensation = ImplicitCondensation(spectral_grid)
convection = SimplifiedBettsMiller(spectral_grid)
shortwave_radiation = TransparentShortwave(spectral_grid)
longwave_radiation = JeevanjeeRadiation(spectral_grid)

# NUMERICS
device_setup = DeviceSetup(SpeedyWeather.CPUDevice())
time_stepping = Leapfrog(spectral_grid)
spectral_transform = SpectralTransform(spectral_grid)
implicit = ImplicitPrimitiveEquation(spectral_grid)
horizontal_diffusion = HyperDiffusion(spectral_grid)
vertical_advection = CenteredVerticalAdvection(spectral_grid)
hole_filling = ClipNegatives(spectral_grid)

# OUTPUT
output = OutputWriter(spectral_grid, PrimitiveWet)
callbacks = Dict{Symbol, SpeedyWeather.AbstractCallback}()
feedback = Feedback()
```
But just doing `PrimitiveWetModel()` doesn't. (This is on Julia 1.9, macOS x86_64.)
Same as #508?
Looks very similar, but not on Apple Silicon. I got an i5 in my laptop.
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
All tests pass on 1.10.
I'd still like to test it on a proper GPU, before merging, I'll probably do that later today.
Also, still should do a benchmark, now that that's available
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
It must be somehow related to the many parameters these structs have. I can check tomorrow, but I'd go about defining a shorter PrimitiveDryModel with fewer fields. Or there's one field that causes some kind of dispatch loop when used in model creation. After all, we are doing `PrimitiveDryModel(; a, b, c)` and so Julia figures out the parameters of `PrimitiveDryModel` somehow...
Ok, sorry, the problem has nothing to do with this branch. It also occurs on the main branch for me. I just haven't run the main branch in a while for those models.
It must be somehow related to the many parameters these structs have. I can check tomorrow, but I'd go about defining a shorter PrimitiveDryModel with fewer fields. Or there's one field that causes some kind of dispatch loop when used in model creation. After all, we are doing `PrimitiveDryModel(; a, b, c)` and so Julia figures out the parameters of `PrimitiveDryModel` somehow...
I did some trial and error just now, and I think it might just be too many parameters. For a moment I thought I had found the culprit, but then commenting other fields back in got the thing stuck again. So my guess right now is that Julia 1.9 struggles with determining so many types/fields.
Benchmark results are pretty much identical and it runs with CuArray as well.
README.md.ltm_branch.md README.md.main_branch.md
From my side, I think this PR is pretty much finished.
After reviewing the RingGrids PR, I noticed that we may need to add another `similar` function for the GPU broadcasting. But right now I am not even sure when exactly this function matters, as all "regular" broadcasting tests already pass without it.
This wasn't possible:

```julia
julia> A = randn(LowerTriangularMatrix, 33, 32, 5)
ERROR: MethodError: no method matching randn(::Random.TaskLocalRNG, ::Type{LowerTriangularMatrix})
```
I've included `rand`, `randn` into the loop where also `zeros`, `ones` are defined, and removed other methods.

Edit: Now `f(LowerTriangularMatrix, l, m)` as well as `f(LowerTriangularArray, l, m)` (2D) are possible given that `Matrix <: Array`, but for 3D+ only `f(LowerTriangularArray, l, m, k...)` is possible, with `f` any of the constructors `ones, zeros, rand, randn`.
I still get this with a 5x5x5 matrix:

```julia
julia> L[6,5,1]  # should throw
ERROR: BoundsError: attempt to access 5×5×5 LowerTriangularArray{Float64, 3, Matrix{Float64}} at index [6, 5]

julia> L[5,5,6]  # should also throw!
3.6338242925240716e174
```

don't know why the tests passed for you?

EDIT: This is resolved below.
I'm replacing your

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    # to get a zero element in the correct shape, we just take the zero element of some
    # valid element; there are probably faster ways to do this (especially when ":" is
    # used), but this is just a fallback anyway
    j > i && return zero(getindex(L.data, 1, I[3:end]...))
    k = ij2k(i, j, L.m)
    @inbounds r = getindex(L.data, k, I[3:end]...)
    return r
end
```
with

```julia
@inline function Base.getindex(L::LowerTriangularArray{T,N}, I::Vararg{Integer,N}) where {T,N}
    i, j = I[1:2]
    @boundscheck (0 < i <= L.m && 0 < j <= L.n) || throw(BoundsError(L, (i, j)))
    @boundscheck j > i && return zero(T)
    k = ij2k(i, j, L.m)
    return getindex(L.data, k, I[3:end]...)
end
```
because

* with `@inline` the `@inbounds` from the caller will propagate to the `getindex(L.data, ...)`
* with `@inbounds` there's no check that the `I` indices are in bounds!
* the `@boundscheck` around the `j > i` case is then compiled away when using `@inbounds`. Zero should be returned for `[1,2]` for example, but I think we can expect people to only access the lower triangle in an `@inbounds` case; that also saves one `>` check for performance
* `zero(T)` works directly for `T`, e.g. `Float32`, and there is no need to request the data element at that index?

Yes, you are right! Thanks!
Okay, I think this is almost ready to go!! I still want to check that `@inbounds` is propagated correctly. Also, the module is still called `LowerTriangularMatrices`, but I think we should keep that to highlight it's for lower triangular matrices even if we stack them up in n-dimensional arrays, not that someone thinks the lower triangle becomes a quarter of a pyramid in 3D!
I still want to check that @inbounds is propagated correctly
No, they weren't; now all is corrected, I believe. It's a bit weird: I needed some `Base.@propagate_inbounds` (which also inlines) on `getindex` but not on `setindex!`. I'm now always testing with `[1,2]`, which is an invalid index, hence returning 0 for `getindex` or throwing a `BoundsError` for `setindex!`; but with `@inbounds` the `ij2k` function actually just returns `[m,1]`, which is wrong but a valid index. So it's a neat way to test whether the `@boundscheck` is compiled away or not (although actually reaching outside of the memory for tests isn't a good idea...).
I also just learned that GitHub Actions always checks bounds regardless of `@inbounds` annotations, so I've commented out those tests, but they pass locally.
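The mechanics being tested can be shown with a self-contained toy (hypothetical function names; note the CI caveat: with `--check-bounds=yes`, as on GitHub Actions, the `@boundscheck` block is never elided):

```julia
# Toy version of the j > i check: @boundscheck blocks are elided when the
# *caller* uses @inbounds and the check is reachable through inlining,
# which Base.@propagate_inbounds guarantees.
Base.@propagate_inbounds function upper_triangle_zero(i, j)
    @boundscheck j > i && return 0.0   # elided under @inbounds
    return 1.0                         # stand-in for the actual data access
end

checked(i, j) = upper_triangle_zero(i, j)
elided(i, j)  = @inbounds upper_triangle_zero(i, j)

checked(1, 2)   # 0.0: the boundscheck runs, [1,2] is in the upper triangle
elided(1, 2)    # with default flags the check is compiled away
```

This is exactly why indexing `[1,2]` makes a neat test: it only returns zero while the `@boundscheck` survives.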
I also just learned that GitHub Actions always checks bounds regardless of `@inbounds` annotations, so I've commented out those tests, but they pass locally.
interesting!
I don't think this is our intention (with `Array` or `JLArray`):

```julia
julia> JL = adapt(JLArray, L)
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 0.533073  0.0        0.0       0.0       0.0
 0.879798  0.921815   0.0       0.0       0.0
 0.313754  0.224968   0.229314  0.0       0.0
 0.811464  0.603529   0.325419  0.481392  0.0
 0.64585   0.0579084  0.509436  0.163703  0.540856

julia> JL + JL  # looks good
5×5 LowerTriangularArray{Float64, 2, JLArray{Float64, 1}}:
 1.06615   0.0       0.0       0.0       0.0
 1.7596    1.84363   0.0       0.0       0.0
 0.627507  0.449935  0.458628  0.0       0.0
 1.62293   1.20706   0.650837  0.962785  0.0
 1.2917    0.115817  1.01887   0.327406  1.08171

julia> JL .+ JL  # escapes, and also where does the upper triangle come from?
5×5 JLArray{Float64, 2}:
 1.06615   1.2917    1.20706   0.458628  0.650837
 1.7596    1.84363   0.115817  0.650837  1.01887
 0.627507  0.449935  0.458628  1.01887   0.962785
 1.62293   1.20706   0.650837  0.962785  0.327406
 1.2917    0.115817  1.01887   0.327406  1.08171
```
I think the upper triangle comes from evaluating `JL[i,j]` for `j > i` with `@inbounds`, which maps (incorrectly, because the indices aren't in bounds) back to indices in the lower triangle.
The corresponding operation for ring grids is fine though!
I've removed the directly defined arithmetic for `LowerTriangularMatrix`, as they were obscuring incorrectly working broadcasting. Now it's done via broadcasting and works largely, except for:
Similar to #520 I've defined two broadcasting styles, one for CPU (`<: Broadcast.AbstractArrayStyle{N}`) and one for GPU (`<: GPUArrays.AbstractGPUArrayStyle{N}`), which solves the problem from above. There is, however, still a problem with non-zero-preserving operations like `.==`:

```julia
julia> L .== L
ERROR: BoundsError: attempt to access 5×5 LowerTriangularMatrix{Bool} at index [1, 2]
```

because this currently returns a `LowerTriangularMatrix` although it should be a `BitMatrix` (where a `1` can be written into the upper triangle). The corresponding operation with `JLArrays` works though (because the indexing is done differently, only hitting the lower triangle):

```julia
julia> JL .== JL
5×5 LowerTriangularArray{Bool, 2, JLArray{Bool, 1}}:
 1  0  0  0  0
 1  1  0  0  0
 1  1  1  0  0
 1  1  1  1  0
 1  1  1  1  1
```
which is probably not what one would expect, but not too bad in the worst case. I've just changed all tests to use `L == L` or `JL == JL` instead, which works fine.
Next issue:

```julia
julia> L
5×5 LowerTriangularMatrix{Float64}:
 0.296704  0.0       0.0       0.0       0.0
 0.398559  0.926045  0.0       0.0       0.0
 0.958507  0.130742  0.300892  0.0       0.0
 0.867598  0.842258  0.588318  0.977474  0.0
 0.869255  0.989118  0.742295  0.500998  0.221757

julia> L2
5×5 LowerTriangularMatrix{ComplexF64}:
 0.750678+0.558921im  0.0+0.0im              …  0.0+0.0im           0.0+0.0im
 0.825657+0.14191im   0.0197592+0.542655im      0.0+0.0im           0.0+0.0im
 0.866794+0.687417im  0.55294+0.572049im        0.0+0.0im           0.0+0.0im
 0.219007+0.697757im  0.127504+0.590993im       0.454295+0.691951im 0.0+0.0im
 0.135136+0.433922im  0.668148+0.740522im       0.259046+0.285174im 0.21516+0.0390588im
```
```julia
julia> L2 .= L
ERROR: conflicting broadcast rules defined
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}()
  Broadcast.BroadcastStyle(::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}, ::Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{ComplexF64}}) = Main.SpeedyWeather.LowerTriangularMatrices.LowerTriangularStyle{2, Vector{Float64}}()
One of these should be undefined (and thus return Broadcast.Unknown).
```

This can probably be resolved by making sure the type `T` in the style isn't parametric...
Resolved now by making the `ArrayType` in the broadcasting style non-parametric:

```julia
julia> L = ones(LowerTriangularMatrix{Float32}, 3, 3)
3×3 LowerTriangularMatrix{Float32}:
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  1.0  1.0

julia> L2 = rand(LowerTriangularMatrix{ComplexF64}, 3, 3)
3×3 LowerTriangularMatrix{ComplexF64}:
 0.406483+0.885732im  0.0+0.0im            0.0+0.0im
 0.85852+0.0421888im  0.811017+0.694692im  0.0+0.0im
 0.785031+0.28899im   0.123636+0.516475im  0.482183+0.623109im

julia> L + L2
3×3 LowerTriangularMatrix{ComplexF64}:
 1.40648+0.885732im   0.0+0.0im           0.0+0.0im
 1.85852+0.0421888im  1.81102+0.694692im  0.0+0.0im
 1.78503+0.28899im    1.12364+0.516475im  1.48218+0.623109im
```
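As a self-contained illustration of this design choice (all names here are toy stand-ins, not the package's types): a wrapper whose broadcast style carries no element or array type unifies across eltypes, so mixed `Float32`/`ComplexF64` broadcasts resolve to a single style instead of two conflicting ones.

```julia
# Toy wrapper with a single, non-parametric broadcast style
struct Wrap{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(w::Wrap) = size(w.data)
Base.getindex(w::Wrap, i::Int) = w.data[i]
Base.setindex!(w::Wrap, v, i::Int) = (w.data[i] = v)
Base.IndexStyle(::Type{<:Wrap}) = IndexLinear()

struct WrapStyle <: Broadcast.AbstractArrayStyle{1} end
WrapStyle(::Val{1}) = WrapStyle()                   # required constructor
Base.BroadcastStyle(::Type{<:Wrap}) = WrapStyle()   # one style for all eltypes

# allocate the output: the eltype T is decided by broadcasting, not by the style
Base.similar(bc::Broadcast.Broadcasted{WrapStyle}, ::Type{T}) where {T} =
    Wrap(Vector{T}(undef, length(axes(bc)[1])))

w1 = Wrap(ones(Float32, 3))
w2 = Wrap([1.0im, 2.0im, 3.0im])
w3 = w1 .+ w2   # no "conflicting broadcast rules": both sides are WrapStyle
```

With the eltype (or wrapped array type) baked into the style, `WrapStyle{Float32}` and `WrapStyle{ComplexF64}` would be distinct styles, and defining a combination rule for both orderings triggers exactly the "conflicting broadcast rules defined" error above.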
Great that the PR is merged!

Just one thing: as we discussed here (or somewhere else?) before, broadcasting is in this case often slower than the direct arithmetic functions that we had defined (`+`, `*`, `scale!`, etc.). If we intend to use this in the dynamical core, it might be worth adding those functions back.
Yes, good call. I guess you really mean writing down the loops, not just propagating the broadcasting to `.data`?
I mean as it was done before and removed in https://github.com/SpeedyWeather/SpeedyWeather.jl/pull/503/commits/20b093816da62b73fdfb7c6ccf0a1b1e9bb82d74
This is the old test I did (copied from our Zulip conversation) that used those functions:

```julia
using SpeedyWeather, BenchmarkTools
NF = Float32
L_cpu = randn(LowerTriangularArray{NF}, 100, 100, 10)

julia> @benchmark L_cpu .* 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  103.897 μs … 17.866 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     268.755 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   305.518 μs ± 557.126 μs ┊ GC (mean ± σ):  11.72% ±  6.48%

  ▆           ▃▆███▇▅▄▃▃▂▂▃▂▂▂▂▃▃▂▂▂▁▁ ▁                     ▃
  ███▇▇▇▆▃▆▅▆▅▅▃▄▅▄▅▃▅▃▄▄▁▅██████████████████████████████▇▆▇▅▄▅ █
  104 μs        Histogram: log(frequency) by time        451 μs <

 Memory estimate: 390.78 KiB, allocs estimate: 4.

julia> @benchmark L_cpu * 4.5f0
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.140 μs … 18.110 ms  ┊ GC (min … max):  0.00% … 99.06%
 Time  (median):     88.011 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   112.584 μs ± 483.033 μs ┊ GC (mean ± σ): 19.44% ±  4.72%

  ▄▃    ▄▇██▇▆▆▆▅▄▃▂▁▁ ▁▁▁ ▁ ▁                               ▃
  ██▆▄▆▆▁▄▅▅▃▁▁▃▄▄▁▃▁▁▁▄▅████████████████████████▇▇▇▇██▇█▇█▇▇▇▆ █
  14.1 μs       Histogram: log(frequency) by time        182 μs <

 Memory estimate: 197.39 KiB, allocs estimate: 3.
```
Of course yes, they should be added in, I forgot that there's such a difference!
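The kind of hand-written loop those old methods used could be sketched like this, operating directly on the flat data so no index conversion or broadcast machinery is involved (`scale_data!` is a hypothetical helper name, not the package's actual implementation):

```julia
# scale the flattened lower-triangle data in place with a plain loop;
# only stored (lower-triangle) elements are touched
function scale_data!(data::AbstractVector, s::Number)
    @inbounds @simd for k in eachindex(data)
        data[k] *= s
    end
    return data
end
```

For a `LowerTriangularArray` this could then be wrapped as something like `scale!(L, s) = (scale_data!(vec(L.data), s); L)`, keeping the fast path separate from the generic broadcast.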
So, here's a first draft for the rework of the LowerTriangularMatrices submodule to be GPU compatible and N-dimensional.

I implemented it to be agnostic of the actual array type and also avoided any form of custom kernels. All `for`-loops are gone and replaced with operations on the `data` object, or some fancy indexing in some parts. In the current version I kept some of the old functions and restricted them so that they are only used for the CPU `LowerTriangularMatrix`, because they might be faster (not checked).

I kept all old tests in order to make sure that everything still works exactly as before with the old syntax, as long as one uses `LowerTriangularMatrix` as a type. I did add a lot of new tests, and in the end we might also merge a few of these tests to keep the test suite a bit cleaner. GPU tests are done with JLArrays.

There's still some things left to do:

* `dotview` or maybe also some additional `getindex`/`view`-like functions to reduce/avoid any form of allocations when the model calls individual layers (this will probably be done later)
* `copyto!` function (doesn't work with JLArrays, but should with `Array` and `CuArray`)