JuliaArrays / ShiftedArrays.jl

Lazy shifted arrays for data analysis in Julia

Added speed improvements and CuArray support to CircShiftedArray #66

Open RainerHeintzmann opened 1 year ago

RainerHeintzmann commented 1 year ago

This is achieved by implementing broadcasting for CircShiftedArray. Note that all tests pass except for the @inferred macro. Also, show() was overloaded in such a way that a CuArray displays without an error or warning, but the package does not need to depend on CUDA.jl. Where possible, the base circshift is used upon materialization.

RainerHeintzmann commented 1 year ago

Broadcasting support has now also been added for the standard ShiftedArray type.

piever commented 1 year ago

This seems like a pretty large PR (though in general I think I'm happy with most of the changes; it makes sense to unify the implementation of shifted and circularly shifted arrays). Is this ready for review or are you still working on it?

RainerHeintzmann commented 1 year ago

Should be stable by now and ready for review. I only added minor improvements lately. It is so large due to all the broadcasting support, which is needed to avoid single-element access in CUDA, with the aim of keeping the original interface. Interestingly, this also made other things faster.



RainerHeintzmann commented 1 year ago

... there is still a little problem:

julia> foo(a :: AbstractArray{T,N}) where {T,N} = println("$T $N")
foo (generic function with 1 method)

julia> foo(ShiftedArrays.circshift(ones(4,5),(2,3)))
Union{CircShift, Float64} 2

For eltype I already return the correct type, but in some cases the signature above is bad and leads to problems for CircShiftedArray. It would be good to ensure that this class derives from AbstractArray without the Union. Any ideas?

RainerHeintzmann commented 1 year ago

I somehow wonder if one should create an abstract superclass from which both derive? One could also think of a ShiftedArray being a PaddedArray of a SubArray (@view) and the CircShiftedArray being a CatView or a SubArray. What is your opinion?

RainerHeintzmann commented 1 year ago

... will now try to fix the base-type problem and then also move the shift back to a variable inside the class.

RainerHeintzmann commented 1 year ago

This is done now. The benchmarking showed only minor changes, which are acceptable, and at least the combinatorial-explosion problem is reduced. Most importantly, type stability now seems to be restored.

piever commented 1 year ago

I somehow wonder if one should create an abstract superclass from which both derive?

Yes, this seems like a cleaner solution if one wants to share functionality. The current implementation seems a bit odd with the value of the default element encoded in a type parameter. Having both ShiftedArray and CircShiftedArray be subtypes of a

AbstractShiftedArray{T, N} <: AbstractArray{T, N}

seems a cleaner way to reuse functionality. I would also like to ideally keep the meaning of ShiftedArray and CircShiftedArray the same as they were before to avoid breakage.
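
For concreteness, a minimal sketch of what such a hierarchy could look like (the field layout here is illustrative only, not a concrete proposal):

abstract type AbstractShiftedArray{T, N} <: AbstractArray{T, N} end

# ShiftedArray pads with a default value, hence the Union element type:
struct ShiftedArray{T, M, N, A<:AbstractArray{T,N}} <: AbstractShiftedArray{Union{T,M}, N}
    parent::A
    shifts::NTuple{N,Int}
    default::M
end

# CircShiftedArray wraps around, so the element type stays T:
struct CircShiftedArray{T, N, A<:AbstractArray{T,N}} <: AbstractShiftedArray{T, N}
    parent::A
    shifts::NTuple{N,Int}
end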

One could also think of a ShiftedArray being a PaddedArray of a SubArray (@view) and the CircShiftedArray being a CatView or a SubArray. What is your opinion?

To be honest, I think I would prefer to keep this package very lightweight, so unless there are strong advantages, I would avoid using combinations of other wrapper types.

On a more general note, the PR seems pretty large (and so also a bit difficult to review), is there a way that it could be split into smaller / more manageable PRs? Though maybe switching from the more complex type to the inheritance approach to share functionality would reduce the diff significantly.

RainerHeintzmann commented 1 year ago

I somehow wonder if one should create an abstract superclass from which both derive?

Yes, this seems like a cleaner solution if one wants to share functionality. The current implementation seems a bit odd with the value of the default element encoded in a type parameter. Having both ShiftedArray and CircShiftedArray be subtypes of a

AbstractShiftedArray{T, N} <: AbstractArray{T, N}

seems a cleaner way to reuse functionality. I would also like to ideally keep the meaning of ShiftedArray and CircShiftedArray the same as they were before to avoid breakage.

I think there is almost no breakage in the current version. Even the shifts member is maintained. I think using the default value as a type parameter is not bad, as it is something the compiler would usually know, which leads to a speedup similar to StaticArrays.jl. The only possible breakage is if someone accessed the default argument in the structure, but that should be avoided anyway.

One could also think of a ShiftedArray being a PaddedArray of a SubArray (@view) and the CircShiftedArray being a CatView or a SubArray. What is your opinion?

To be honest, I think I would prefer to keep this package very lightweight, so unless there are strong advantages, I would avoid using combinations of other wrapper types.

Fair enough. But I also did not intend to include any other packages. The thought was more about those packages possibly using the structures created here.

On a more general note, the PR seems pretty large (and so also a bit difficult to review), is there a way that it could be split into smaller / more manageable PRs? Though maybe switching from the more complex type to the inheritance approach to share functionality would reduce the diff significantly.

I doubt it. The size of this PR is due to the details of the broadcasting mechanism, which would probably be needed just as much (if not more) if an abstract array type was introduced. Don't you think the intermediate AbstractArray class could also be introduced at a later stage? Should we ask @roflmaostc for help with the code review?

piever commented 1 year ago

Of course it can be helpful if @roflmaostc can review the code, but I think it would greatly simplify things if we tried to break this PR into more manageable chunks, as the diff really is quite large (esp. given how small the package is).

As far as I see it, a few things are happening.

  1. Merging ShiftedArray and CircShiftedArray into a unique type. Is this actually needed? Why is it better than just having a shared supertype to share functionality? Why does the default value need to be encoded as a Val(x) type parameter and not simply as a field? Many alternatives are possible and should be discussed: this is a big change for a stable package.
  2. Adding some optimized methods for shifted arrays (such as isequal, reverse, similar and so on). This can definitely be done separately.
  3. Adding broadcasting mechanisms, which I imagine will be a complex thing regardless of 1. and 2.

I would propose to first figure out 1. I think we should make the smallest possible change to the type hierarchy, as this package is a dependency of the JuliaStats stack.

RainerHeintzmann commented 1 year ago

  1. Merging ShiftedArray and CircShiftedArray into a unique type. Is this actually needed? Why is it better than just having a shared supertype to share functionality?

I guess this could have been done either way; after all, both have a "shift" parameter.

Why does the default value need to be encoded as a Val(x) type parameter and not simply as a field? Many alternatives are possible and should be discussed: this is a big change for a stable package.

If it is part of the type, it can be determined at compile time rather than runtime. This is why I implemented it this way.
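
For illustration, a minimal (hypothetical, simplified) pattern for carrying a value in the type domain, so that the compiler can constant-fold it:

# Hypothetical sketch, not the PR's actual code:
struct WithDefault{T, N, D, A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
    WithDefault(a::AbstractArray{T,N}, ::Val{D}) where {T,N,D} = new{T,N,D,typeof(a)}(a)
end
Base.size(w::WithDefault) = size(w.parent)
# The default is recovered from the type rather than from a field:
default(::WithDefault{T,N,D}) where {T,N,D} = D
# e.g. w = WithDefault(ones(3), Val(0.0)); default(w) == 0.0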

  2. Adding some optimized methods for shifted arrays (such as isequal, reverse, similar and so on). This can definitely be done separately.

I agree. This could have been done separately.

  3. Adding broadcasting mechanisms, which I imagine will be a complex thing regardless of 1. and 2.

Yes. This is the main and also the most important point. Without the broadcasting it cannot sensibly be used with CUDA.jl. Most of the effort went into trying to establish close to 100% compatibility. Yes, the type hierarchy changed, but I think this should not affect its use. Maybe we should run the tests of the JuliaStats stack with the modified package?

RainerHeintzmann commented 1 year ago

Can we make some progress on this pull request? Maybe @maleadt can have a look at this and comment? The issue is that we have code which uses ShiftedArrays.jl and we are unable to make it work with CUDA.jl without a major rewrite, since the current version of ShiftedArrays.jl is incompatible with CUDA.jl. For this reason I spent quite some time making it compatible. I do understand that this turned a beautifully written, simple package with only a few lines of code into a rather extended (relatively messy) one. Yet, luckily, you wrote a quite extensive test suite, which the version in this PR passes. So if @maleadt or another expert in the field agrees, can we merge it? If you disagree, should I rather rename the package and release it under a different name?

maleadt commented 1 year ago

The issue is that we have code which uses ShiftedArrays.jl and we are unable to make it work with CUDA.jl without a major rewrite, since the current version of ShiftedArrays.jl is incompatible with CUDA.jl.

Can you be more specific? I'm not familiar with this package. What used to be CuArray-incompatible, and what changes did you have to make that are specific to CUDA.jl?

RainerHeintzmann commented 1 year ago

@maleadt Sorry for pulling you into this. In short: this module wraps an array into a shifted or circularly shifted array. Specialities are the support of default/padding values (such as missing) and the cancelling of shifts where possible. The FourierTools.jl package makes use of it. The basic implementation (see the current version in the original repository) works by overloading index access via getindex. This led to small and beautiful code. However, when used together with a CuArray, it caused well-justified warnings and a severe performance loss, especially when such wrapped arrays take part in broadcast operations.
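
For context, the existing approach boils down to something like the following simplified sketch (not the package's exact code):

# Simplified sketch of the index-translation idea behind ShiftedArray's getindex:
function Base.getindex(s::ShiftedArray{T,M,N}, I::Vararg{Int,N}) where {T,M,N}
    J = I .- shifts(s)                   # translate into parent coordinates
    return checkbounds(Bool, parent(s), J...) ? parent(s)[J...] : ShiftedArrays.default(s)
end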

For this reason I implemented a bunch of necessary broadcast overloads for this module, such that direct calls to getindex are avoided as much as possible. This also made the core functionality a bit faster, but most importantly it now fully works with CuArray. The disadvantage is that it made the code significantly longer and also significantly harder to understand. Yet, this is currently the price to pay.

@piever luckily implemented a quite extensive set of tests, which helped enormously in implementing the broadcasting schemes; these now fully pass (and some more) both with and without CuArray as the input datatype. Interestingly, this package still does not depend on CUDA.jl, as the implementations only concern the broadcasting.

A (minor?) detail is that the class structure was slightly changed (avoiding two independent classes) to avoid code duplication, but this should still be compatible with the original classes for all practical purposes.

maleadt commented 1 year ago

The basic implementation (see the current version in the original repository) works by overloading index access via getindex. This led to small and beautiful code. However, when used together with a CuArray, it caused well-justified warnings and a severe performance loss, especially when such wrapped arrays take part in broadcast operations.

I guess you're talking about scalar indexing here? Without looking too closely at the source code (where you seem to materialize the array wrapper eagerly before broadcasting), I wonder if it wouldn't have been possible to have broadcast apply the ShiftedArray wrapper to the CuDeviceArray that gets passed to the broadcast kernel, where the existing getindex overrides could just have been re-used (scalar indexing is bad because you index from the host; the GPU itself can of course freely index device arrays during execution). Such specializations could then maybe be put in an extension package, instead of changing the existing implementation.

RainerHeintzmann commented 1 year ago

Thanks for this interesting idea. This may also be an interesting approach. Of course, you would then not get the speed improvement that the full broadcasting support in the current implementation provides for non-CUDA usage. Yet what I also wonder is how well it would be possible to preserve the use of the "undefined" datatype, and how one would write a getindex function that would run on the CuDeviceArray. Would one need to write specialized CUDA kernels for this? Part of the beauty of the current implementation is that it is not CUDA-specific but just implements broadcasting. Yet I do agree that a small kernel-based getindex access also has benefits and would keep the code cleaner.

maleadt commented 1 year ago

Yet what I also wonder is how well it would be possible to preserve the use of the "undefined" datatype, and how one would write a getindex function that would run on the CuDeviceArray. Would one need to write specialized CUDA kernels for this?

If you pass a CircShiftedArray{<:CuDeviceArray} to the existing broadcast kernel from GPUArrays, getindex from ShiftedArrays.jl would be called, which calls getindex on the underlying array. So you wouldn't need to write any kernel code. The disadvantage of course is that you would be spawning too many threads, i.e., also for the part of the array that's been shifted away (and where I presume getindex from ShiftedArrays.jl returns missing), so materializing the wrapper early and optimizing what you broadcast over might be better for your use cases.

RainerHeintzmann commented 1 year ago

This sounds like quite an interesting strategy, but I have to admit that I cannot fully follow, as I am not acquainted well enough with the internals of CUDA.jl. If I look at the type of CuArray(rand(100,100)), it does not seem to be a subtype of CuDeviceArray, and neither are its storage component or the buffer. From the docs of CuDeviceArray I do not understand whether the ptr is supposed to be a device pointer (e.g. like the one already existing in a CuArray) or a pointer into main memory, in which case it would presumably copy the data to the device.
So your suggestion would be to write an extension package with some code specialization to re-wrap a CircShiftedArray{CuArray} into a CircShiftedArray{CuDeviceArray}, which would then automatically use getindex in the kernel code during any broadcast operation? I am less worried about unused threads for the empty areas of ShiftedArrays (which will anyway stall and be reused) than about the general efficiency of broadcasting and multi-core support, which is a key feature of Julia.

maleadt commented 1 year ago

CuDeviceArray is the device-side representation of a CuArray, and a CuArray gets automatically converted to it when passed to a kernel (via cudaconvert). The only thing that should be required here is ensuring that CircShiftedArray{CuArray} still uses the GPUArrays.jl/CUDA.jl broadcast kernels (probably by overloading BroadcastStyle), and providing an Adapt.jl rule so that cudaconvert(::CircShiftedArray{CuArray}) results in a CircShiftedArray{CuDeviceArray} (e.g. https://github.com/JuliaGPU/Adapt.jl/blob/df06bcb6936baa7352b8cc7bf5f08f98f2653f25/src/wrappers.jl#L10-L11).

RainerHeintzmann commented 1 year ago

Ok. This is actually an impressive scheme, and it basically works. However, as I feared, there is a problem with the non-concrete element type: ERROR: GPU broadcast resulted in non-concrete element type Union{Missing, Float64}. As far as I remember, some of the complications in my code arose from this, which meant that I had to carry this type information through the broadcasting scheme using a specialized broadcast style. Is there a way to avoid this problem?

using ShiftedArrays
using CUDA
using Adapt
CUDA.allowscalar(false);

a = rand(100,100);
# csa = circshift(a, (10,11))
csa = CircShiftedArray(a, (10,11));
ca = CuArray(a);

csca = CircShiftedArray(ca, (10,11));
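# Adapt rule: rewrap the parent array when Adapt converts the wrapper (e.g. via cudaconvert)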
Adapt.adapt_structure(to, x::CircShiftedArray{T, D, CT}) where {T,D,CT<:CuArray} = CircShiftedArray(adapt(to, parent(x)), shifts(csca));

q = adapt(CuArray,csca);
q = cudaconvert(csca);
typeof(q)
qc = csca .+ 1.0;

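# Make broadcasting over CircShiftedArray{<:CuArray} use the CUDA broadcast machinery: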
function Base.Broadcast.BroadcastStyle(::Type{T}) where {T <: CircShiftedArray{<:Any,<:Any,<:CuArray}}
    CUDA.CuArrayStyle{ndims(T)}()
end

qc = csca .+ 1.0;
q  = csa .+ 1.0;
Array(qc) == q # check for identical result
# Great!

sa = ShiftedArray(a,(22,23))
sca = ShiftedArray(ca,(22,23));

# let's do this for the ShiftedArray type
function Base.Broadcast.BroadcastStyle(::Type{T}) where {T <: ShiftedArray{<:Any,<:Any,<:Any,<:CuArray}}
    CUDA.CuArrayStyle{ndims(T)}()
end

q  = sa .+ 1.0;
# Here is the problem:
qc = sca .+ 1.0;
# ERROR: GPU broadcast resulted in non-concrete element type Union{Missing, Float64}.
# This probably means that the function you are broadcasting contains an error or type instability.

maleadt commented 1 year ago
# ERROR: GPU broadcast resulted in non-concrete element type Union{Missing, Float64}.

Eh, that probably just needs to be relaxed. As I mentioned above, CuArray supports isbits-Union storage nowadays. Removing that error as a test, and adding the necessary Adapt rule, makes everything work:

julia> Adapt.adapt_structure(to, x::ShiftedArray{T, M, N, <:CuArray}) where {T,M,N} =
           ShiftedArray(adapt(to, parent(x)), shifts(x); default=ShiftedArrays.default(x))

julia> typeof(cudaconvert(sca))
ShiftedArray{Float64, Missing, 2, CuDeviceMatrix{Float64, 1}}

julia> sca .+ 1.0
100×100 CuArray{Union{Missing, Float64}, 2, CUDA.Mem.DeviceBuffer}:
 missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  …   missing   missing   missing   missing   missing   missing   missing   missing
...

Also:

csca = CircShiftedArray(ca, (10,11));
Adapt.adapt_structure(to, x::CircShiftedArray{T, D, CT}) where {T,D,CT<:CuArray} = CircShiftedArray(adapt(to, parent(x)), shifts(csca));

That csca usage is wrong.

RainerHeintzmann commented 1 year ago

Thanks a lot. Silly me, forgetting the second adapt rule. Thanks also for spotting the wrong csca, which, of course, should be x. If I understand you correctly, the error check needs to be removed in the main branch. Has this been pushed? I tried updating CUDA but still get the same error.

maleadt commented 1 year ago

https://github.com/JuliaGPU/GPUArrays.jl/pull/491

RainerHeintzmann commented 1 year ago

So, here comes the surprise. I did some benchmarking using @btime, which I hope is OK to use for CuArrays. While the (complicated) broadcast version of this PR is 2-8 fold faster for normal (non-CuArray) arrays, for CUDA the simple adapt-rule approach (without the special broadcast code) seems to be 5-10 fold faster.

The question is therefore how to proceed from here. @piever: split the versions and make an extra package for the broadcast support, but keep this one simple and add the adapt rules (in an extension package)? Or dump the (hard) work on broadcast support and only use the adapt rules, therefore not gaining the CPU speedup? Or try to find a (complicated) way to get the optimal performance for each use case?

RainerHeintzmann commented 1 year ago

@piever see above. Any comments how to proceed?

piever commented 1 year ago

@RainerHeintzmann sorry for the very slow reply rate in this PR! I can see that a lot of work already went into this, and it'd be definitely very useful to get ShiftedArrays of CuArrays to work well.

So, here comes the surprise. I did some benchmarking using @btime, which I hope is OK to use for CuArrays. While the (complicated) broadcast version of this PR is 2-8 fold faster for normal (non-CuArray) arrays, for CUDA the simple adapt-rule approach (without the special broadcast code) seems to be 5-10 fold faster.

Maybe a trivial comment, but did you make sure to also use CUDA.@sync? See examples here.
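
For reference, a minimal sketch of synchronized GPU benchmarking (assuming BenchmarkTools.jl provides @btime):

using CUDA, BenchmarkTools
ca = CUDA.rand(1000, 1000);
# without CUDA.@sync, @btime only measures the asynchronous kernel launch:
@btime CUDA.@sync $ca .+ 1.0;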

The question is therefore how to proceed from here. @piever: split the versions and make an extra package for the broadcast support, but keep this one simple and add the adapt rules (in an extension package)? Or dump the (hard) work on broadcast support and only use the adapt rules, therefore not gaining the CPU speedup? Or try to find a (complicated) way to get the optimal performance for each use case?

Having the adapt rule in the package seems perfectly reasonable. It's just a lightweight dependency + 2 lines of code. I would agree to use the new weakdeps machinery (just as OffsetArrays.jl plans to do it: https://github.com/JuliaArrays/OffsetArrays.jl/pull/331).
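
A rough sketch of what such an extension could look like (the module and file names are hypothetical; Project.toml would list CUDA under [weakdeps] and map it to this module under [extensions]):

# ext/ShiftedArraysCUDAExt.jl -- hypothetical extension module
module ShiftedArraysCUDAExt

using ShiftedArrays, CUDA, Adapt

# the adapt rules from this thread would live here, e.g.:
Adapt.adapt_structure(to, x::CircShiftedArray{<:Any, <:Any, <:CuArray}) =
    CircShiftedArray(adapt(to, parent(x)), shifts(x))

end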

Given that the adapt rule already makes things work on the GPU, it might make sense to have the custom broadcasting code (at least for the time being) in an experimental package that does type piracy (say, FastShiftedArraysBroadcast.jl). In the future, once that gets polished, battle-tested and further optimized for the GPU as well, we can consider adding it here.

RainerHeintzmann commented 12 months ago

Thanks for your hints. Indeed I did not consider the CUDA.@sync command. I redid all the benchmarking both ways. Now there is still an advantage of the adapt rule over the full-broadcast implementation: the relative speedup is between 10% and a factor of two for CUDA. However, for non-CUDA it is the other way round: the full-broadcast version is consistently about a factor of two faster than the original (adapt-rule) version. The broadcast package does not have any additional dependencies, I think, so it is a "lightweight dependency". As for battle-testing, you may be right: since it passes your quite extensive test suite, it does have some battle-testing already, but not in a real use-case environment. So how do we continue now?

piever commented 11 months ago

So how do we continue now?

I think the first thing to do is to turn the snippet from https://github.com/JuliaArrays/ShiftedArrays.jl/pull/66#issuecomment-1699096724 (and the analogous one for CircShiftedArray) into a PR (feel free to go for it; otherwise I can do it in the next few days, it should be reasonably straightforward) and test that this fixes broadcast on the GPU (using, for example, https://github.com/JuliaGPU/GPUArrays.jl/tree/master/lib/JLArrays).
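
For instance, a rough sketch of such a test (JLArray is the CPU-backed reference GPU array type from JLArrays.jl):

using JLArrays, ShiftedArrays, Test
a = rand(4, 5)
s = ShiftedArrays.circshift(JLArray(a), (1, 2))
# broadcasting should go through the GPU-style machinery, without scalar indexing:
@test Array(s .+ 1.0) == ShiftedArrays.circshift(a, (1, 2)) .+ 1.0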

I imagine that if you need the extra performance of this PR, you could start developing a FastShiftedArraysBroadcast.jl package (essentially a copy-paste of this PR) that "pirates" the broadcasting methods, and start battle-testing it in your projects and polishing it. Over time we can figure out whether the code complexity / performance trade-off is worth it.

RainerHeintzmann commented 11 months ago

OK. I already wrote that code and I am happy to do a new PR, but only in about two weeks, if that's OK.

piever commented 11 months ago

OK. I already wrote that code and I am happy to do a new PR, but only in about two weeks, if that's OK.

Sounds great!