Closed: aplavin closed this 8 months ago
Nightly failures unrelated
We just hit a Vararg length limitation here. The tuple reducer using Base.tail performs badly on these long inputs, as expected.
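For illustration, a minimal sketch of the kind of Base.tail-based tuple reducer under discussion (the name foldl_tuple is mine, not the package's actual code):

foldl_tuple(f, acc, t::Tuple{}) = acc
foldl_tuple(f, acc, t::Tuple) = foldl_tuple(f, f(acc, first(t)), Base.tail(t))
# Each step peels one element off via Base.tail; for tuples past the Vararg
# specialization limit, inference gives up on this recursion.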
It should be noted that the current version gives a stronger promise of inlining, as map has no @inline, and callsite inlining is not supported before Julia 1.8.
There's also some concern about inferability of nested StructArrays, though it should usually be OK in this case.
> some concern about inferability of nested StructArrays
Could you please suggest a few @inferred tests to add to the test suite to check for that? Not just for this PR, but for anything that happens in the future as well. Otherwise it's very easy to forget!
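For example, something along these lines (a minimal sketch; the nesting depth and names are arbitrary):

using Test, StructArrays
B = StructArray(a=StructArray(b=StructArray(c=StructArray(d=StructArray(e=1:10)))))
@test (@inferred B[5]) == B[5]      # getindex on a deeply nested StructArray stays type-stable
@test (@inferred B.a[5]) == B.a[5]  # same for component access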
> the current version gives a stronger promise of inlining
Are you talking about the @inline annotation? The docs say it neither forces nor guarantees anything: it's an "extra nudge", as they phrase it. They also say small functions typically don't need this annotation at all...
map on Tuples is very fundamental in Julia; I'm pretty sure its performance is kept in check.
I believe map itself is not designed to handle such long tuples, though. It gives up on recursion and falls back to a collect-to-Vector{Any} algorithm, which punishes inference. You need ntuple or a generated function to force unrolling in this case.
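A sketch of the contrast (function names are illustrative):

add1_map(t::Tuple) = map(x -> x + 1, t)  # fine for short tuples, but long ones can hit the Vector{Any} fallback
add1_unrolled(t::NTuple{N,Any}) where {N} = ntuple(i -> t[i] + 1, Val(N))  # Val(N) forces full unrolling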
As for inlining, the current code is more likely to be inlined. In fact, getindex is not that cheap, especially for AbstractArray, so I think we'd better remove all inline blockers here.
Indeed, sounds like ntuple() should bring the best of both worlds here :) Will update the PR; I always forget about that function.
And it would be really great if you could suggest more inference-related tests for StructArrays, so that we could catch (some of) these potential issues automatically in the future.
ntuple is also an inline blocker here. I think a StructArray of SMatrix{4,4} (or bigger) is enough to see the difference.
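For example (a sketch, assuming StaticArrays is available):

using StaticArrays, StructArrays, BenchmarkTools
A = StructArray([SMatrix{4,4}(rand(4,4)) for _ in 1:10])  # 16 components per element
@btime $A[5]  # compare timings and allocations across implementations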
In fact, the limitation comes from Base.argtail, so the simplest fix here is to define tail ourselves.
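For instance, a @generated variant avoids the splatting entirely (a sketch; the name _tail is hypothetical):

@generated function _tail(t::Tuple)
    N = fieldcount(t)
    return Expr(:tuple, (:(t[$i]) for i in 2:N)...)
end
# Unrolls to (t[2], t[3], ..., t[N]) with no recursion or Vararg splatting.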
As for inference, I believe a nested StructArray with 4 or 5 layers is enough to check stability, though I wonder if this has any realistic usage (just like the big struct in the MWE). IIUC, Array should be preferred for these cases.
I also seem to remember we switched to this version because a map-based (or rather ntuple-based) get_ith caused problems in CUDA kernels (see #114 and #87). We should probably add tests that StructArrays work in CUDA kernels, but I'm not sure how to do that without complicating CI (I don't think standard GitHub CI has access to a GPU).
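For reference, such a smoke test could look roughly like this (a sketch, assuming CUDA.jl and access to an actual GPU, which is exactly what standard CI lacks):

using Test, CUDA, StructArrays
function kernel!(out, s)
    i = threadIdx().x
    out[i] = s[i].a + s[i].b  # exercises get_ith inside a kernel
    return nothing
end
s = replace_storage(CuArray, StructArray(a=rand(10), b=rand(10)))
out = CUDA.zeros(10)
@cuda threads=10 kernel!(out, s)
@test Array(out) ≈ Array(s.a) .+ Array(s.b)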
Looks like lots of stuff is interconnected here! Wide, deep, GPU arrays...
Maybe just do a @generated get_ith then? It should always be zero-overhead and work with any GPU type.
Pinging @vchuravy @lcw (those from the PR you linked): do you guys have any suggestions?
Pushed a straightforward @generated implementation: it seems to solve all the mentioned issues!
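For reference, the gist of the approach (a sketch with illustrative names, not necessarily the exact merged code):

@generated function get_ith(cols::NTuple{N,Any}, i::Int) where {N}
    exprs = [:(cols[$j][i]) for j in 1:N]
    return :(tuple($(exprs...)))  # fully unrolled: (cols[1][i], cols[2][i], ...)
end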
I don't know how to add something relevant to tests, but here are benchmarks:
A = StructArray([SMatrix{6,6}(rand(6,6)) for _ in 1:10])
@btime $A[5]
# before:
# 6.948 μs (155 allocations: 5.36 KiB)
# after:
# 1.021 μs (39 allocations: 1.45 KiB)
B = StructArray(a=StructArray(b=StructArray(c=StructArray(d=StructArray(e=StructArray(f=1:10))))))
@btime $B[5]
# before:
# 1.500 ns (0 allocations: 0 bytes)
# after:
# 1.500 ns (0 allocations: 0 bytes)
C = StructArray(;[Symbol("x$i") => 1:10 for i in 1:50]...)
@btime $C[5]
# before:
# 42.375 μs (854 allocations: 45.86 KiB)
# after:
# 1.092 μs (3 allocations: 1.31 KiB)
So there are still some remaining allocations? Perhaps it's another similar problem caused by a splatted Tuple. I guess you need to fix that too to make your case fully optimized.
If we decide to optimize this package for eltypes with big structs, it would be good to ensure it works in more cases. For example, inference apparently would be blocked by the map in _getindex below.
That's a bit above my head; I tried to specialize more of the functions involved, but it didn't help. Suggestions welcome!
And this PR is a strict improvement anyway, with more than an order of magnitude speedup. This can easily turn the overhead from "dominates the profview" to "barely visible there".
I haven't tested it, but the new @generated version of the code looks great. It should work on the GPU as well.
I also like the generated function approach, it does seem very simple and elegant. I've left a couple minor suggestions, but overall looks good to me.
Followed those comments; should be ready! Still not sure if something relevant can be put into the tests, but that would be great.
gentle bump...
bump
I'll take the liberty of merging this, following advice on the #arrays Slack channel. The PR didn't get a strong "no" from maintainers, and the changes are nonbreaking and quite local (i.e. not an overhaul). Please let me know if anything is wrong; it's the first time I've done this in JuliaArrays.
Makes getindex and related functions much faster on arrays with a few tens of components.
Narrow arrays: (benchmark results omitted)
Intermediate arrays: (benchmark results omitted)
I guess there was originally some reason why the current implementation was chosen instead of a more naive one? The latter turns out to be faster now.