Closed: aplavin closed this 8 months ago
Nightly failures unrelated
We just hit a Vararg length limitation here. The tuple reducer using Base.tail performs badly on these long inputs, as expected.
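For illustration, a minimal sketch of the kind of Base.tail-based tuple reducer under discussion (the name foldl_tuple is mine, not the package's actual code):

foldl_tuple(f, acc, t::Tuple{}) = acc
foldl_tuple(f, acc, t::Tuple) = foldl_tuple(f, f(acc, first(t)), Base.tail(t))
# Each step peels one element off via Base.tail; for tuples past the Vararg
# specialization limit, inference gives up on this recursion.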
It should be noted that the current version gives a stronger promise of inlining, as map has no @inline, and callsite inlining is not supported before Julia 1.8.
There's also some concern about inferability of nested StructArrays, though it should usually be OK in this case.
> some concern about inferability of nested StructArrays
Could you please suggest a few @inferred tests to add to the test suite to check for that? Not just for this PR, but for anything that happens in the future as well. Otherwise it's very easy to forget!
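For example, something along these lines (a minimal sketch; the nesting depth and names are arbitrary):

using Test, StructArrays
B = StructArray(a=StructArray(b=StructArray(c=StructArray(d=StructArray(e=1:10)))))
@test (@inferred B[5]) == B[5]      # getindex on a deeply nested StructArray stays type-stable
@test (@inferred B.a[5]) == B.a[5]  # same for component access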
> the current version gives a stronger promise of inlining
Are you talking about the @inline annotation? The docs say it neither forces nor guarantees anything: it's an "extra nudge", as they phrase it. They also say small functions typically don't need this annotation at all...
map on Tuples is very fundamental in Julia; I'm pretty sure its performance is kept in check.
I believe map itself is not designed to handle such long tuples, though. It gives up on recursion and falls back to a collect-to-Vector{Any} algorithm, which punishes inference. You need ntuple or a generated function to force unrolling in this case.
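A sketch of the contrast (function names are illustrative):

add1_map(t::Tuple) = map(x -> x + 1, t)  # fine for short tuples, but long ones can hit the Vector{Any} fallback
add1_unrolled(t::NTuple{N,Any}) where {N} = ntuple(i -> t[i] + 1, Val(N))  # Val(N) forces full unrolling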
As for inlining, the current code is more likely to be inlined. In fact, getindex is not that cheap, especially for AbstractArray, so I think we'd better remove all inline blockers here.
Indeed, sounds like ntuple() should bring the best of both worlds here :) Will update the PR; I always forget about that function.
And it would be really great if you could suggest more inference-related tests for StructArrays, so that we could catch (some of) these potential issues automatically in the future.
ntuple is also an inline blocker here. I think a StructArray of SMatrix{4,4} (or bigger) is enough to see the difference.
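For example (a sketch, assuming StaticArrays is available):

using StaticArrays, StructArrays, BenchmarkTools
A = StructArray([SMatrix{4,4}(rand(4,4)) for _ in 1:10])  # 16 components per element
@btime $A[5]  # compare timings and allocations across implementations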
In fact, the limitation comes from Base.argtail, so the simplest fix here is to define tail ourselves.
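For instance, a @generated variant avoids the splatting entirely (a sketch; the name _tail is hypothetical):

@generated function _tail(t::Tuple)
    N = fieldcount(t)
    return Expr(:tuple, (:(t[$i]) for i in 2:N)...)
end
# Unrolls to (t[2], t[3], ..., t[N]) with no recursion or Vararg splatting.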
As for inference, I believe a nested StructArray with 4 or 5 layers is enough to check stability, though I wonder if this has any realistic usage (just like the big struct in the MWE). IIUC, Array should be preferred for these cases.
I also seem to remember we switched to this version because a map-based (or rather ntuple-based) get_ith caused problems in CUDA kernels (see #114 and #87). We should probably add tests that StructArrays work in CUDA kernels, but I'm not sure how to do that without complicating CI (I don't think standard GitHub CI has access to a GPU).
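For reference, such a smoke test could look roughly like this (a sketch, assuming CUDA.jl and access to an actual GPU, which is exactly what standard CI lacks):

using Test, CUDA, StructArrays
function kernel!(out, s)
    i = threadIdx().x
    out[i] = s[i].a + s[i].b  # exercises get_ith inside a kernel
    return nothing
end
s = replace_storage(CuArray, StructArray(a=rand(10), b=rand(10)))
out = CUDA.zeros(10)
@cuda threads=10 kernel!(out, s)
@test Array(out) ≈ Array(s.a) .+ Array(s.b)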
Looks like lots of stuff is interconnected here! Wide, deep, GPU arrays...
Maybe just do a @generated get_ith then? It should always be zero-overhead and work with any GPU type.
Pinging @vchuravy @lcw (those from the PR you linked): do you guys have any suggestions?
Pushed a straightforward @generated implementation: it seems to solve all the mentioned issues!
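For reference, the gist of the approach (a sketch with illustrative names, not necessarily the exact merged code):

@generated function get_ith(cols::NTuple{N,Any}, i::Int) where {N}
    exprs = [:(cols[$j][i]) for j in 1:N]
    return :(tuple($(exprs...)))  # fully unrolled: (cols[1][i], cols[2][i], ...)
end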
I don't know how to add something relevant to tests, but here are benchmarks:
A = StructArray([SMatrix{6,6}(rand(6,6)) for _ in 1:10])
@btime $A[5]
# before:
# 6.948 μs (155 allocations: 5.36 KiB)
# after:
# 1.021 μs (39 allocations: 1.45 KiB)
B = StructArray(a=StructArray(b=StructArray(c=StructArray(d=StructArray(e=StructArray(f=1:10))))))
@btime $B[5]
# before:
# 1.500 ns (0 allocations: 0 bytes)
# after:
# 1.500 ns (0 allocations: 0 bytes)
C = StructArray(;[Symbol("x$i") => 1:10 for i in 1:50]...)
@btime $C[5]
# before:
# 42.375 μs (854 allocations: 45.86 KiB)
# after:
# 1.092 μs (3 allocations: 1.31 KiB)
So there are still some remaining allocations? Perhaps it's another similar problem caused by a splatted Tuple. I guess you need to fix that too to make your case fully optimized.
If we decide to optimize this package for eltypes with big structs, it would be good to ensure it works in more cases. For example, inference apparently would be blocked by the map in _getindex below.
That's a bit above my head; I tried to specialize more of the functions involved, but it didn't help. Suggestions welcome!
And this PR is a strict improvement anyway, with more than an order of magnitude speedup. This can easily turn the overhead from "dominates the profview" to "barely visible there".
I haven't tested it, but the new @generated version of the code looks great. It should work on the GPU as well.
I also like the generated function approach, it does seem very simple and elegant. I've left a couple minor suggestions, but overall looks good to me.
Followed those comments; should be ready! Still not sure if something relevant can be put into the tests, but that would be great.
gentle bump...
bump
I'll take the liberty of merging this, following advice on the #arrays Slack channel. The PR didn't get a strong "no" from maintainers, and the changes are nonbreaking and quite local (i.e. not an overhaul). Please let me know if anything is wrong; it's the first time I've done this in JuliaArrays.
Makes getindex and related functions much faster on arrays with a few tens of components.
Narrow arrays: (benchmark results omitted)
Intermediate arrays: (benchmark results omitted)
I guess there was originally some reason why the current implementation was chosen instead of a more naive one? The latter turns out to be faster now.