JuliaHEP / AwkwardArray.jl

Awkward Array in Julia mirrors the Python library, enabling effortless zero-copy data exchange between Julia and Python
https://juliahep.github.io/AwkwardArray.jl/dev/
MIT License

Handling mixed empty structs #75

Open Moelf opened 6 months ago

Moelf commented 6 months ago
julia> using AwkwardArray: ListOffsetArray as LOA

julia> using AwkwardArray: RecordArray as RA

julia> using AwkwardArray: PrimitiveArray as PA

julia> s = RA((; a=PA([1,2,3]), b=PA([4,5,6])))
3-element AwkwardArray.RecordArray{(:a, :b), Tuple{AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}, AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}}, :default}:
 {a: 1, b: 4}
 {a: 2, b: 5}
 {a: 3, b: 6}

julia>

julia> aux = LOA([0,2,2,3], PA(Dummy[]))
3-element AwkwardArray.ListOffsetArray{Vector{Int64}, AwkwardArray.PrimitiveArray{Dummy, Vector{Dummy}, :default}, :default}:
 #undef
    0-element AwkwardArray.PrimitiveArray{Dummy, Vector{Dummy}, :default}
 #undef

julia> s = RA((; a=PA([1,2,3]), b=PA([4,5,6]), c=aux))
3-element AwkwardArray.RecordArray{(:a, :b, :c), Tuple{AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}, AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}, AwkwardArray.ListOffsetArray{Vector{Int64}, AwkwardArray.PrimitiveArray{Dummy, Vector{Dummy}, :default}, :default}}, :default}:
Error showing value of type AwkwardArray.RecordArray{(:a, :b, :c), Tuple{AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}, AwkwardArray.PrimitiveArray{Int64, Vector{Int64}, :default}, AwkwardArray.ListOffsetArray{Vector{Int64}, AwkwardArray.PrimitiveArray{Dummy, Vector{Dummy}, :default}, :default}}, :default}:
ERROR: BoundsError: attempt to access 0-element Vector{Dummy} at index [1:2]
Stacktrace:
  [1] throw_boundserror(A::Vector{Dummy}, I::Tuple{UnitRange{Int64}})

Should this work? My personal belief is that this is garbage: you can't have completely empty content at any level. But I do get handed this by RNTuple (https://gist.github.com/Moelf/1c9bf1d3ea176c0958605afcaa9c606a):

├─ :TruthBottom ⇒ Vector
│                 ├─ :offset ⇒ Leaf{UnROOT.Index64}(col=40)
│                 └─ :content ⇒ Struct
│                               └─ Symbol(":_0") ⇒ Struct
│                                                  └─ Symbol(":_0") ⇒ Struct
│                                                                     └─ Symbol(":_0") ⇒ Struct
jpivarski commented 6 months ago

This aux is bad, but in a way that should only be checked in a validity-checking pass, not in every constructor (or else there would be a lot of redundant validity checks):

>>> import awkward as ak
>>> from awkward.contents import ListOffsetArray as LOA, RecordArray as RA, NumpyArray as PA
>>> from awkward.index import Index as I
>>> from numpy import array as A
>>>
>>> s = RA([PA(A([1, 2, 3])), PA(A([4, 5, 6]))], ["a", "b"], 3)
>>> s
<RecordArray is_tuple='false' len='3'>
    <content index='0' field='a'>
        <NumpyArray dtype='int64' len='3'>[1 2 3]</NumpyArray>
    </content>
    <content index='1' field='b'>
        <NumpyArray dtype='int64' len='3'>[4 5 6]</NumpyArray>
    </content>
</RecordArray>
>>> print(ak.validity_error(s))

>>> aux = LOA(I(A([0, 2, 2, 3])), PA(A([])))
>>> aux
<ListOffsetArray len='3'>
    <offsets><Index dtype='int64' len='4'>[0 2 2 3]</Index></offsets>
    <content><NumpyArray dtype='float64' len='0'>[]</NumpyArray></content>
</ListOffsetArray>
>>> print(ak.validity_error(aux))
at highlevel ("<class 'awkward.contents.listoffsetarray.ListOffsetArray'>"): stop[i] > len(content) at i=0 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-29/awkward-cpp/src/cpu-kernels/awkward_ListArray_validity.cpp#L24)

>>> s = RA([PA(A([1, 2, 3])), PA(A([4, 5, 6])), aux], ["a", "b", "c"], 3)
>>> s
<RecordArray is_tuple='false' len='3'>
    <content index='0' field='a'>
        <NumpyArray dtype='int64' len='3'>[1 2 3]</NumpyArray>
    </content>
    <content index='1' field='b'>
        <NumpyArray dtype='int64' len='3'>[4 5 6]</NumpyArray>
    </content>
    <content index='2' field='c'>
        <ListOffsetArray len='3'>
            <offsets><Index dtype='int64' len='4'>[0 2 2 3]</Index></offsets>
            <content><NumpyArray dtype='float64' len='0'>[]</NumpyArray></content>
        </ListOffsetArray>
    </content>
</RecordArray>
>>> print(ak.validity_error(s))
at highlevel.field(2) ("<class 'awkward.contents.listoffsetarray.ListOffsetArray'>"): stop[i] > len(content) at i=0 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-29/awkward-cpp/src/cpu-kernels/awkward_ListArray_validity.cpp#L24)

The aux is bad because the last value in the ListOffsetArray's offsets requires its content to have length at least 3, but the content has length zero. (What is the data type Dummy? For portability, the element type of a PrimitiveArray must be a bytewise boolean, a number, or, where implemented, a datetime or time difference.)
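To make the check concrete, here is a minimal pure-Python sketch of the kind of test a validity pass could run on offsets. It is not the Awkward Array kernel linked above; the function name `offsets_validity_error` is hypothetical, and only the error wording is borrowed from the message shown in the session.

```python
def offsets_validity_error(offsets, content_length):
    """Return None if the offsets fit the content, else a short message.

    Hypothetical helper for illustration; not part of any library.
    """
    if len(offsets) == 0:
        return "offsets must have at least one element"
    if offsets[0] < 0:
        return "offsets[0] < 0"
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        if start > stop:
            return f"start[i] > stop[i] at i={i}"
        if stop > content_length:
            return f"stop[i] > len(content) at i={i}"
    return None


# The aux from this thread: offsets [0, 2, 2, 3] over empty content.
print(offsets_validity_error([0, 2, 2, 3], 0))
# The same offsets become valid once the content has length 3.
print(offsets_validity_error([0, 2, 2, 3], 3))
```

Running this as a separate pass, rather than inside every constructor, keeps construction cheap while still catching the problem before the data is used.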


For the RNTuple construct, this would do:

>>> a = LOA(I(A([0, 3, 5])), RA([RA([RA([RA([], [], 5)], [":_0"], 5)], [":_0"], 5)], [":_0"], 5))
>>> a
<ListOffsetArray len='2'>
    <offsets><Index dtype='int64' len='3'>[0 3 5]</Index></offsets>
    <content><RecordArray is_tuple='false' len='5'>
        <content index='0' field=':_0'>
            <RecordArray is_tuple='false' len='5'>
                <content index='0' field=':_0'>
                    <RecordArray is_tuple='false' len='5'>
                        <content index='0' field=':_0'>
                            <RecordArray is_tuple='false' len='5'>
                            </RecordArray>
                        </content>
                    </RecordArray>
                </content>
            </RecordArray>
        </content>
    </RecordArray></content>
</ListOffsetArray>
>>>
>>> ak.Array(a).show(type=True)
type: 2 * var * {
    ":_0": {
        ":_0": {
            ":_0": {

            }
        }
    }
}
[[{':_0': {':_0': {':_0': {}}}}, {...}, {':_0': {':_0': {...}}}],
 [{':_0': {':_0': {':_0': {}}}}, {':_0': {':_0': {...}}}]]

In the above, I assumed that

  1. It's a given that TruthBottom has length 2, so its offsets have length 3 (the length comes from some "number of entries" field elsewhere in the RNTuple).
  2. The offset observed in the leaf is [0, 3, 5], from which it can be deduced that the next step in recursion is looking for data of length 5. (From offsets, you can take the last one. If it's starts and stops, it's the maximum stops[i] for which starts[i] != stops[i].)
  3. Recursing into the readers, looking for structs, I'm expecting something of length 5. Each RecordArray is constructed with 5 as its explicit length, and all contents are asserted to be at least this length.
  4. When we get down to a leaf-node of RecordArrays and find a RecordArray with no fields, then it's just RA([], [], 5).
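The length deduction in step 2 can be sketched in a few lines of plain Python. Both function names are hypothetical and operate on ordinary lists, not on Awkward Array or UnROOT types.

```python
def next_length_from_offsets(offsets):
    # The content must reach at least as far as the last offset.
    return offsets[-1]


def next_length_from_starts_stops(starts, stops):
    # Empty lists (start == stop) may point anywhere, so they are ignored;
    # if every list is empty, no content is required at all.
    return max(
        (stop for start, stop in zip(starts, stops) if start != stop),
        default=0,
    )


print(next_length_from_offsets([0, 3, 5]))
print(next_length_from_starts_stops([0, 9, 3], [3, 9, 5]))
```

Both calls yield 5 here: the offsets end at 5, and in the starts/stops case the empty list at position 9 is skipped, leaving a maximum stop of 5.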

Because RecordArrays can have no fields (unlike UnionArrays), the full set of potential leaves for an Awkward Array tree is: {PrimitiveArray, EmptyArray, RecordArray}. But a RecordArray with no fields is a niche case.

The fact that RecordArrays can have no fields and RegularArrays can have regular size=0 are the reasons why these node types need to have an explicit field for "length" (BitMaskedArray does too, but not in Julia because we can use BitVector), and it's why from_buffers has to be given a length and recurse down with it.

This is when we first realized that in Awkward Array: https://github.com/scikit-hep/awkward/pull/592#issuecomment-743430896.
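As an illustration of why the length has to be carried down explicitly, here is a small sketch with a hypothetical `Record` class standing in for RecordArray. It is not the AwkwardArray.jl or Awkward Array implementation; the nested dict stands in for the struct tree from the RNTuple schema.

```python
class Record:
    """Hypothetical stand-in for a RecordArray: named contents plus a length."""

    def __init__(self, contents, length=None):
        if length is None:
            if not contents:
                # With no fields there is nothing to infer the length from.
                raise ValueError("a record with no fields needs an explicit length")
            length = min(len(c) for c in contents.values())
        for name, content in contents.items():
            if len(content) < length:
                raise ValueError(f"field {name!r} is shorter than {length}")
        self.contents = contents
        self.length = length

    def __len__(self):
        return self.length


def build(tree, length):
    # Recurse through nested structs, passing the expected length down.
    return Record({name: build(sub, length) for name, sub in tree.items()}, length)


# Three nested ":_0" structs ending in a struct with no fields.
tree = {":_0": {":_0": {":_0": {}}}}
outer = build(tree, 5)
innermost = outer.contents[":_0"].contents[":_0"].contents[":_0"]
print(len(outer), len(innermost), innermost.contents)
```

Without the `length` argument, the innermost record would have no way to know it represents 5 items, which is the same reason from_buffers is given a length.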