Open ExpandingMan opened 5 years ago
I'm discovering that one must be extremely careful about this issue in the new build functions. If a function is given an argument that's supposed to specify its `eltype`, one must avoid relying on this argument, because sometimes the `eltype`s of inner containers are different than claimed!
The pyarrow output for arrays not containing nulls is rather strange. It seems that, by default, the pyarrow output schema indicates that all columns are nullable. However, for columns without nulls, instead of outputting a normal bitmask, it outputs zero-length buffers. By this I mean that in the `RecordBatch`, there is a `FieldNode` for the column showing that it has zero nulls, and it contains two `Buffer` objects (as expected). The first of these `Buffer` objects, however, instead of describing the (all 1's) bitmask that you'd expect, has zero length. It would of course make sense to elide the bitmask when it's unnecessary, but in that case I'd expect there to be no `Buffer` object.

I can see the following options for dealing with this:
1. Check whether the `Buffer` has zero length and, if so, return an object without a bitmask.
2. Use `FillArray`s of all 1's instead of a normal arrow bitmask.

Of these options, 3 seems the worst, as it is potentially a huge performance sacrifice. Options 1 and 2 both have the disadvantage that the container types can no longer be uniquely predicted by the schema, though this issue seems somewhat worse in 1. Option 2 seems like a more complicated attempt at a solution that still doesn't really solve the problem, so I think 1 is the only real option.
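
For concreteness, here is a rough sketch of what option 1 amounts to. It is written in plain Python as a language-neutral illustration, not as this package's actual code. The `FieldNode` and `Buffer` classes only mirror the fields of the Arrow IPC structures of the same names, and `read_validity` is a hypothetical helper invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldNode:
    """Stand-in for the Arrow IPC FieldNode: value count and null count."""
    length: int
    null_count: int


@dataclass
class Buffer:
    """Stand-in for the Arrow IPC Buffer: byte offset and byte length in the body."""
    offset: int
    length: int


def read_validity(body: bytes, node: FieldNode, bitmask: Buffer) -> Optional[List[bool]]:
    """Hypothetical helper: return validity flags, or None if the bitmask was elided."""
    if bitmask.length == 0:
        # Option 1: treat a zero-length validity buffer as "no bitmask".
        if node.null_count != 0:
            raise ValueError("validity buffer is empty but null_count is nonzero")
        return None
    raw = body[bitmask.offset : bitmask.offset + bitmask.length]
    # Arrow validity bitmaps are least-significant-bit first within each byte.
    return [bool((raw[i // 8] >> (i % 8)) & 1) for i in range(node.length)]


# A padded 8-byte bitmask in which values 0 and 2 are valid and value 1 is null.
body = bytes([0b00000101]) + bytes(7)
print(read_validity(body, FieldNode(3, 1), Buffer(0, 8)))  # [True, False, True]

# What pyarrow appears to write for a nullable column with no nulls.
print(read_validity(body, FieldNode(3, 0), Buffer(0, 0)))  # None
```

The `Optional` return type is exactly the disadvantage noted above: whether the caller gets a bitmask depends on the data in the batch, not on the schema alone.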