ExpandingMan / Arrow.jl

DEPRECATED in favor of [JuliaData/Arrow.jl](https://github.com/JuliaData/Arrow.jl)

dealing with strange conventions for writing nullable arrays with no nulls #43

Open ExpandingMan opened 5 years ago

ExpandingMan commented 5 years ago

The pyarrow output for arrays not containing nulls is rather strange. It seems that, by default, the pyarrow output schema marks all columns as nullable. However, for columns without nulls, instead of writing a normal bitmask, it writes a zero-length buffer. By this I mean that in the RecordBatch there is a FieldNode for the column showing that it has zero nulls, along with two Buffer objects (as expected). The first of these Buffer objects, however, has zero length instead of describing the (all 1's) bitmask that you'd expect. It would of course make sense to elide the bitmask when it's unnecessary, but in that case I'd expect there to be no Buffer object at all.
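To make the layout concrete, here is a minimal Python sketch of how a reader might interpret the first (validity) buffer of such a column. It assumes the Arrow convention that validity bits are packed least-significant-bit first, with 1 meaning non-null. The names `all_ones_bitmap` and `is_valid` are hypothetical and are not part of pyarrow or Arrow.jl.

```python
def all_ones_bitmap(n: int) -> bytes:
    """Build the bitmask one might expect for n non-null slots (LSB-first)."""
    full, rest = divmod(n, 8)
    tail = bytes([(1 << rest) - 1]) if rest else b""
    return b"\xff" * full + tail


def is_valid(validity: bytes, i: int) -> bool:
    """Return True if slot i is non-null.

    Assumption: a zero-length validity buffer means "every slot is valid",
    which is how the pyarrow output described above has to be read.
    """
    if len(validity) == 0:
        return True
    return (validity[i // 8] >> (i % 8)) & 1 == 1


explicit = all_ones_bitmap(10)  # bitmask written out in full
elided = b""                    # zero-length buffer, as described above
same = all(is_valid(explicit, i) == is_valid(elided, i) for i in range(10))
```

Under these assumptions both forms answer every validity query identically, so the zero-length buffer carries no less information than the full bitmask; it just has to be special-cased.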

I can see the following options for dealing with this:

  1. Detect that the Buffer has zero length and return an object without a bitmask.
  2. Promote the nullable objects to optionally hold FillArrays of all 1's instead of a normal arrow bitmask.
  3. Allocate a new arrow formatted bitmask in Julia.

Of these options, 3 seems the worst, as it is potentially a huge performance sacrifice. Options 1 and 2 both have the disadvantage that the container types can no longer be uniquely predicted from the schema, though this issue seems somewhat worse in 1. Option 2 seems like a more complicated attempt at a solution that still doesn't really solve the problem, so I think 1 is the only real option.
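As a rough illustration of option 1, the following Python sketch picks the container type from the length of the validity buffer rather than from the schema. Everything here is hypothetical: `Primitive`, `NullablePrimitive`, and `build_column` are stand-in names for illustration and do not reflect the actual Arrow.jl or pyarrow APIs.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Primitive:
    """Hypothetical container for a column with no bitmask."""
    values: List[int]


@dataclass
class NullablePrimitive:
    """Hypothetical container for a column with a validity bitmask."""
    values: List[int]
    validity: bytes

    def get(self, i: int) -> Optional[int]:
        # Bits are assumed LSB-first, with 1 meaning non-null.
        if (self.validity[i // 8] >> (i % 8)) & 1:
            return self.values[i]
        return None


def build_column(null_count: int, validity: bytes,
                 values: List[int]) -> Union[Primitive, NullablePrimitive]:
    """Option 1: choose the container from the buffer, not the schema."""
    if len(validity) == 0:
        if null_count != 0:
            raise ValueError("validity buffer is empty but null_count > 0")
        return Primitive(values)
    return NullablePrimitive(values, validity)


no_nulls = build_column(0, b"", [1, 2, 3])          # zero-length bitmask
with_nulls = build_column(1, b"\x05", [1, 0, 3])    # slot 1 is null
```

The return type now depends on the data rather than on the schema, which is exactly the drawback noted above. The benefit is that no bitmask has to be allocated for columns without nulls.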

ExpandingMan commented 5 years ago

I'm discovering that one must be extremely careful about this issue in the new build functions. If a function is given an argument that's supposed to specify its eltype, one must avoid relying on that argument: sometimes the eltype of an inner container is different from what is claimed!