JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

what's going on with dictionary encoding (categorical arrays)? #77

Closed by ExpandingMan 6 years ago

ExpandingMan commented 6 years ago

The metadata for dictionary-encoded data seems all screwed up to me, and I'm starting to think that this is a problem with the feather format itself.

As things stand, missing values are represented by a reference of -1. The problem is that the Metadata.PrimitiveArray object describing the references has null_count > 0 even though there is no null bitmask, which is inconsistent. This would be OK if something else indicated what is going on, for instance if encoding were set to DICTIONARY, but currently that doesn't happen. In fact, it looks like encoding is never DICTIONARY in any case. Something's got to give. I still think I'm missing something about what goes on in this case, but regardless, the metadata seems confusing and inconsistent.
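To make the inconsistency concrete, here is a minimal sketch in plain Python. It is not Feather.jl code, and every name in it (decode_sentinel, null_count_from_sentinel, null_count_from_bitmask, pool, refs) is hypothetical. It assumes an Arrow-style validity bitmap in which a set bit means "valid" and bits are read least-significant first. A reader that trusts the -1 sentinel counts two nulls, while a reader that relies only on the (absent) bitmask counts none.

```python
from typing import List, Optional, Sequence


def decode_sentinel(refs: Sequence[int], pool: Sequence[str]) -> List[Optional[str]]:
    """Decode references into values, treating -1 as 'missing'."""
    return [None if r == -1 else pool[r] for r in refs]


def null_count_from_sentinel(refs: Sequence[int]) -> int:
    """Count missing values by looking for the -1 sentinel."""
    return sum(1 for r in refs if r == -1)


def null_count_from_bitmask(bitmask: Optional[bytes], length: int) -> int:
    """Count missing values from a validity bitmap (set bit = valid, LSB first).

    With no bitmap at all, this reader has no way to see any nulls.
    """
    if bitmask is None:
        return 0
    valid = sum((bitmask[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid


# Hypothetical data: a pool of category levels and references into it.
pool = ["a", "b", "c"]
refs = [0, -1, 2, 1, -1]

decoded = decode_sentinel(refs, pool)
sentinel_nulls = null_count_from_sentinel(refs)            # what null_count reports
bitmask_nulls = null_count_from_bitmask(None, len(refs))   # no bitmap present

# A bitmap that would agree with the sentinel: slots 0, 2 and 3 are valid.
matching_bitmask = bytes([0b00001101])
matching_nulls = null_count_from_bitmask(matching_bitmask, len(refs))
```

The two counts only agree once a bitmap such as matching_bitmask is actually written alongside the references, which is the mismatch described above.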

ExpandingMan commented 6 years ago

I think I was wrong and there is a bitmask after all. This still doesn't seem consistent with the Arrow standard, though. To be consistent with Arrow, wouldn't the references have to contain no nulls, while the array they reference contains the nulls?
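The two layouts being compared in this comment can be sketched as follows. This is plain Python with hypothetical names (decode_with_validity, decode_with_null_in_pool), and it only illustrates the question. It makes no claim about which layout the Arrow standard actually requires.

```python
from typing import List, Optional, Sequence


def decode_with_validity(
    refs: Sequence[int], valid: Sequence[bool], pool: Sequence[Optional[str]]
) -> List[Optional[str]]:
    """Layout A: nulls live on the references, flagged by a validity mask."""
    return [pool[r] if ok else None for r, ok in zip(refs, valid)]


def decode_with_null_in_pool(
    refs: Sequence[int], pool: Sequence[Optional[str]]
) -> List[Optional[str]]:
    """Layout B: references are never null; the pool itself holds a null slot."""
    return [pool[r] for r in refs]


# Layout A: masked-out references hold an arbitrary placeholder (0 here).
layout_a = decode_with_validity(
    [0, 0, 2, 1, 0],
    [True, False, True, True, False],
    ["a", "b", "c"],
)

# Layout B: index 3 points at a null entry appended to the pool.
layout_b = decode_with_null_in_pool(
    [0, 3, 2, 1, 3],
    ["a", "b", "c", None],
)
```

Both layouts decode to the same logical column, so the open question is only which of them the metadata is supposed to describe.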

ExpandingMan commented 6 years ago

Still confused by this, but it doesn't seem actionable within Feather.jl.