JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Handling categorical values from Arrow #110

Open bkamins opened 5 years ago

bkamins commented 5 years ago

Given the way Arrow treats nominal variables, maybe it would be cleaner to read them in as a PooledArray rather than a CategoricalArray. They are essentially a PooledArray, and we have recently been considering adding more support for this type in DataFrames.jl.

CC @nalimilan

nalimilan commented 5 years ago

Good question. Looking at the docs, it seems that the levels in what Arrow calls a "dictionary encoded" column can appear in an arbitrary order, which we could treat as either significant or not. The answer to that question should determine whether to return a CategoricalArray (the order is meaningful) or a PooledArray (the order is an implementation detail).
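To make the distinction concrete, here is a minimal sketch of dictionary encoding in plain Python. It is not Arrow's or Feather.jl's API: `DictEncoded`, `encode`, and `decode` are hypothetical names invented for illustration. It only shows that the same values can be stored with two different dictionary orders (and with an unused entry), so the order carries information only if the reader decides to treat it as such.

```python
from dataclasses import dataclass


@dataclass
class DictEncoded:
    """Hypothetical stand-in for a dictionary-encoded column."""
    dictionary: list  # distinct values, in whatever order the writer chose
    codes: list       # one integer index into `dictionary` per row


def encode(values, dictionary=None):
    """Encode values; by default the dictionary is in first-appearance order."""
    if dictionary is None:
        dictionary = list(dict.fromkeys(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return DictEncoded(list(dictionary), [index[v] for v in values])


def decode(col):
    """Recover the original values from codes and dictionary."""
    return [col.dictionary[c] for c in col.codes]


values = ["low", "high", "low", "mid"]

# Same data, two dictionary orders; the second also holds an unused entry.
a = encode(values)
b = encode(values, dictionary=["high", "low", "mid", "unused"])

assert decode(a) == decode(b) == values
assert a.dictionary != b.dictionary
```

If the dictionary order is treated as meaningful, `a` and `b` describe different categorical variables; if it is treated as an implementation detail, they are the same pooled data.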

I guess a good way to assess this is to see whether saving a factor from R and loading it again preserves the custom order of levels. I think this also applies to Pandas.

bkamins commented 5 years ago

You can check in Julia that saving a CategoricalArray using Feather.jl and loading it back retains all levels (even those not present in the vector; it is enough that they are present in levels) but does not keep their order.