JuliaData / DataTables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable

Start replacing Nullable with Null #62

Open nalimilan opened 7 years ago

nalimilan commented 7 years ago

@quinnj has just created the Nulls.jl package, which provides a new Null type to replace DataArrays' NAtype. Even if the Julia compiler doesn't yet include the optimizations needed to handle Union{T, Null} efficiently (see e.g. the discussion at https://github.com/JuliaData/Nulls.jl/issues/3), I think we should start moving away from Nullable now, so that we can at least stabilize the API even if performance remains poor for some time.

NullableArray can be replaced with Array{Union{T, Null}}, which Jameson said will eventually use the same memory layout as NullableArray. This should fit well with @cjprybol's PR https://github.com/JuliaData/DataTables.jl/pull/53, which is going to stop auto-promoting columns to NullableArray. CategoricalArray and NullableCategoricalArray will have to be adapted, but that shouldn't be too hard.
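To make the proposed replacement concrete, here is a minimal sketch of a column stored as Array{Union{T, Null}}. It is an illustration only: the `Null` type and the `null` constant are defined locally as hypothetical stand-ins for what Nulls.jl provides, so the snippet runs without any package, and the names `col`, `present` and `total` are made up for the example.

```julia
# Hypothetical stand-in for the Null type from Nulls.jl, defined locally
# so this sketch does not depend on any package.
struct Null end
const null = Null()

# A plain Array whose element type is Union{Int, Null}. This takes the
# place of a NullableArray{Int} column.
col = Union{Int, Null}[1, null, 3]

# Missing entries are ordinary values of type Null, so they can be
# skipped with a simple type check.
present = [v for v in col if !(v isa Null)]
total = sum(present)
```

The point of the sketch is that no wrapper container is involved: the column is an ordinary Array, and missingness is carried by the element type alone.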

ararslan commented 7 years ago

Why not just do this in DataFrames/DataArrays, since the approach there is already the closest to how Nulls.jl works?

nalimilan commented 7 years ago

Because storing columns as Array{Union{T, Null}} is going to be quite slow until (at least) Julia 1.0, and because AFAIK we don't want to continue using Nullable in the future. So it's better to keep DataFrames usable for now (maybe porting DataArrays to Null, but keeping them for their efficient memory layout) and to apply breaking changes to DataTables, which is still in an experimental state. After Julia 1.0 we should be able to make DataFrames and DataTables converge to a common representation.

davidanthoff commented 7 years ago

I would much prefer if we could keep DataTables.jl as the place for a container-based approach to missing values. At this point it is not clear whether the Union{T,Null} approach can work for the whole data ecosystem (e.g. Query.jl), and I don't think we should start converting anything in this repo until that is sorted out.

Why not do the Nulls.jl work in a branch in DataFrames.jl?