apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

(de)serialization behavior of `missing`/`nothing` #258

Open jrevels opened 3 years ago

jrevels commented 3 years ago

In Julia, there is (generally) a useful/meaningful semantic distinction between `nothing` and `missing`. IIUC, Arrow doesn't really have equivalent values that capture this distinction, but instead has a single `null`, which might be used for either. This leaves an impedance mismatch for us to resolve when (de)serializing `nothing`/`missing` data.

The current behavior feels like it "resolves" the impedance mismatch just by tossing this information altogether and normalizing to a single value, but the value it chooses to normalize to feels weird to me:

julia> Arrow.Table(Arrow.tobuffer((x = [missing, missing],))).x
2-element Arrow.NullVector{Missing}:
 missing
 missing

julia> Arrow.Table(Arrow.tobuffer((x = [nothing, nothing],))).x
2-element Arrow.NullVector{Nothing}:
 nothing
 nothing

julia> Arrow.Table(Arrow.tobuffer((x = [nothing, missing],))).x
2-element Arrow.NullVector{Nothing}:
 nothing
 nothing

julia> Arrow.Table(Arrow.tobuffer((x = Any[nothing, missing],))).x
2-element Arrow.NullVector{Missing}:
 missing
 missing

It seems to me like Arrow.jl should either:

  1. find some way to consistently preserve this distinction in all cases when (de)serializing Julia data (e.g. so that `[nothing, missing]` would roundtrip as `[nothing, missing]`), or
  2. lean all-in on dropping the distinction, and force callers to pick how they want incoming Arrow nulls interpreted (e.g. as `nothing` or as `missing`) at read time.
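
As a rough illustration of what option 2 could look like from the caller's side, here is a minimal sketch in plain Julia. `normalize_nulls` and its `null` keyword are hypothetical names used only for this example; they are not part of Arrow.jl, and the sketch operates on an ordinary vector rather than on an Arrow column.

```julia
# Hypothetical helper (NOT part of Arrow.jl): map every "null-like" Julia
# value in a column to the single sentinel the caller asks for.
function normalize_nulls(col; null=missing)
    return [(x === nothing || x === missing) ? null : x for x in col]
end

col = Any[1, nothing, missing]

# The caller decides, at read time, which sentinel represents a null.
as_missing = normalize_nulls(col; null=missing)
as_nothing = normalize_nulls(col; null=nothing)
```

Under this approach the `nothing`/`missing` distinction is dropped on purpose, but the result no longer depends on the element type of the vector that was originally written.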

ararslan commented 3 years ago

FWIW, option 2 is the approach JSON.jl takes with the `null` keyword argument to `parse`/`parsefile`. It defaults to `nothing`, but you can pass `null=missing`. This seems like a reasonable approach for Arrow to take.
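
For reference, a small sketch of the JSON.jl behavior described above. It assumes the JSON.jl package is installed in a version that supports the `null` keyword; the input string is made up for illustration.

```julia
using JSON  # assumes JSON.jl is installed

# Default: a JSON `null` is parsed as `nothing`.
default_parsed = JSON.parse("[1, null]")

# Opt in to `missing` instead by passing the `null` keyword.
missing_parsed = JSON.parse("[1, null]"; null=missing)
```

An analogous read-time keyword for Arrow would be a design choice for the maintainers; nothing here implies that such a keyword currently exists in Arrow.jl.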