JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

`join` should not introduce `Missing` types to schema #3431

Open adienes opened 3 months ago

adienes commented 3 months ago
julia> using DataFrames

julia> df1 = DataFrame([:a => [1,2,3], :b=>[4, 5, 6]])
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> df2 = DataFrame([:a => [1,2,3], :c=>[7, 8, 9]])
3×2 DataFrame
 Row │ a      c     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      7
   2 │     2      8
   3 │     3      9

julia> leftjoin(df1, df2; on=:a)
3×3 DataFrame
 Row │ a      b      c      
     │ Int64  Int64  Int64? 
─────┼──────────────────────
   1 │     1      4       7
   2 │     2      5       8
   3 │     3      6       9

There are no missing values after the join, so it is unfortunate that the type of `c` in the resulting table is a union with `Missing`.

bkamins commented 3 months ago

It is a union because there could be missing values in `:c` if `df2` did not contain all the keys (which cannot be checked upfront). You can use `disallowmissing!` after the join.
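A minimal sketch of that workaround, reusing the data from the report. The `df3` and `partial` names are made up for illustration, and the `error=false` call assumes a DataFrames.jl 1.x release where `disallowmissing!` accepts that keyword:

```julia
using DataFrames

df1 = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
df2 = DataFrame(a = [1, 2, 3], c = [7, 8, 9])

joined = leftjoin(df1, df2; on = :a)
eltype(joined.c)    # Union{Missing, Int64}

# Narrow the column type in place; this throws if a missing value is present.
disallowmissing!(joined, :c)
eltype(joined.c)    # Int64

# Hypothetical right table that lacks the key a == 3, so the join
# has to fill :c with missing for that row.
df3 = DataFrame(a = [1, 2], c = [7, 8])
partial = leftjoin(df1, df3; on = :a)
any(ismissing, partial.c)    # true

# With error=false, a column that really contains missing is left as it is.
disallowmissing!(partial, :c; error = false)
eltype(partial.c)   # Union{Missing, Int64}
```

Calling `disallowmissing!(joined)` without a column selector should apply the same narrowing to every column of the result.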