JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

feature request: allow `skipmissing` column types #3398

Open adienes opened 7 months ago

adienes commented 7 months ago

I understand the rationale for the very elaborate missing logic: it forces the user to be explicit about how to handle missing values and potentially avoids sneaky statistical bugs.

However, for "quick and dirty" tasks where you are just trying to make sense of some data, it quickly becomes cumbersome to constantly wrap things in `skipmissing`, `dropmissing`, etc.

I would love some way to tag columns (or the whole table) as `skipmissing`'ed so that all future transformations automatically insert a `skipmissing`, maybe something like `transform(df, All() .=> skipmissing)` or `skipmissing!(df)`.
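
For context, here is a minimal sketch of the friction described above, using only Base and the Statistics standard library. The vector `x` is made-up illustration data, not something from the issue.

```julia
using Statistics

# Illustrative data; not taken from the issue.
x = [1.0, missing, 3.0]

# Reductions propagate `missing` unless it is handled explicitly.
m_default = mean(x)                # evaluates to `missing`
m_skipped = mean(skipmissing(x))   # mean of the non-missing values only

# Every further reduction needs its own wrapper.
s_skipped = sum(skipmissing(x))
v_skipped = var(skipmissing(x))
```

The request is essentially to state the `skipmissing` choice once, on the column or table, instead of at every call site.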

bkamins commented 7 months ago

I understand your concern and share it. There is a wide difference between "production code" and "data discovery" workflows.

What you ask for is already doable with metadata. However, I think a better solution is to have a set of functions that provide alternative behaviors. This is what InMemoryDatasets.jl does: https://sl-solution.github.io/InMemoryDatasets.jl/stable/man/missing/#Functions-which-skip-missing-values. The question, though, is how to reach a common agreement on the approach across the package ecosystem.
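
As a rough sketch of the "alternative set of functions" idea, one could imagine something like the following. The names `skipping`, `smean`, `ssum`, and `smaximum` are hypothetical and chosen only for illustration; they are not the API of DataFrames.jl or InMemoryDatasets.jl.

```julia
using Statistics

# Hypothetical helper (an assumption for illustration, not an existing API):
# turn a reduction `f` into a variant that drops missing values first.
skipping(f) = f ∘ skipmissing

# Hypothetical missing-skipping counterparts of common reductions.
smean = skipping(mean)
ssum = skipping(sum)
smaximum = skipping(maximum)

x = [5, 1, missing, 30]   # made-up data
avg = smean(x)
total = ssum(x)
largest = smaximum(x)
```

The open design question from the comment remains: such functions would need to live somewhere the whole ecosystem agrees on, rather than being redefined in each package.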

mkitti commented 7 months ago

Judging from https://discourse.julialang.org/t/why-are-missing-values-not-ignored-by-default/106756/115?u=mkitti, it does not appear that hard to do. I'm not sure whether this should be part of DataFrames.jl, though.

julia> using CSV, DataFrames, Statistics

julia> struct SkipMissingDataFrame
           parent::DataFrame
       end

julia> Base.parent(smdf::SkipMissingDataFrame) = getfield(smdf, :parent)

julia> Base.getproperty(smdf::SkipMissingDataFrame, sym::Symbol) = skipmissing(Base.getproperty(parent(smdf), sym))

julia> write("blah.csv","""
       "col1", "col2"
       "5", "6"
       "1", "2"
       "30", "31"
       "22", "23"
       "NA"
       "50"
       """)
65

julia> df = CSV.read("blah.csv", DataFrame; silencewarnings=true);

julia> smdf = SkipMissingDataFrame(df)
SkipMissingDataFrame(6×2 DataFrame
 Row │ col1     col2    
     │ String3  Int64?  
─────┼──────────────────
   1 │ 5              6
   2 │ 1              2
   3 │ 30            31
   4 │ 22            23
   5 │ NA       missing 
   6 │ 50       missing )

julia> smdf.col2 |> mean
15.5

julia> smdf.col2 |> x->Iterators.filter(>(10),x) |> mean
27.0

nalimilan commented 7 months ago

This example is indeed simple, but as soon as you want to support operations on data frames, you have to reimplement all of the DataFrames.jl API. It's doable, but it takes quite some code.

This also creates new issues: `df.col3 = 2 .* df.col2` wouldn't work anymore.
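
One way to see the problem, sketched with a plain vector rather than the wrapper type (the values mirror `col2` from the example above): broadcasting over a `skipmissing` iterator collects the non-missing values first, so the result no longer has one entry per row of the parent table.

```julia
# Same values as col2 in the example above.
col2 = [6, 2, 31, 23, missing, missing]

# Broadcasting over the original column keeps one entry per row,
# with `missing` propagated in the last two positions.
doubled = 2 .* col2

# Broadcasting over the skipmissing iterator drops the missing rows first,
# so the result is shorter than the table.
doubled_skipped = 2 .* skipmissing(col2)

n_rows = length(doubled)
n_skipped = length(doubled_skipped)
```

Here `n_rows` is 6 while `n_skipped` is 4, so the shorter vector cannot be stored back as a column of the 6-row table without deciding how to realign it.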

I tend to think that this would be better handled with improved macros in DataFramesMeta.

See also https://github.com/JuliaData/DataFrames.jl/issues/2314.