JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Working with Nullable DataFrames #1148

Closed madeleineudell closed 7 years ago

madeleineudell commented 7 years ago

I'm maintaining a package that depends on DataFrames, and I'm a bit at a loss as to how to maintain the package at this point with the addition of the Nullable type.

My code uses plenty of arithmetic, indexing, type checking, etc etc. Now none of this works because most functions on Nullables don't propagate to the underlying value even if that instance of the nullable is not null. So addition doesn't work; subtraction doesn't work; type checking for integers or reals doesn't work; etc.

I could rewrite all these functions but that's a huge amount of work, and it's not clear to me from reading these issues whether the API is stable yet anyway. And I certainly won't be able to maintain backward compatibility, which saddens me.

What's your recommendation for package maintainers who depend on DataFrames?

nalimilan commented 7 years ago

See also https://github.com/JuliaStats/DataFrames.jl/issues/1092. It really depends on what the package does. Where possible, the plan is to use high-level APIs provided by StructuredQueries (not ready yet). Then, packages working on numeric data should just convert NullableArray to Array, or similarly call get on Nullable, since they cannot accept missing values anyway. Other cases will need special handling but it's hard to say anything without more details.

datnamer commented 7 years ago

Doesn't @davidgold have an autolifting scheme in the works?

nalimilan commented 7 years ago

Automatic lifting will only be enabled inside query macros. OTOH element-wise operators are now lifting in Julia 0.6.

datnamer commented 7 years ago

What about broadcasting of elwise?

nalimilan commented 7 years ago

What do you mean?

nalimilan commented 7 years ago

@madeleineudell Can you tell us more about the issues you experience?

madeleineudell commented 7 years ago

Sure; when I go to index elements of the DataFrame x = df[3,4], the element x is of type Nullable{whatever}. In the previous incarnation of DataFrames, x was of type whatever, and so methods designed for whatevers worked on x. For example, I have a bunch of code in which whatever is an Int or a Float64 or a Bool, and so I merrily take x and multiply, divide, exponentiate etc. Now, I'd need to have my code use x.value rather than x.

This wouldn't be a problem (other than a coding pain, because I can't just use a simple find-replace), except that my code is also designed to work with Matrices as well as with DataFrames. So that means that everywhere in my code I'd need to sprinkle if isa(x, Nullable) ..., or I'd need to define a new method for every function (+,-,exp,...) whose input might be nullable, etc. In other words, it's pretty annoying if DataFrames and Matrices are no longer interoperable.

If you overloaded all the functions on integers so that +(x::Nullable{Int}, y::Int) = +(x.value, y), for every operation, and ditto for other argument types (Float64, Bool, etc), then that would go some way towards fixing this. But I don't think every package maintainer who uses DataFrames should have to write all those macros. (And in my case, it would take me quite some time to figure out how to do so.)
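
As an illustration of the interoperability problem described above, here is a minimal sketch of a single unwrapping helper that lets the same numeric code accept both plain matrix elements and wrapped table elements. The `Maybe` type is a hypothetical stand-in for `Nullable{T}` so that the snippet is self-contained; `unwrap` and `halve` are likewise illustrative names, not part of DataFrames or Base. With the real `Nullable`, calling `get` on the value (as suggested earlier in the thread) plays the role of `unwrap`.

```julia
# Hypothetical stand-in for Nullable{T}; not the real Base type.
struct Maybe{T}
    hasvalue::Bool   # true when a value is present
    value::T         # payload; only meaningful when hasvalue is true
end

# Plain values pass through unchanged.
unwrap(x) = x
# Wrapped values are unpacked, and a missing value raises an error.
unwrap(x::Maybe) = x.hasvalue ? x.value : error("missing value")

# Numeric code is written once, against the unwrapped value.
halve(x) = unwrap(x) / 2

halve(4)               # element taken from a Matrix
halve(Maybe(true, 4))  # element taken from a table with wrapped entries
```

The design choice here is to pay the unwrapping cost in one place through dispatch, rather than adding a method for every operator or an `isa` check at every call site.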

nalimilan commented 7 years ago

Honestly, I don't think data frames were ever considered as equivalent to matrices (@johnmyleswhite may want to comment); they are more like databases. That said, NullableArray currently defines standard operators on Nullable, and Julia 0.6 supports element-wise versions for lifting, even when mixing nullable and non-nullable arguments, e.g. Nullable(1) .+ 1 -> Nullable(2). Would that suit your needs?
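
To make the idea of lifting concrete, here is a minimal sketch of what a lifted operation does: it applies a function to the underlying values and marks the result as missing if any wrapped argument is missing. Everything in it is an assumption for illustration: `Maybe` is a hypothetical stand-in for `Nullable{T}`, and `absent`, `payload`, `ispresent`, and `lift` are made-up names, not the mechanism Julia 0.6 or NullableArrays actually uses.

```julia
# Hypothetical stand-in for Nullable{T}; not the real Base type.
struct Maybe{T}
    hasvalue::Bool   # true when a value is present
    value::T         # payload; a placeholder when hasvalue is false
end

# A missing value of type T, carrying zero(T) as a placeholder payload.
absent(::Type{T}) where {T} = Maybe(false, zero(T))

# Read the payload and presence flag uniformly for wrapped and plain values.
payload(x::Maybe) = x.value
payload(x) = x
ispresent(x::Maybe) = x.hasvalue
ispresent(x) = true

# Lift f: compute on the payloads, then flag the result as missing
# unless every argument was present. Note that f still runs on the
# placeholder payload when an argument is missing.
function lift(f, args...)
    result = f(map(payload, args)...)
    return Maybe(all(ispresent, args), result)
end

a = lift(+, Maybe(true, 1), 1)  # mixes a wrapped and a plain argument
b = lift(+, absent(Int), 1)     # missing input gives a missing result
```

Here `a` is present with payload 2, mirroring the `Nullable(1) .+ 1` example, while `b` is flagged as missing.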

nalimilan commented 7 years ago

By the way, what package are we talking about? That would certainly help me to understand your requirements.

madeleineudell commented 7 years ago

I'm talking about LowRankModels in particular. We do PCA, sparse PCA, nonnegative matrix factorization, one-bit PCA etc on both fully observed matrices and partially observed tables of data.

I think that element-wise (automatic or accessible by using a single additional package) lifting would indeed suit my needs, but I'd need to check...!

nalimilan commented 7 years ago

OK. I would have thought you wouldn't actually have to work with data frames: you could use StatsModels to transform a dataframe+formula into a matrix, and then work only with standard matrices. Is that possible?

madeleineudell commented 7 years ago

I think it should be possible to use StatsModels, but it requires a complete rethinking of the way modeling works in LowRankModels.

kleinschmidt commented 7 years ago

It might be as simple as adding a separate constructor that takes a DataFrame and converts it to a matrix via ModelMatrix(ModelFrame(d, f)) (with an appropriately constructed formula).

Alternatively, is it possible to convert(Matrix, d)?

kleinschmidt commented 7 years ago

Or, use the StatisticalModel type from StatsBase (which will be moved to StatsModels soon, once someone finds the time), and then you get methods for dataframes + formulas for free (via the DataFrameStatisticalModel type, https://github.com/JuliaStats/StatsModels.jl/blob/master/src/statsmodel.jl). The only pain there is constructing an appropriate formula.

kleinschmidt commented 7 years ago

Sorry, I see now that you're doing something more interesting than just converting the dataframe with numeric columns to a matrix. I still think that it makes sense to use ModelMatrix. In fact, the goal of StatsModels.jl is to provide tools for converting tables with a mix of categorical/numerical/etc. data into matrices for modeling, so yours is exactly the kind of use case we're aiming for.

datnamer commented 7 years ago

What is the cost of converting a very large dataframe or database? Won't this be prohibitive sometimes vs operating in place?

kleinschmidt commented 7 years ago

Sure, but as far as I can tell LowRankModels already assumes things will fit in memory (but I could be wrong). One of the goals is to generalize the current implementation to work with a more general interface that doesn't assume things are in-memory but can work with, say, chunks of a table or an iterator of row tuples.

ExpandingMan commented 7 years ago

I just wanted to throw my comments in here as someone else who has been using the current master and NullableArrays extensively, but hasn't written a single line of code for DataFrames.jl itself.

From my experience so far, DataFrames with the NullableArrays back-end needs 3 major quality-of-life improvements before developers of other packages can be reasonably expected to use it:

nalimilan commented 7 years ago

Please, let's not turn this issue into another discussion of the general roadmap. These have already happened and are happening in other places.

> Easy querying. This means operations like masking df[df[:A] .> 0.0, :] have to be made to work easily again. I've set up some of my own macros for doing things like @constrain(df, :A > 0.0). I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.

See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266. Though we obviously won't respect the schedule, which may imply changes in strategy, cf. https://github.com/JuliaStats/DataFrames.jl/issues/1154.

> Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.

https://github.com/JuliaStats/DataFrames.jl/issues/1119

> A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.

https://github.com/JuliaLang/Juleps/pull/21

ararslan commented 7 years ago

@madeleineudell You may continue to use DataFrames with your package, as the Nullable-based code has been separated into the DataTables package.

madeleineudell commented 7 years ago

Great!

ararslan commented 7 years ago

I'm going to close this issue now, as any further concerns regarding Nullables in this context should be posted to the DataTables repo. Thank you for the initial report--I think it was instrumental in our decision to separate the projects.