See also https://github.com/JuliaStats/DataFrames.jl/issues/1092. It really depends on what the package does. Where possible, the plan is to use high-level APIs provided by StructuredQueries (not ready yet). Then, packages working on numeric data should just convert NullableArray to Array, or similarly call get on Nullable, since they cannot accept missing values anyway. Other cases will need special handling, but it's hard to say anything without more details.
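A minimal sketch of the NullableArray-to-Array and get-on-Nullable conversions suggested above, assuming the NullableArrays.jl API of the time; the array contents here are made up:

```julia
using NullableArrays

x = NullableArray([1.0, 2.0, 3.0])   # no missing entries
a = convert(Array, x)                # plain Vector{Float64}; this errors if any entry is null

y = Nullable(4.0)
get(y) + 1.0                         # unwrap a scalar Nullable, then compute as usual
```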
Doesn't @davidgold have an autolifting scheme in the works?
Automatic lifting will only be enabled inside query macros. OTOH, element-wise operators are now lifted in Julia 0.6.
What about broadcasting of elwise?
What do you mean?
@madeleineudell Can you tell us more about the issues you experience?
Sure; when I go to index elements of the DataFrame, x = df[3,4], the element x is of type Nullable{whatever}. In the previous incarnation of DataFrames, x was of type whatever, and so methods designed for whatevers worked on x. For example, I have a bunch of code in which whatever is an Int or a Float64 or a Bool, and so I merrily take x and multiply, divide, exponentiate etc. Now, I'd need to have my code use x.value rather than x.
This wouldn't be a problem (other than a coding pain, because I can't just use a simple find-replace), except that my code is also designed to work with Matrices as well as with DataFrames. So that means that everywhere in my code I'd need to sprinkle if isa(x, Nullable) ..., or I'd need to define a new method for every function (+, -, exp, ...) whose input might be nullable, etc. In other words, it's pretty annoying if DataFrames and Matrices are no longer interoperable.
If you overloaded all the functions on integers so that +(x::Nullable{Int}, y::Int) = +(x.value, y), for every operation, and ditto for other argument types (Float64, Bool, etc), then that would go some way towards fixing this. But I don't think every package maintainer who uses DataFrames should have to write all those macros. (And in my case, it would take me quite some time to figure out how to do so.)
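For concreteness, a sketch of the kind of per-operator definitions described above; this is illustrative only, not something package authors should have to maintain by hand:

```julia
import Base: +, -, *

# One method per operator and per argument combination; get(x) throws if x is null.
+(x::Nullable{Int}, y::Int)         = get(x) + y
+(x::Int, y::Nullable{Int})         = x + get(y)
-(x::Nullable{Float64}, y::Float64) = get(x) - y
*(x::Nullable{Bool}, y::Bool)       = get(x) * y
# ...and so on for /, ^, exp, etc., and for every other element type.
```

Multiplied across every operator and every element type, this clearly does not scale, which is the point being made above.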
Honestly, I don't think data frames were ever considered as equivalent to matrices (@johnmyleswhite may want to comment); they are more like databases. That said, NullableArrays currently defines standard operators on Nullable, and Julia 0.6 supports element-wise versions for lifting, even when mixing nullable and non-nullable arguments, e.g. Nullable(1) .+ 1 -> Nullable(2). Would that suit your needs?
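A quick sketch of what that lifting looks like, assuming Julia 0.6 with NullableArrays loaded; the data is made up:

```julia
using NullableArrays

Nullable(1) .+ 1        # Nullable(2)
Nullable{Int}() .+ 1    # a null Nullable{Int}; missingness propagates

x = NullableArray([1.0, 2.0, 3.0], [false, true, false])  # second entry is missing
x .+ 1.0                # element-wise lifting; the missing entry stays missing
```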
By the way, what package are we talking about? That would certainly help me to understand your requirements.
I'm talking about LowRankModels in particular. We do PCA, sparse PCA, nonnegative matrix factorization, one-bit PCA etc on both fully observed matrices and partially observed tables of data.
I think that element-wise lifting (automatic, or accessible by using a single additional package) would indeed suit my needs, but I'd need to check!
OK. I would have thought you wouldn't actually have to work with data frames: using StatsModels to transform a dataframe + formula into a matrix, and then working only with standard matrices. Is that possible?
I think it should be possible to use StatsModels, but it requires a complete rethinking of the way modeling works in LowRankModels.
It might be as simple as adding separate DataFrame constructors that convert it to a matrix via ModelMatrix(ModelFrame(d, f)) (with an appropriately constructed formula). Alternatively, is it possible to convert(Matrix, d)?
Or, use the StatisticalModel type from StatsBase (which will be moved to StatsModels soon, once someone finds the time), and then you get methods for dataframes + formulas for free (via the DataFrameStatisticalModel type, https://github.com/JuliaStats/StatsModels.jl/blob/master/src/statsmodel.jl). The only pain there is constructing an appropriate formula.
Sorry, I see now that you're doing something more interesting than just converting a dataframe with numeric columns to a matrix. I still think that it makes sense to use ModelMatrix. In fact, the goal of StatsModels.jl is to provide tools for converting tables with a mix of categorical/numerical/etc. data into matrices for modeling, so yours is exactly the kind of use case we're aiming for.
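For example, here is a rough sketch of that route, assuming the DataFrames/StatsModels APIs of the time; the column names, data, and formula are made up, and the exact formula syntax (@formula(y ~ a + b) versus a bare y ~ a + b) depends on the StatsModels version:

```julia
using DataFrames, StatsModels

df = DataFrame(y = [1.0, 2.0, 3.0], a = [0.5, 1.5, 2.5], b = [10.0, 20.0, 30.0])

mf = ModelFrame(@formula(y ~ a + b), df)
mm = ModelMatrix(mf)
X  = mm.m    # a plain Matrix{Float64} (intercept column plus a and b) for LowRankModels to consume
```

Categorical columns would be expanded into indicator columns by the same machinery, which is what makes this attractive for mixed-type tables.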
What is the cost of converting a very large dataframe or database? Won't this be prohibitive sometimes vs operating in place?
Sure, but as far as I can tell LowRankModels already assumes things will fit in memory (but I could be wrong). One of the goals is to generalize the current implementation to work with a more general interface that doesn't assume things are in-memory but can work with, say, chunks of a table or an iterator of row tuples.
I just wanted to throw my comments in here as someone else who has been using the current master and NullableArrays extensively, but hasn't written a single line of code for DataFrames.jl itself.
From my experience so far, DataFrames with the NullableArrays back-end needs 3 major quality-of-life improvements before developers of other packages can be reasonably expected to use it:
1. Easy querying. This means operations like masking df[df[:A] .> 0.0, :] have to be made to work easily again. I've set up some of my own macros for doing things like @constrain(df, :A > 0.0). I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArrays.
2. Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually, at the end of the day, missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors (see the sketch after this list). Mixed column types should work whenever it is reasonable for them to do so.
3. A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.
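A sketch of the conversion meant in point 2, using only the Base Nullable accessors (isnull/get) so as not to assume any particular helper function; the column and fill value are made up:

```julia
using DataFrames, NullableArrays

df = DataFrame(A = NullableArray([1.0, 2.0, 3.0], [false, true, false]))  # second entry missing

# Fill the missing entries with a chosen value and collect a plain Vector:
a = [isnull(x) ? 0.0 : get(x) for x in df[:A]]   # Vector{Float64}
```

Whether a column like that can then be stored back in the dataframe as a plain Vector is exactly the question raised in point 2.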
Please, let's not turn this issue into another discussion of the general roadmap. These have already happened and are happening in other places.
Easy querying. This means operations like masking df[df[:A] .> 0.0, :] have to be made to work easily again. I've set up some of my own macros for doing things like @constrain(df, :A > 0.0). I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.
See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266. Though we obviously won't respect the schedule, which may imply changes in strategy, cf. https://github.com/JuliaStats/DataFrames.jl/issues/1154.
Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.
https://github.com/JuliaStats/DataFrames.jl/issues/1119
A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.
@madeleineudell You may continue to use DataFrames with your package, as the Nullable-based code has been separated into the DataTables package.
Great!
I'm going to close this issue now, as any further concerns regarding Nullables in this context should be posted to the DataTables repo. Thank you for the initial report; I think it was instrumental in our decision to separate the projects.
I'm maintaining a package that depends on DataFrames, and I'm a bit at a loss as to how to maintain the package at this point with the addition of the Nullable type.
My code uses plenty of arithmetic, indexing, type checking, etc etc. Now none of this works because most functions on Nullables don't propagate to the underlying value even if that instance of the nullable is not null. So addition doesn't work; subtraction doesn't work; type checking for integers or reals doesn't work; etc.
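A minimal illustration of the failure mode described here, assuming plain Julia 0.5 Base without NullableArrays' operator definitions loaded:

```julia
x = Nullable(3)

x + 1             # MethodError: no + method for (Nullable{Int64}, Int64)
isa(x, Integer)   # false, even though the wrapped value is an Int
get(x) + 1        # 4 -- works, but requires touching every call site
```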
I could rewrite all these functions but that's a huge amount of work, and it's not clear to me from reading these issues whether the API is stable yet anyway. And I certainly won't be able to maintain backward compatibility, which saddens me.
What's your recommendation for package maintainers who depend on DataFrames?