JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Creating new columns on a `view` should fill in missings everywhere else. #2211

Closed pdeffebach closed 2 years ago

pdeffebach commented 4 years ago

This has been kicking around in my head for a while: a SubDataFrame should allow creation of new columns.

Current behavior is:

julia> df = DataFrame(
               id = rand(["a", "b", "c", "d"], 100),
               a = rand(100),
               b = 100 * randn(100),
               c = rand([100, 200, 300], 100)
       );

julia> sdf = view(df, 1:50, :);

julia> sdf[:, :d] = rand(nrow(sdf))
ERROR: ArgumentError: column name :d not found in the data frame; existing most similar names are: :a, :b, :c and :id

I propose we add a convenience function if, to enable a workflow like the following (written with magrittr-style pipes for convenience):

df %>%
    if(:a => ByRow(!ismissing)) %>%
    transform(:b => (t -> t .- mean(t)) => :d)

The transform function only knows about rows where :a is not missing. This replicates the behavior of Stata's if syntax.

bkamins commented 4 years ago

I do not know Stata well enough. Can you show a minimal example of expected input and output please? I understand that rows that do not meet the predicate are filled with missing or with some definable sentinel value?

bkamins commented 4 years ago

I was thinking about it today. If we added it would you also request this to be added to DataFrameRow and both for setindex! and broadcasting? All this is technically doable but I would be reluctant to add this functionality without a strong reason.

Therefore it would be great if you gave some use-cases justifying the need, and in particular showing that just using broadcasted ifelse or a comprehension to create that column would not be enough.

pdeffebach commented 4 years ago

I was thinking about it today. If we added it would you also request this to be added to DataFrameRow and both for setindex! and broadcasting?

I would not advocate it for DataFrameRow but I would like it for setindex!.

In Stata, it is very common to add if clauses to the ends of commands such that you only modify a subset of rows.

gen income_low = income if filled_out_income_form & income < 10 
egen income_demeaned_low = income - mean(income) if !missing(form_id)
replace income = 100 if income > 100 // only replace a subset of observations

The first command will copy over values in column income to income_low, but only for observations that satisfy these conditions. missing is filled in elsewhere. The second command's mean function only applies to observations that are not missing form_id and fills in missing values similarly.
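
For concreteness, a rough Julia equivalent of the first command using today's tools (a sketch with ifelse, not the proposed syntax; the column names are just illustrative):

using DataFrames

df = DataFrame(income = [5.0, 50.0, 8.0],
               filled_out_income_form = [true, true, false])

# copy income where the condition holds, missing elsewhere
df.income_low = ifelse.(df.filled_out_income_form .& (df.income .< 10),
                        df.income, missing)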

This is a great syntax. Stata users love it. It reads very nicely like a sentence. "Generate this variable if it satisfies these conditions"

Not only is this feature useful for end users, it's also very easy for programmers to build on. When writing a Stata program, the opening syntax works like this:

program define myfun
    syntax varlist [if], option(string)
    marksample touse // "to use"
    gen t = 1 if `touse'
end

The command marksample touse creates a variable touse which is equal to 1 if the if conditions in the function call are satisfied and 0 otherwise. Now the writer of the function can just add if qualifiers to everything they do and not worry about problems with other observations. But no rows are lost because an actual subset is never performed.
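
For comparison, here is a rough sketch of the same "touse" mask pattern in Julia (myfun! and the column names are hypothetical, not an existing API):

using DataFrames

function myfun!(df::DataFrame, touse::AbstractVector{Bool})
    df.t = ifelse.(touse, 1, missing)   # gen t = 1 if `touse'
    return df
end

df = DataFrame(x = [0.5, 2.0, missing, 3.0])
myfun!(df, coalesce.(df.x .> 1, false))   # apply only where x > 1, treating missing as false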

Why this is good syntax:

  1. Julia requires a lot of care with missing values. Perhaps the user knows that a subset of their observations have complete data. They can pass a sub-dataframe to their function modifying the data-frame and not worry about implementation.
  2. Having many data-frames in memory is scary. Being able to have exactly one data frame in memory and use transform!, etc. exclusively means it is easier to work with large datasets where you don't know which observations and values you will need later on.

Here is a very rough outline for how it might be implemented.

  1. Add a field to SubDataFrame indicating whether it can be modified. The default is no; it can only be modified if it is constructed via a function I will call ifdataframe for now. Construction would look like

    ifdataframe(df, :x => (t -> t .> mean(t)))

All of the Pair{Symbol, Function} arguments return boolean arrays.

  2. Now a user can write a function modify!(df::AbstractDataFrame) and add new columns as needed. modify!(ifdataframe(df, :x => (t -> t .> mean(t)))) will only modify rows with above-average values of :x. transform! could be one such function. A rough emulation is sketched below.
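
A minimal emulation of the row-selection half of this hypothetical ifdataframe (nothing here exists in DataFrames.jl, and the fill-with-missing write-back is not covered):

using DataFrames, Statistics

function ifdataframe(df::AbstractDataFrame, p::Pair)
    col, f = p
    view(df, coalesce.(f(df[!, col]), false), :)   # rows where the predicate holds
end

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0])
sdf = ifdataframe(df, :x => (t -> t .> mean(t)))   # rows where x is above average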

This is a feature that both @jmboehm and @matthieugomez have requested, without stating it as formally.

matthieugomez commented 4 years ago

Tbh, I don't think this is first order (ifelse does the same thing, albeit less elegantly). Maybe adding a specific type via where(), similarly to groupby(), would be nice, but it's more of an upgrade than a fundamental change.

nalimilan commented 4 years ago

Supporting this could make sense, but if we allow it it should work on any SubDataFrame. No need to introduce special cases. Though maybe convenience methods could be added to view to allow using the same syntax as filter to choose which rows should be retained.

bkamins commented 4 years ago

Currently SubDataFrame supports all that is requested here except two cases:

  1. an existing column in the parent would have to have its eltype changed (as it requires promoting the eltype of the column)
  2. a new column is requested to be created

Also I think immediately after we add this we will get a request for two additional features:

Therefore in the long term I would rather think of appropriate kwargs to select et al. functions, but I would leave this for later as this is mostly convenience functionality.

Still, in this thread we can try to work out the API.

matthieugomez commented 4 years ago

So I've been thinking about that a bit more. Here is an idea for a syntax that builds on the recent changes (to be clear: it's not a request ;))

# create a new column equal to mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) transform!(:x => mean)
# return a DataFrame with the mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) combine(:x => mean)

The idea is that by and where are clauses.

The design is of course inspired by Stata and SQL, where where clauses (if in Stata) and by clauses can be applied to any method.

A nice thing about the current design (https://github.com/JuliaData/DataFrames.jl/pull/2214) is that these clauses don’t have to live in DataFrames, so one could try them in an outside package first.

jmboehm commented 4 years ago

Following up on Matthieu's idea, it seems to me a large class of operations on tabular data df_1,..., df_k can be described as a triple of

Split-apply-combine, reshapes, merges etc are all examples of these. Of course the above is so general that it's probably useless to think of in this generality. Still, the question is where you want to provide an interface to reduce the dimensionality of the above.

Hence: should by and where be the only clauses? Why have these two if the meaning varies across methods? My pet peeve is, for example, that I never remember whether if in Stata applies to the assignment or to the subset of the dataframe/group...

bkamins commented 4 years ago

We are converging here towards a full:

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

combo :smile: (of course it would be nice to have it all covered). Probably, as @matthieugomez suggests, it would be good to experiment with it in some side package (or a PR that would not be intended to be merged) so that the syntax could be worked out without rushing.

pdeffebach commented 4 years ago

# create a new column equal to mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) transform!(:x => mean)
# return a DataFrame with the mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) combine(:x => mean)

Yes this is exactly my proposal.

I should add that I don't think what you describe is that different from the proposal for grouped transforms. A grouped data frame is, after all, a data frame with some information about row indexes for each group attached to it.

A sub data frame is just a dataframe with row-indices attached to it.

Composability is nice, though. I will make sure to explore this more thoroughly after Finals and probably after Quals as well.

matthieugomez commented 4 years ago

@pdeffebach yes, without where it’s exactly what https://github.com/JuliaData/DataFrames.jl/pull/2214 already does.

The main difference with your proposal is to use a type different from SubDataFrame (and, in particular, that does not inherit from AbstractDataFrame)

bkamins commented 4 years ago

A sub data frame is just a dataframe with row-indices attached to it.

also column indices

floswald commented 4 years ago

Yes, I am missing the functionality of adding/modifying a column on a subset of a dataframe only. I think it's really important. I think a good reference here (an alternative to the magrittr pipe mentioned above) is the R data.table setup. The simple subsetting with [ is really clean IMHO. Not saying that's the way to go, but it's a reference.

> library(data.table)
data.table 1.12.8 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt = data.table(a = 1:5, b = rnorm(5))
> dt
   a          b
1: 1  1.2576032
2: 2  1.6769647
3: 3  1.9672758
4: 4 -0.4590235
5: 5 -2.0694431
> dt[ a > 3, newcol := b + 2 ]  # do where (a>3) what (add newcol)
> dt
   a          b      newcol
1: 1  1.2576032          NA
2: 2  1.6769647          NA
3: 3  1.9672758          NA
4: 4 -0.4590235  1.54097647
5: 5 -2.0694431 -0.06944313

bkamins commented 4 years ago

What you write is currently supported, but with a different syntax:

julia> dt = DataFrame(a=1:5, b=randn(5))
5×2 DataFrame
│ Row │ a     │ b         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ -0.142721 │
│ 2   │ 2     │ 0.888727  │
│ 3   │ 3     │ -1.39633  │
│ 4   │ 4     │ 0.617181  │
│ 5   │ 5     │ 0.974018  │

julia> dt[!, :b] = ifelse.(dt.a .> 3, dt.b .+ 2, missing)
5-element Array{Union{Missing, Float64},1}:
  missing
  missing
  missing
 2.6171813485634132
 2.9740178167229954

julia> dt
5×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64? │
├─────┼───────┼──────────┤
│ 1   │ 1     │ missing  │
│ 2   │ 2     │ missing  │
│ 3   │ 3     │ missing  │
│ 4   │ 4     │ 2.61718  │
│ 5   │ 5     │ 2.97402  │

What we do not allow is for a view to add columns to its parent (in particular, a view may omit a column :b that its parent has)

floswald commented 4 years ago

Oh I see! That's good then. Sorry about the noise.

bkamins commented 4 years ago

You are welcome. Note also that with ifelse you have a choice of sentinel, as in some cases missing is not wanted, but you might e.g. prefer nothing or 0.
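
For example, a minimal sketch of the 0-sentinel variant on a fresh data frame:

using DataFrames

df = DataFrame(a = 1:5, b = randn(5))
df[!, :b] = ifelse.(df.a .> 3, df.b .+ 2, 0.0)   # non-selected rows get 0.0 and the column stays Float64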

bkamins commented 4 years ago

Given #2329, if you have some comments after some time has passed, please add your opinion. Thank you!

pdeffebach commented 4 years ago

I think I may try my hand at an IfDataFrame that functions as a mutable view. We will see how much success this gets. I will copy over the SubDataFrame code and change what's needed.

bkamins commented 4 years ago

I think I may try my hand at an IfDataFrame

Actually, if we see the value in allowing the view to modify parents, I would prefer not to add a new object that is very similar to the one we have.

I understand the rules would be:

  1. if the column exists - all works for SubDataFrame as it works now
  2. you are still not allowed to add rows via SubDataFrame (i.e. you are not allowed to push! and append! to it, as some columns that are not in the view might not allow missing)
  3. you are allowed to create a new column for a SubDataFrame IF this column name does not exist in the parent; then this column:
    • would be added as the last column in the parent
    • would always be Union{Missing, RequestedType} and would fill with missing the values in rows that are not selected
    • we would then need to add setindex! and broadcast! implementations that would handle this case (broadcast! will be hardest), keeping the rule we have now:

Note that sdf[!, col] = v, sdf[!, cols] = v and sdf.col = v are not allowed as sdf can only be modified in-place.
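
In other words, something like this manual emulation of rule 3 on the parent (a sketch; the column name :d is illustrative, and the automatic version via a view is what is being proposed):

using DataFrames

df = DataFrame(a = 1:5)
rows = df.a .> 3                                   # the rows the view would select
df[!, :d] = Vector{Union{Missing, Float64}}(missing, nrow(df))
df[rows, :d] = rand(count(rows))                   # only the selected rows get values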

Is this what you think would be good?

If yes - and you go for implementing it - the key challenge is the handling of SubIndex, so that after adding a column to a view this view gets the requested column added. As a particular case, note that if you use : to create a view, then it uses Index, not SubIndex, so the situation is different (simpler).

Actually - maybe we want to allow this operation only if SubDataFrame has ALL columns selected with : (this is probably the use case that we have in practice)? If this is the case then the implementation will be much easier, but we will need to carefully document it.

bkamins commented 4 years ago

Having thought more about it, I am leaning even more towards allowing adding columns only in view(df, row_selector, :). We could even then define insertcols! for this case as well.

Then we would have to highlight in the documentation how different view(df, row_selector, :) is from view(df, row_selector, col_selector), but it is already very different, so it is OK to have this distinction.

The general idea is that when you write view(df, row_selector, :) you explicitly say you want to subset only rows (and already currently such view gets resized if df is resized), while view(df, row_selector, col_selector) asks for slicing, so then it is natural to expect that it does not allow adding columns (and it does not get resized if you change the parent).

pdeffebach commented 4 years ago

Thanks for this! I think this is what I want. I hadn't thought about the push! and append! cases, but you are right.

One major reason for this logic is that it means we wouldn't have to do as much work in #2258. We could overload transform for a SubDataFrame and it would "just work".

This would also mean that our "piping" functions are less special than working normally with DataFrame objects. cc @matthieugomez about this.

matthieugomez commented 4 years ago

As mentioned earlier in the thread, I think I would still prefer a WhereDataFrame type that disappears after a transform/select/combine call.

For instance,

transform!(where(df, :x => >(1)), :x => mean)
transform!(where(df, :x => x -> !ismissing(x)), :x => mean)

would return the original (parent) DataFrame.

But maybe having such a functionality would be too complex.

bkamins commented 4 years ago

Another option, if we only need this functionality in piping and select!/transform! etc., would be to add a where kwarg to them, as discussed earlier, so you would write:

transform!(df, where = :x => >(1), :x => mean)
transform!(df, where = :x => !ismissing, :x => mean)

The question is whether we also want to be able to do such things in imperative style, so:

dfv = @view df[some_rows, :]
dfv[:, newcol] .= 1 # and :newcol has 1 for some_rows and missing in other places

If we need it only for the piping style, I feel I would prefer to use a kwarg for this in the functions where we want to support it.

pdeffebach commented 4 years ago

I don't think we only need this functionality in piping. I think this reasoning would be useful for getindex and setindex! as well. That's why I propose this, so we can get transform "for free".

This solution also strikes me as very "julian" since it relies on method overloading and is similar to our use of groupby.

bkamins commented 4 years ago

So, as you have commented in the other thread, to support getindex and setindex! we would need another type (actually we could extend GroupedDataFrame to handle this use case, but I think it would be mental overkill).

So it would have to be FilteredDataFrame (or WhereDataFrame); it would be created by filter with view=true and would follow the logic we have described here (in particular, being automatically unwrapped after transformations).

I would propose to think about it a bit and decide whether users will have an easy time understanding how it works or not. Probably writing some more code snippets that would show how it should be used would be nice.

In particular let me stress what tension we have here:

  1. use SubDataFrame for this - we can provide all you want and it is a small change, but you will get a DataFrame with filtered rows from transform/select, and after transform!/select! you will still get a SubDataFrame as the result
  2. use FilteredDataFrame for this - this is much more work; all would behave the same as in point 1, except that it would return a DataFrame (either a new one or the parent) updated with all rows. Also in this case we would have to decide what behavior we want if FilteredDataFrame is called on a SubDataFrame, as it is a corner case we should decide what to do with (in particular, note that currently SubDataFrame allows for duplicate rows in the view).

nalimilan commented 4 years ago

I agree that adding a new type of data frame view is probably overkill. I'd rather recommend that people use in-place operations like transform!, and instead of piping use the original data frame, like this:

df = DataFrame(...)
transform!(filter(:x => >(1), df, view=true), :x => mean)
df
...

Of course, if you prefer piping, you can add parent() to the series of operations.
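
For example (a sketch; as in the snippet above, transform! creates the new column :x_mean through the view, i.e. the behavior discussed in this issue):

using DataFrames, Statistics

df = DataFrame(x = [0.5, 2.0, 3.0])
transform!(filter(:x => >(1), df, view=true), :x => mean) |> parent   # returns the full df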

Adding a where/filter keyword argument to select and transform is another option (one that can be added at any time), but I'm a bit hesitant because it's hard to know where to stop: we would probably have to add it to groupby too, then maybe also to sort, etc.

bkamins commented 4 years ago

it's hard to know where to stop

That was my fear. Having this via view is a clean solution that can be handled in the implementation without complicating the code with "special cases" (sorry for caring about this - but I also want to avoid making the code base so complicated that at some point no one will be able to understand how it works; we already have several places where it is really complicated, especially with GroupedDataFrame).

matthieugomez commented 4 years ago

I really think a WhereDataFrame would be an awesome addition to the package. Beyond this issue, it would solve at least two other open issues:

Just to clarify the proposal, the function where (or similarly named) creates a WhereDataFrame from a DataFrame and a couple of conditions:

where(df, :x => (x -> x .>= 1), :y => (y -> y .>= 2))

In contrast with filter, conditions are column-wise by default, and rows that evaluate to missing are not kept (i.e. coalesce to false by default). WhereDataFrame does not inherit from DataFrame.

Then, the functions left to define on a WhereDataFrame are:

  • filter(wdf::WhereDataFrame): return a DataFrame with only the selected rows
  • transform(wdf::WhereDataFrame, args...): return a DataFrame where only the selected rows are modified, and missing otherwise
  • combine(wdf::WhereDataFrame, args...): return the same thing as combine(filter(wdf), args...)
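
For illustration, a rough skeleton of this proposal (all names here are hypothetical and nothing below exists in DataFrames.jl; the constructor is named wheredf rather than where to keep the sketch self-contained, and only a transform-like case with a scalar-reducing function is sketched):

using DataFrames, Statistics

struct WhereDataFrame
    parent::DataFrame
    rows::Vector{Bool}
end

function wheredf(df::DataFrame, conds::Pair...)
    mask = trues(nrow(df))
    for (col, f) in conds
        mask .&= coalesce.(f(df[!, col]), false)   # column-wise; missing coalesces to false
    end
    return WhereDataFrame(df, collect(mask))
end

# transform-like behavior: selected rows get the result, the other rows get missing
function transform_where(wdf::WhereDataFrame, p::Pair)
    src, rest = p                                  # src => fun => dest
    fun, dest = rest
    res = fun(wdf.parent[wdf.rows, src])           # assumes fun reduces the column to a scalar
    out = Vector{Union{Missing, typeof(res)}}(missing, nrow(wdf.parent))
    out[wdf.rows] .= res
    wdf.parent[!, dest] = out
    return wdf.parent
end

df = DataFrame(x = [0.5, 2.0, missing, 3.0])
transform_where(wheredf(df, :x => (x -> x .>= 1)), :x => mean => :x_mean)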

This does not need to happen now, since this proposal is not breaking, but I just wanted to say I think it's the right way forward to solve a number of issues we've been talking about recently.

bkamins commented 4 years ago

I am OK with where as noted elsewhere (i.e. drop missing, col-wise by default, reverse argument order).

The question is whether we need another type. An alternative we have been discussing is to allow a SubDataFrame with : as the column selector to play this role. The difference would be that it would work only for "in-place" functions: select!, transform!, setindex!, insertcols! and broadcasting assignment.

In this case we should just add a view kwarg to where, as proposed in https://github.com/JuliaData/DataFrames.jl/pull/2386 (the benefit of this design would be that all functions listed there plus view could be used to "filter" the data frame and then pass it for transformations, not just where).

Having a separate WhereDataFrame type has the benefit that you could define copying select and transform functions for it that would produce the whole original data frame, but copied. I am not sure this is that valuable - but maybe I am wrong. Let us discuss it.

bkamins commented 4 years ago

Ah - now I see one difference between my approach and yours when you would assign to an EXISTING column (as opposed to creating a new column).

I am not sure which approach should be preferred in practice (maybe both).

nalimilan commented 4 years ago

The latter behavior could be obtained with transform!(sdf, col => fun => col), which should replace column col with a new one (just like it currently does for DataFrame) rather than replace its contents in-place.

I think we should first see how far we can get with DataFramesMeta without introducing new types, and then we'll see whether it's still worth it.

bkamins commented 4 years ago

The latter behavior could be obtained with transform!(sdf, col => fun => col), which should replace column col with a new one (just like it currently does for DataFrame) rather than replace its contents in-place.

It could - there is no restriction on how we do it. My major concern is whether we want to have both behaviors or whether one of them is enough (indeed, the former can be obtained by just writing df[rows, col] = something without using views, so maybe this is not needed).

But in general - if this is the case - then we have all we want with SubDataFrame.

nalimilan commented 4 years ago

Yeah it could, but I think it would be more consistent with DataFrame to replace the column.

bkamins commented 4 years ago

Then such a SubDataFrame would be exactly the WhereDataFrame that @matthieugomez wants.

@matthieugomez - do you see any differences?

bkamins commented 2 years ago

Closing this, as in the 1.3 release we allow adding columns to a SubDataFrame if it does not subset columns. @pdeffebach / @matthieugomez: please reopen if I missed something.
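
For reference, a minimal sketch of the behavior this enables (assuming DataFrames.jl >= 1.3):

using DataFrames

df = DataFrame(a = 1:4)
sdf = view(df, 1:2, :)      # a view that does not subset columns
sdf[:, :d] = [1.0, 2.0]     # now allowed: adds :d to the parent
df.d                        # 1.0, 2.0, missing, missing (eltype Union{Missing, Float64})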