JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Add shuffle, shuffle! functions #2048

Closed rana closed 2 years ago

rana commented 4 years ago

Hi,

It would be helpful to have shuffle and shuffle! functions in DataFrames. They are useful for randomizing machine learning mini-batches.

What do you think?

bkamins commented 4 years ago

Now you can do shuffle via df[shuffle(axes(df, 1)), :] but I agree we could add it.

@nalimilan - given that we have settled on treating a DataFrame as a collection of rows, I think it is OK to add it. If you agree, I can make a PR.

rana commented 4 years ago

Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.

bkamins commented 4 years ago

A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).
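For readers who want to try the indexing pattern above, here is a minimal sketch. The toy data frame, its column names, and the seed are assumptions made for illustration; shuffle and randperm come from the Random standard library, which has to be loaded explicitly.

```julia
using DataFrames, Random

# Hypothetical toy data, not taken from the thread
df = DataFrame(id = 1:5, x = ["a", "b", "c", "d", "e"])

rng = MersenneTwister(42)  # arbitrary seed, only to make runs repeatable

# Index with a shuffled vector of row indices; this allocates a new data frame
shuffled = df[shuffle(rng, axes(df, 1)), :]

# The same idea with an explicit random permutation of 1:nrow(df)
shuffled2 = df[randperm(rng, nrow(df)), :]
```

Both forms leave df untouched and keep each row intact; only the row order changes.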

rana commented 4 years ago

Maybe also consider offering column shuffling?

shuffle(;cols=false)

shuffle!(;cols=false)

bkamins commented 4 years ago

We treat DataFrame as row oriented, so I would not implement column shuffling directly, rather this:

select(df, randperm(ncol(df)))

or this:

df[:, randperm(ncol(df))]

should be used
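A small sketch of the two column-permutation forms above, assuming a made-up three-column data frame (the column names a, b, c are illustrative only):

```julia
using DataFrames, Random

# Hypothetical toy data, not taken from the thread
df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

perm = randperm(ncol(df))     # random ordering of the column positions

by_select = select(df, perm)  # new data frame with columns in the permuted order
by_index  = df[:, perm]       # same result via indexing
```

Reusing a single perm for both calls makes the two results directly comparable; calling randperm twice would generally give two different orderings.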

nalimilan commented 4 years ago

Reminds me of a similar discussion about sample. Maybe better leave this for post-1.0.

Shuffling columns doesn't sound too common, is it?

bkamins commented 4 years ago

Also, another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :].

An in-place operation is more challenging and will require a careful design.

OK - leaving this decision post 1.0 (mostly because it is easy to do this without this function).

rana commented 4 years ago

I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.

bkamins commented 4 years ago

Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.

In general - as we try to look at DataFrame as a collection of rows now I would be OK with adding shuffle and sample to it now. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent) so I prefer to delegate the final word to him 😄.

mahiki commented 3 years ago

I'd like to add a use case that is common in my work, for grouped data frames. I want to shuffle the groups, each of which in my case is an item with a time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).

Maybe there is a similarly simple way to shuffle the grouped df.

The following process demonstrates the steps I'm currently taking:

df = DataFrame(time = [1, 2, 1, 2, 1, 2]
    , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
    , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 1     │ 19.0    │ B001   │
│ 2   │ 2     │ 11.0    │ B001   │
│ 3   │ 1     │ 35.5    │ B020   │
│ 4   │ 2     │ 32.5    │ B020   │
│ 5   │ 1     │ 5.99    │ BX00   │
│ 6   │ 2     │ 5.99    │ BX00   │

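using DataFrames, Random, StatsBase, Pipe  # Random provides shuffle; StatsBase provides denserank
# Tag every group with one random number, sort by it, and rank the groups
res = @pipe df |> groupby(_, :item) |>
         combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
         sort(_, :rando) |>
         transform(_, :rando => denserank => :rnk_rnd)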

6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │ 0.241881 │ 1       │
│ 2   │ BX00   │ 2     │ 5.99    │ 0.241881 │ 1       │
│ 3   │ B001   │ 1     │ 19.0    │ 0.292468 │ 2       │
│ 4   │ B001   │ 2     │ 11.0    │ 0.292468 │ 2       │
│ 5   │ B020   │ 1     │ 35.5    │ 0.70816  │ 3       │
│ 6   │ B020   │ 2     │ 32.5    │ 0.70816  │ 3       │

# I only want the original columns
 @pipe filter(:rnk_rnd => <=(2), res)  |>
         select(_, :item, :time, :amt)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B001   │ 1     │ 19.0    │
│ 4   │ B001   │ 2     │ 11.0    │

mahiki commented 3 years ago

Got it:

# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
    _[shuffle(1:end)] |>
    combine(_[1:2], :)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 1     │ 5.99    │
│ 2   │ BX00   │ 2     │ 5.99    │
│ 3   │ B001   │ 1     │ 19.0    │
│ 4   │ B001   │ 2     │ 11.0    │

I guess I'll put it up on Stack Overflow.
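The same group-shuffling idea can also be written without Pipe. This is only a sketch reusing the example data from this comment; the variable names gdf, n, picked, and result are made up for illustration.

```julia
using DataFrames, Random

df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt  = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

gdf = groupby(df, :item)

n = 2                                      # number of groups to keep
picked = gdf[shuffle(1:length(gdf))[1:n]]  # n randomly chosen groups, in random order

result = combine(picked, :)                # flatten the selected groups into a data frame
```

Since every item has two rows here, result always has four rows, but which two items appear changes from run to run unless a seeded RNG is passed to shuffle.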

bkamins commented 3 years ago

Adding this and sample is planned, but after the 0.22 release, as it is non-breaking.