Closed by rana 2 years ago
For now you can shuffle rows via df[shuffle(axes(df, 1)), :], but I agree we could add it.
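For reference, the indexing pattern above can be sketched as follows. The example table and the seed are made up for illustration; the seeded RNG is only there to make the result reproducible.

```julia
using DataFrames, Random

# Illustrative data, not from the thread.
df = DataFrame(id = 1:5, x = [10, 20, 30, 40, 50])

# Seeded RNG so the shuffle is reproducible (the seed is arbitrary).
rng = MersenneTwister(1234)

# Shuffle the row indices, then index the data frame with them.
# This returns a new data frame; df itself is left untouched.
shuffled = df[shuffle(rng, axes(df, 1)), :]
```

Because whole rows are reordered, the values within each row stay together.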
@nalimilan - given that we have settled on treating a DataFrame as a collection of rows, I think it is OK to add it. If you agree, then I can make a PR.
Thanks, I didn't know about df[shuffle(axes(df, 1)), :]. I will start using that in the meantime.
A bit less efficient (but more aesthetic) way to do it is DataFrame(shuffle(eachrow(df))).
Maybe also consider offering column shuffling?
shuffle(;cols=false)
shuffle!(;cols=false)
We treat DataFrame as row-oriented, so I would not implement column shuffling directly. Instead, either this:
select(df, randperm(ncol(df)))
or this:
df[:, randperm(ncol(df))]
should be used.
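A small sketch of the two column-shuffling patterns, applying the same permutation to both so they can be compared. The table is made up for illustration.

```julia
using DataFrames, Random

# Illustrative data, not from the thread.
df = DataFrame(a = 1:3, b = 4:6, c = 7:9)

# One random permutation of the column positions.
perm = randperm(ncol(df))

# Both forms return a new data frame with the columns in permuted order.
by_select = select(df, perm)
by_index = df[:, perm]
```

With the same permutation, the two results should hold the same columns in the same order.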
Reminds me of a similar discussion about sample. Maybe it is better to leave this for post-1.0.
Shuffling columns doesn't sound too common, does it?
Another pattern that can be used to shuffle rows is df[randperm(nrow(df)), :] (randperm comes from the Random standard library).
An in-place operation is more challenging and will require a careful design.
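To illustrate one reason an in-place version needs care, here is a rough sketch of a hypothetical helper (shuffle_rows! is not part of DataFrames.jl, and this is not a proposed design). It assumes a plain DataFrame, not a view, and it has to guard against the same vector being stored in more than one column, which would otherwise be permuted twice.

```julia
using DataFrames, Random

# Hypothetical helper for illustration only; not a DataFrames.jl function.
function shuffle_rows!(rng::AbstractRNG, df::DataFrame)
    perm = randperm(rng, nrow(df))
    done = Set{UInt}()
    for col in eachcol(df)
        id = objectid(col)
        # Skip a vector that is stored in more than one column,
        # so that it is not permuted twice.
        id in done && continue
        push!(done, id)
        # col[perm] allocates a temporary copy, which is then written back
        # into the existing column vector.
        col .= col[perm]
    end
    return df
end

# Illustrative data, not from the thread.
df = DataFrame(id = 1:6, x = 10 .* (1:6))
shuffle_rows!(MersenneTwister(1), df)
```

This sketch still allocates one temporary vector per column and ignores views and other wrappers, so it only hints at the design questions involved.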
OK - leaving this decision post 1.0 (mostly because it is easy to do this without this function).
I haven't seen many column permutation examples, though I use it in my work. Appreciate the pointer on how to do it. When I'm deep in a language it is obvious. In this case I'm in multiple languages and frameworks and looking for convenience functions.
Sure. I guess the point of @nalimilan is that we want to move towards 1.0 pretty soon.
In general - as we now try to look at DataFrame as a collection of rows, I would be OK with adding shuffle and sample to it now. But @nalimilan is a kind of "ecosystem curator" (as it has to be consistent), so I prefer to delegate the final word to him 😄.
I'd like to add a use case that is common in my work, for grouped data frames. I want to shuffle the groups, which in my case consist of groups of items with time series of transactions. Then I want to take the first N groups after the shuffle (i.e. randomly select N groups).
Maybe there is a similarly simple way to shuffle the grouped df?
The following process demonstrates the steps I'm currently taking:
using DataFrames
df = DataFrame(time = [1, 2, 1, 2, 1, 2],
               amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])
6×3 DataFrame
│ Row │ time │ amt │ item │
│ │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1 │ 1 │ 19.0 │ B001 │
│ 2 │ 2 │ 11.0 │ B001 │
│ 3 │ 1 │ 35.5 │ B020 │
│ 4 │ 2 │ 32.5 │ B020 │
│ 5 │ 1 │ 5.99 │ BX00 │
│ 6 │ 2 │ 5.99 │ BX00 │
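using StatsBase, Pipe
# assign one random number per group, sort by it, and rank the groups;
# the result is stored in res because the next step uses it
res = @pipe df |> groupby(_, :item) |>
combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
sort(_, :rando) |>
transform(_, :rando => denserank => :rnk_rnd)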
6×5 DataFrame
│ Row │ item │ time │ amt │ rando │ rnk_rnd │
│ │ String │ Int64 │ Float64 │ Float64 │ Int64 │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1 │ BX00 │ 1 │ 5.99 │ 0.241881 │ 1 │
│ 2 │ BX00 │ 2 │ 5.99 │ 0.241881 │ 1 │
│ 3 │ B001 │ 1 │ 19.0 │ 0.292468 │ 2 │
│ 4 │ B001 │ 2 │ 11.0 │ 0.292468 │ 2 │
│ 5 │ B020 │ 1 │ 35.5 │ 0.70816 │ 3 │
│ 6 │ B020 │ 2 │ 32.5 │ 0.70816 │ 3 │
# I only want the original columns
@pipe filter(:rnk_rnd => <=(2), res) |>
select(_, :item, :time, :amt)
4×3 DataFrame
│ Row │ item │ time │ amt │
│ │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1 │ BX00 │ 1 │ 5.99 │
│ 2 │ BX00 │ 2 │ 5.99 │
│ 3 │ B020 │ 1 │ 35.5 │
│ 4 │ B020 │ 2 │ 32.5 │
Got it:
using Random # provides shuffle
# take the first 2 shuffled groups
@pipe df |> groupby(_, :item) |>
_[shuffle(1:end)] |>
combine(_[1:2], :)
4×3 DataFrame
│ Row │ item │ time │ amt │
│ │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1 │ BX00 │ 1 │ 5.99 │
│ 2 │ BX00 │ 2 │ 5.99 │
│ 3 │ B001 │ 1 │ 19.0 │
│ 4 │ B001 │ 2 │ 11.0 │
I guess I'll put it up on Stack Overflow.
Adding this and sample is planned, but after the 0.22 release, as it is non-breaking.
Hi,
It would be helpful to have shuffle and shuffle! functions in DataFrames. They are useful for randomizing machine learning mini-batches.
What do you think?
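Until such functions exist, the mini-batch use case can be approximated with the row-indexing pattern from earlier in the thread. A minimal sketch, with made-up data and an arbitrary batch size:

```julia
using DataFrames, Random

# Illustrative data, not from the thread.
df = DataFrame(id = 1:10, y = rand(10))

batch_size = 4  # arbitrary choice for the example

# Shuffle the row indices once, then cut them into consecutive chunks.
perm = shuffle(axes(df, 1))
batches = [df[idx, :] for idx in Iterators.partition(perm, batch_size)]
```

Every row lands in exactly one batch; the last batch is smaller when the number of rows is not a multiple of batch_size.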