bnicenboim / eeguana

A package for manipulating EEG data in R.
https://bnicenboim.github.io/eeguana/

tidytable design discussion #139

Closed markfairbanks closed 2 years ago

markfairbanks commented 4 years ago

I figured this might be an easier place for us to talk through any of your tidytable questions (it's a bit easier to share code snippets here).

Feel free to close the issue, but I figure we can keep discussing here either way.

why are you using your own shallow() function instead of just data.table::copy()? What's the advantage?

In R there are two types of copies - "shallow" copies and "deep" copies.

Deep copies copy/duplicate the entire object - this is what data.table::copy() does.

Shallow copies copy/preserve object structure. For data.tables this means it preserves the number of columns and the column names. However the columns in the data.table can still be overwritten.

In base R or the tidyverse you never have to worry about the difference between the two - they always "copy-on-modify" by applying whichever one is necessary depending on your command.
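
To make that distinction concrete, here's a minimal sketch contrasting base R's copy-on-modify with data.table's modify-by-reference:

library(data.table)

df <- data.frame(x = 1:3)
add_col_df <- function(d) { d$y <- d$x * 2; d }  # base R: d is copied on modify
invisible(add_col_df(df))
names(df)  # still just "x" - the caller's data frame was untouched

dt <- data.table(x = 1:3)
add_col_dt <- function(d) d[, y := x * 2]        # data.table: := modifies by reference
invisible(add_col_dt(dt))
names(dt)  # "x" "y" - the caller's data.table WAS modified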

Deep copies are a bit easier to reason about: they always prevent "modify-by-reference" of the underlying object, whereas shallow copies only sometimes do.
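
One way to see the difference directly is to compare the memory addresses of the columns - a small sketch using data.table's address() helper (and tidytable's shallow(), introduced properly below):

library(data.table)

shallow <- tidytable:::shallow

dt <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

# A deep copy duplicates the column vectors, so they end up at new addresses
deep <- copy(dt)
address(dt$x) == address(deep$x)  # FALSE

# A shallow copy only duplicates the list of column pointers,
# so the column vectors themselves are still shared
shal <- shallow(dt)
address(dt$x) == address(shal$x)  # TRUE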

Seems like shallow copies aren't worth it - but they're much faster:

library(data.table)

shallow <- tidytable:::shallow

data_size <- 10000000
large_df <- data.table(x = sample(1:5, data_size, TRUE),
                       y = sample(c("a", "b", "c"), data_size, TRUE))

bench::mark(shallow_copy = shallow(large_df),
            deep_copy = copy(large_df),
            time_unit = "ms")
#> # A tibble: 2 x 6
#>   expression       min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 shallow_copy  0.0979  0.124    8131.     1.51MB     17.0
#> 2 deep_copy    51.7    54.8        18.3  114.46MB     45.7

But as I mentioned above shallow copies only preserve the structure of your original data.table - namely the number of columns and the column names.

So what does this actually mean in practice? Through a lot of trial and error these are the examples I found:

Test 1 - Adding a new column

library(data.table)

shallow <- tidytable:::shallow

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, double_x := x * 2][]
#>    x y double_x
#> 1: 1 a        2
#> 2: 2 b        4
#> 3: 3 c        6

test_df
#>    x y
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c

See how the original test_df remains unchanged? That occurs because shallow copies are preserving the original structure of the data frame - 2 columns named "x" and "y".

Naturally we can use normal assignment to overwrite test_df:

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

test_df <- shallow(test_df)[, double_x := x * 2][]

test_df
#>    x y double_x
#> 1: 1 a        2
#> 2: 2 b        4
#> 3: 3 c        6

I mentioned it also preserves column names - so let's try to use setnames() to rename the columns:

Test 2 - Renaming columns

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

setnames(shallow(test_df), old = "x", new = "new_x")[]
#>    new_x y
#> 1:     1 a
#> 2:     2 b
#> 3:     3 c

test_df
#>    x y
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c

As you can see the original test_df still has columns named "x" and "y".

So those are the two core examples. A shallow copy is really useful for these two situations as it prevents modify-by-reference and is much faster than data.table::copy().

Moving on from the core examples, here's an area where shallow copies sometimes work and sometimes don't - overwriting an existing column:

Test 3 - Overwriting an existing column with a newly calculated value

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, x := x * 2][]
#>    x y
#> 1: 2 a
#> 2: 4 b
#> 3: 6 c

test_df
#>    x y
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c

So that seems to work just like dplyr::mutate()! But here's a situation where it fails:

Test 4 - Overwriting an existing column with a single value

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, x := 1][]
#>    x y
#> 1: 1 a
#> 2: 1 b
#> 3: 1 c

test_df
#>    x y
#> 1: 1 a
#> 2: 1 b
#> 3: 1 c

You can see that test_df$x has now been overwritten. So as handy as shallow copies are, they don't work in every situation.

There's a workaround to "Test 4" so you can still use shallow. It's what tidytable::mutate() uses. If you're curious about it let me know. But in general I think a good rule of thumb is to use deep copies unless you are 100% sure a shallow copy will work.

Random side note - rlang has its own version of data.table::copy() that has the ability to do both deep and shallow copies:

# deep copy
rlang::duplicate(test_df)

# shallow copy
rlang::duplicate(test_df, shallow = TRUE)

And the last thing to mention - the shallow() function I use was actually taken/borrowed from an old package. You can find the code for it in this file: https://github.com/openanalytics/gread/blob/master/R/internal-funs.R. If you decide you want to use it, feel free to copy it from tidytable or from that other package.

Hope this helps! Sorry this ended up longer than I intended. If you have any other questions let me know.

bnicenboim commented 4 years ago

This is a great answer, thanks! A couple more questions:

You can see that test_df$x has now been overwritten. So as handy as shallow copies are, they don't work in every situation.

Is there some logic on where they work and where they fail? Are there resources on that?

There's a workaround to "Test 4" so you can still use shallow. It's what tidytable::mutate() uses. If you're curious about it let me know. But in general I think a good rule of thumb is to use deep copies unless you are 100% sure a shallow copy will work.

Yes, sure! I saw that you were using vctrs, but I was trying to avoid that package. I got a bit fed up with all the recent changes in the tidyverse that made my package break. I'm trying to avoid everything but rlang (I don't think I can avoid that one). In any case, I would like to know the logic. I'm dealing with super large objects in my package, so anything that can speed things up is welcome.

Random side note - rlang has its own version of data.table::copy() (rlang::duplicate()) that has the ability to do both deep and shallow copies.

And the last thing to mention - the shallow() function I use was actually taken/borrowed from an old package. You can find the code for it in this file https://github.com/openanalytics/gread/blob/master/R/internal-funs.R. If you decide you want to use it feel free to copy it from tidytable/this other package.

And so why are you using your own version? (You're already using rlang for other stuff anyway.) Is your version different from duplicate() with shallow = TRUE?

Hope this helps! Sorry this ended up longer than I intended. If you have any other questions let me know.

Yes! It helped! Thanks!


markfairbanks commented 4 years ago

Is there some logic on where they work and where they fail? Are there resources on that?

Unfortunately I haven't found a good resource on this. What I've found is mostly trial and error.

Yes sure! I saw that you were using vctrs but I was trying to avoid this package.

So let's revisit Tests 3 & 4.

Test 3 - Overwriting an existing column with a newly calculated value
Test 4 - Overwriting an existing column with a single value

Test 3 - Overwriting an existing column with a newly calculated value

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, x := x * 2][]
#>    x y
#> 1: 2 a
#> 2: 4 b
#> 3: 6 c

test_df
#>    x y
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c

Test 4 - Overwriting an existing column with a single value

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, x := 1][]
#>    x y
#> 1: 1 a
#> 2: 1 b
#> 3: 1 c

test_df
#>    x y
#> 1: 1 a
#> 2: 1 b
#> 3: 1 c

So as long as you are using a "calculated value" shallow copying can avoid modify-by-reference. So how do you make x := 1 a "calculated value"? You use a function like vec_recycle().

What does vec_recycle() do? In the case of test_df - we have a 3 row data frame. But technically the value 1 is only length 1. So it "recycles" 1 and turns it into a vector of length 3:

library(vctrs)

vec_recycle(1, 3)
#> [1] 1 1 1
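
Applied to Test 4 (continuing the session above), the idea looks something like this - a sketch, not tidytable's exact code:

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

# vec_recycle() returns a brand-new length-.N vector, so := fills the shallow
# copy with a fresh column instead of editing the shared one in place
shallow(test_df)[, x := vec_recycle(1, .N)][]

test_df  # x is still 1, 2, 3 - the original is untouched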

So that idea brings us to Test 5. We'll build our own recycle() so that you can avoid using vctrs.

Test 5 - Overwriting an existing column with a recycled value

recycle <- function(x, size) {
  x_length <- length(x)

  if (x_length != 1 && x_length != size)
    stop(paste0("x must have length 1 or length ", size))

  if (x_length == 1) x <- rep(x, size)

  x
}

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

shallow(test_df)[, x := recycle(1, .N)][]
#>    x y
#> 1: 1 a
#> 2: 1 b
#> 3: 1 c

test_df
#>    x y
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c

And now test_df$x remains unmodified.

If you run into any more questions about the tidytable::mutate.() code, let me know. You should be able to rebuild it completely without using vctrs.
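
Putting the pieces together, here's a rough, hypothetical sketch of a vctrs-free mutate (definitely not tidytable's actual implementation - it skips NSE, grouping, and all the edge cases) built from shallow() and the recycle() helper from Test 5:

mutate_sketch <- function(.df, ...) {
  dots <- list(...)  # values are evaluated eagerly here, unlike a real mutate
  out <- shallow(.df)
  for (col in names(dots)) {
    # recycling to .N guarantees := assigns a fresh full-length vector,
    # so the original columns are never modified by reference
    out[, (col) := recycle(dots[[col]], .N)]
  }
  out[]
}

test_df <- data.table(x = c(1,2,3), y = c("a", "b", "c"))

mutate_sketch(test_df, x = 1, double_x = test_df$x * 2)

test_df  # x is still 1, 2, 3 - the original was left alone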

And so why are you using your own version? (You're already using rlang for other stuff anyway). Is your version different from duplicate with shallow =TRUE?

I didn't know rlang::duplicate() existed until a couple months ago 😂

I once tried to switch over to rlang::duplicate() and a bunch of my unit tests started failing. I haven't had the time to figure out why though. There is some difference, I'm just not positive what it is yet.

bnicenboim commented 4 years ago

This is great, thanks! I'll try to implement shallow() for my objects and see what happens.

bnicenboim commented 3 years ago

Hi, I've been experimenting a bit and I've noticed that data.table has a non-exported function called shallow that's about 8 times faster than your version! I've checked and the same caveats apply. (rlang::duplicate() with shallow = TRUE doesn't seem to be usable with data.tables.) Is there a reason not to use data.table:::shallow? (Besides the fact that it's not exported, but one can always just copy everything into the package.)


library(data.table)

shallow <- tidytable:::shallow

data_size <- 10000000
large_df <- data.table(x = sample(1:5, data_size, TRUE),
                       y = sample(c("a", "b", "c"), data_size, TRUE))

bench::mark(shallow_copy = shallow(large_df),
            shallow_dt = data.table:::shallow(large_df, cols = names(large_df)),
            deep_copy = copy(large_df),
            time_unit = "ms")
#> # A tibble: 3 x 13
#>   expression       min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time 
#>   <bch:expr>     <dbl>   <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list> <list> <lis>
#> 1 shallow_copy  0.0876  0.0967    9723.     16.1KB     8.40  4630     4       476. <data… <Rpro… <bch…
#> 2 shallow_dt    0.0143  0.0156   58215.     16.1KB     5.82  9999     1       172. <data… <Rpro… <bch…
#> 3 deep_copy    22.8    23.5         39.4   114.5MB    34.5      8     7       203. <data… <Rpro… <bch…
#> # … with 1 more variable: gc <list>

bnicenboim commented 3 years ago

And a second question: I noticed that this trick doesn't work with grouped data.tables. Do you understand what's going on? Is there a way to avoid copying the data.table?

markfairbanks commented 3 years ago

Is there a reason not to use data.table:::shallow?

Nope - if you'd rather use theirs you can. The one I use was just the one that I found first.

The biggest reason I haven't made the switch is that tidytable:::shallow() is a pretty straightforward R function, while data.table:::shallow() would require some extra effort since it also calls some of data.table's C functions.

And a second question: I noticed that this trick doesn't work with grouped data.tables. Do you understand what's going on? Is there a way to avoid copying the data.table?

I'm not sure why this fails, to be honest. I haven't been able to find a workaround. You'll see that in tidytable::mutate.() the code runs this check if .by is provided:

needs_copy <- any(vec_in(names(dots), names(.df)))

if (needs_copy) .df <- copy(.df)

Basically saying - if any of the columns in the mutate.() call already exist in the data.table, create a copy. This check is unnecessary if there is no .by.
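
For what it's worth, the same logic is easy to reproduce without vctrs - here's a hypothetical helper (not tidytable's actual code) using base %in% instead of vec_in():

library(data.table)

shallow <- tidytable:::shallow

# Deep copy only when a grouped call is about to overwrite an existing column;
# otherwise the cheap shallow copy is enough
copy_if_needed <- function(.df, new_cols, .by = NULL) {
  needs_copy <- !is.null(.by) && any(new_cols %in% names(.df))
  if (needs_copy) copy(.df) else shallow(.df)
}

test_df <- data.table(x = c(1,2,3), g = c("a", "a", "b"))

# Grouped overwrite of an existing column -> deep copy, so test_df stays safe
copy_if_needed(test_df, "x", .by = "g")[, x := mean(x), by = g][]

# Adding a brand-new column -> shallow copy is enough
copy_if_needed(test_df, "double_x")[, double_x := x * 2][]

test_df  # still just x and g, unchanged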