Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Error on modifying by reference with `data.table::set()` in the context of `future.apply::future_lapply()` or `furrr::future_map()` #5376

Open ramiromagno opened 2 years ago

ramiromagno commented 2 years ago

Hi,

First of all, let me thank you for developing the amazing {data.table} package.

My case is that I have a list of data tables that I am trying to modify by reference with data.table::set() inside a loop, using future.apply::future_lapply() and furrr::future_walk()/furrr::future_map().

However, I am getting an error when using future.apply::future_lapply() or furrr::future_walk()/furrr::future_map(). It works fine with lapply(), though.

I am not sure the problem is with the {data.table} package itself... I will post this same issue in the {furrr} and {future.apply} issue trackers and link it here.

The error is:

Error in data.table::set(snp_pairs, i = i, j = col, value = df[[col]]) : 
  This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.
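
The error can be reproduced without any parallel machinery: serializing a data.table, which is what happens when it is shipped to a parallel worker, drops both its over-allocated column slots and its internal self-reference. A minimal sketch (a same-session serialize()/unserialize() round trip standing in for the worker transfer):

```r
library(data.table)

dt <- data.table(x = 1:3)

# Simulate what a parallel worker receives: a serialized copy.
dt2 <- unserialize(serialize(dt, NULL))

# Assigning by reference to the copy fails, because the copy has no
# pre-allocated column slots (truelength() is 0):
tryCatch(
  set(dt2, j = "y", value = 4:6),
  error = function(e) conditionMessage(e)
)
```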

You will need to install {daeqtlr} first:

remotes::install_github("maialab/daeqtlr")

library(future.apply)
library(furrr)
library(daeqtlr)

plan(multisession)

snp_pairs <- read_snp_pairs(file = daeqtlr_example("snp_pairs.csv"))
zygosity <- read_snp_zygosity(file = daeqtlr_example("zygosity.csv"))
ae <- read_ae_ratios(file = daeqtlr_example("ae.csv"))

no_cores <- 6L
indices <- seq_len(nrow(snp_pairs))
partitioning_factor <- sort(indices %% no_cores) + 1
snp_pairs_lst1 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst2 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst3 <- split(snp_pairs, partitioning_factor)

for( i in seq_along(snp_pairs_lst1)) {
  data.table::setkeyv(snp_pairs_lst1[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst2[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst3[[i]], 'dae_snp')
}

# Runs fine without errors.
lapply(snp_pairs_lst1,
       FUN = daeqtl_mapping,
       zygosity = zygosity,
       ae = ae)

# Fails with error:
# 
# Error in data.table::set(snp_pairs, i = i, j = col, value =
# df[[col]]) : This data.table has either been loaded from disk (e.g. using
# readRDS()/load()) or constructed manually (e.g. using structure()). Please run
# setDT() or setalloccol() on it first (to pre-allocate space for new columns)
# before assigning by reference to it.
future_lapply(snp_pairs_lst2,
              FUN = daeqtl_mapping,
              zygosity = zygosity,
              ae = ae)

# Fails with the same error as `future_lapply`
# It won't work with `future_map` either.
future_walk(snp_pairs_lst3,
            .f = daeqtl_mapping,
            zygosity = zygosity,
            ae = ae)
ramiromagno commented 2 years ago

After fiddling around, it seems that including

  n <- nrow(snp_pairs)
  # `setalloccol` is needed because of `future.apply::future_lapply()`,
  # otherwise https://github.com/Rdatatable/data.table/issues/5376.
  data.table::setalloccol(snp_pairs, extra_cols*n)

inside the source code of the mapped function, i.e. daeqtl_mapping(), makes the future_lapply() call run without errors. However, it does not modify the data tables in snp_pairs_lst2 in place, as lapply() does with snp_pairs_lst1.

ben-schwen commented 2 years ago

I have no idea how the internals of future.apply work, but for parallel computing you basically have to copy the objects you want to modify to your worker nodes. This would at least explain the

This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()).

Depending on how the serialization of future.apply works there might be a way to provide custom serialization/deserialization although I'm not sure if that's really something future.apply wants to achieve.

That the in-place change does not work after fixing the setalloccol problem is also expected, since you are modifying the data.table on your worker nodes and have to write the results back at some point.
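
A sketch of that pattern, assuming the usual workaround of returning the modified tables from the workers and re-collecting them (toy data and column names, not the daeqtlr API):

```r
library(data.table)
library(future.apply)

plan(multisession, workers = 2)

lst <- split(data.table(id = 1:4, g = rep(1:2, each = 2)), by = "g")

res <- future_lapply(lst, function(dt) {
  setDT(dt)                                   # re-allocate the worker's copy
  set(dt, j = "doubled", value = dt$id * 2L)  # modify by reference on the worker
  dt                                          # return the modified copy
})

# The originals in `lst` are untouched in the main session;
# the updated tables have to be collected from `res`.
```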

ben-schwen commented 2 years ago

Also related to #5269 which caters for the call to setalloccol.

ramiromagno commented 2 years ago

Without a call to setalloccol(), I realize now that truelength(x) returns 0 inside the mapped function. Introducing a call to setalloccol() there adds the extra column slots needed for set() to work without problems.
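
That observation can be checked in a single session by simulating the worker's copy with serialize():

```r
library(data.table)

dt <- unserialize(serialize(data.table(x = 1:3), NULL))
truelength(dt)                 # 0: no spare column slots survived serialization

setalloccol(dt)                # pre-allocate column slots again
truelength(dt) > 0             # TRUE
set(dt, j = "y", value = 4:6)  # now works
```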

jangorecki commented 2 years ago

If future.apply requires a copy of the data in your session, then modification in place will naturally not be possible. Unless you can pass a reference to an object, I don't think there is a workaround for it. See related issues https://github.com/Rdatatable/data.table/issues/3104 and https://github.com/Rdatatable/data.table/issues/1336.

HenrikBengtsson commented 1 year ago

Author of the futureverse here: FWIW, any type of parallel backend can be used with futures, e.g. forked parallelization via the mclapply() framework, background R processes via a PSOCK cluster, background R processes via the callr package, etc. So, it's parallelization business as usual. This also means that one cannot make assumptions about running with shared memory or about what type of serialization is used.

It sounds like the problem here is related to the general problem of serializing a data.table object and re-using it in another R process (concurrently or later in time).

iago-pssjd commented 1 year ago

Might this issue be related to the fact that updating a data.table by reference using := inside a foreach loop does not seem to work?

HenrikBengtsson commented 1 year ago

Yes, same problem if you run foreach in parallel. You can update a data.table in a parallel worker, but you cannot expect the update to be reflected in the main R session.
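
The copy semantics are easy to see with a bare PSOCK cluster, the kind of worker that plan(multisession) and many foreach backends start (a minimal sketch, independent of future/foreach):

```r
library(data.table)
library(parallel)

dt <- data.table(x = 1:3)
cl <- makeCluster(1L)
clusterExport(cl, "dt", envir = environment())  # dt is serialized to the worker

# Repair the worker's copy and add a column by reference there:
ncols_on_worker <- clusterEvalQ(cl, {
  data.table::setDT(dt)
  data.table::set(dt, j = "y", value = 4:6)
  ncol(dt)
})[[1]]
stopCluster(cl)

ncols_on_worker  # 2: the worker's copy gained the column
ncol(dt)         # 1: the main session's dt is unchanged
```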