ramiromagno opened 2 years ago
After fiddling around, it seems that including

```r
n <- nrow(snp_pairs)
# `setalloccol` is needed because of `future.apply::future_lapply()`,
# otherwise https://github.com/Rdatatable/data.table/issues/5376.
data.table::setalloccol(snp_pairs, extra_cols * n)
```

inside the source code of the mapped function, i.e. `daeqtl_mapping()`, makes the `future_lapply()` call work, i.e. run without errors. However, it does not change the data table `snp_pairs_lst2` in place as `lapply()` does with `snp_pairs_lst1`.
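This can be illustrated with a minimal sketch (hypothetical data tables, not the original `snp_pairs` objects, and assuming a `multisession` backend):

```r
library(data.table)
library(future)
library(future.apply)
plan(multisession)

dt1 <- list(data.table(x = 1:3))
dt2 <- list(data.table(x = 1:3))

f <- function(dt) {
  # Restores the over-allocation lost when the object is serialized to a worker
  data.table::setalloccol(dt)
  data.table::set(dt, j = "y", value = dt$x * 2L)
  dt
}

invisible(lapply(dt1, f))        # dt1[[1]] gains column y (changed in place)
invisible(future_lapply(dt2, f)) # runs fine, but dt2[[1]] is left unchanged
names(dt1[[1]])  # "x" "y"
names(dt2[[1]])  # "x"
```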
I have no idea how the internals of future.apply work, but for parallel computing you basically have to copy the objects you want to modify to your worker nodes. This would at least explain the error:

> This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()).
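The copy can be simulated without any parallel machinery, since shipping an object to a worker amounts to a `serialize()`/`unserialize()` round trip; a sketch:

```r
library(data.table)
dt  <- data.table(x = 1:3)
dt2 <- unserialize(serialize(dt, NULL))  # what transferring dt to a worker does
set(dt2, j = "y", value = dt2$x * 2L)    # fails with the error quoted above
```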
Depending on how the serialization of future.apply works, there might be a way to provide custom serialization/deserialization, although I'm not sure that's really something future.apply wants to achieve.
That the in-place change does not work after fixing the setalloccol problem is also clear, since you are modifying the data.table on your worker nodes and have to write it back at some point.

Also related to #5269, which caters for the call to setalloccol.
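In other words, the write-back has to be explicit: return the modified copy from each worker and reassign the result. A sketch, with a hypothetical list `dt_lst` of data tables:

```r
library(data.table)
library(future.apply)

dt_lst <- future_lapply(dt_lst, function(dt) {
  data.table::setalloccol(dt)                     # re-allocate on the worker
  data.table::set(dt, j = "y", value = dt$x * 2L)
  dt                                              # return the modified copy
})
```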
Without a call to `setalloccol()`, I realize now that `truelength(x)` returns 0 inside the mapped function. Introducing a call to `setalloccol()` therein adds the right number of extra column slots needed for `set()` to work without problems.
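This loss of over-allocation is easy to observe directly; a sketch:

```r
library(data.table)
dt <- data.table(x = 1:3)
truelength(dt)   # > 1000 by default: spare column slots are pre-allocated

dt2 <- unserialize(serialize(dt, NULL))
truelength(dt2)  # 0: the over-allocation is not serialized

setalloccol(dt2)
truelength(dt2)  # spare slots restored, so set() can add columns again
```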
If future.apply requires a copy of the data in your session, then modification in place will naturally not be possible. Unless you can pass a reference to an object, I don't think there is a workaround for it. See the related issues https://github.com/Rdatatable/data.table/issues/3104 and https://github.com/Rdatatable/data.table/issues/1336.
Author of futureverse here: FWIW, any type of parallel backend can be used with futures, e.g. forked parallelization via the mclapply() framework, background R processes via a PSOCK cluster, a background R process via the callr package, etc. So, it's parallelization business as usual. This also means that one cannot make assumptions about running with shared memory or about what type of serialization is used.
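For reference, the backend is what `future::plan()` selects; a sketch of the options mentioned (the callr backend is supplied by the future.callr package):

```r
library(future)
plan(multicore)            # forked R processes (mclapply-style; not on Windows)
plan(multisession)         # background R sessions on a local PSOCK-like cluster
plan(future.callr::callr)  # background R processes managed by the callr package
```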
It sounds like the problem here is related to the general problem of serializing a data.table object and re-using it in another R process (concurrently or later in time).
May this issue be related to the fact that updating a data.table by reference using `:=` inside a `foreach` loop does not seem to work?
Yes, same problem if you run foreach in parallel. You can update a data.table in a parallel worker, but you cannot expect the update to be reflected in the main R session.
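The same demonstration works for foreach; a sketch assuming a doParallel backend:

```r
library(data.table)
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

dt <- data.table(x = 1:4)
res <- foreach(i = 1, .packages = "data.table") %dopar% {
  dt[, y := x * 2L]  # updates only this worker's copy of dt
  names(dt)
}
res[[1]]            # "x" "y" on the worker
"y" %in% names(dt)  # FALSE: the main session's dt is untouched
```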
Hi,

First of all, let me thank you for developing the amazing {data.table} package.

My case is that I have a list of data tables that I am trying to modify by reference with `data.table::set()` inside a loop using `future.apply::future_apply()` and `furrr::future_walk()`/`furrr::future_map()`. However, I am getting an error when using `future.apply::future_apply()` or `furrr::future_walk()`/`furrr::future_map()`. It works fine with `lapply()` though.

I am not sure the problem is with the {data.table} package itself... I will post this same issue in the {furrr} and {future.apply} issue trackers, and link it here.

The error is:

> This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()).
You will need to install {daeqtlr} first:
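Presumably something along these lines, assuming the package is installed from the issue author's GitHub repository (the `ramiromagno/daeqtlr` path is an assumption, not taken from the thread):

```r
# Assumed install route; "ramiromagno/daeqtlr" is a guess at the repository
install.packages("remotes")
remotes::install_github("ramiromagno/daeqtlr")
```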