This revealed a separate bug, which I've captured in https://github.com/DavisVaughan/furrr/issues/113. You might try using `future.apply::future_lapply()` in the meantime to see if that fixes it.
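For reference, a minimal sketch of what that might look like with the names from this issue (`analyze_window()` is a hypothetical stand-in for the per-window test, and the subsetting should be adjusted to the actual structure of `large_data`):

```r
library(future.apply)  # attaches future as well, so plan() is available
plan(multicore, workers = 24)

# `window_list` holds index vectors into `large_data`; `analyze_window()`
# is a hypothetical per-window test function.
results <- future_lapply(window_list, function(idx) {
  analyze_window(large_data[idx, ])
})
```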
However, I generally think what you are describing here is expected behavior. `future.globals.maxSize` should be checked on each element of your `window_list`. If any one of those individual elements is larger than `future.globals.maxSize`, you should get an error even on multicore. I imagine this should happen for consistency across all future backends. Does this sound right @HenrikBengtsson?
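As a quick sanity check, the approximate size of each element that would be exported can be inspected directly (a sketch, using the object names from this issue):

```r
# Approximate in-memory size of each element of window_list, in bytes.
element_sizes <- vapply(window_list, function(x) as.numeric(object.size(x)), numeric(1))
max(element_sizes)                                  # largest single element
getOption("future.globals.maxSize", 500 * 1024^2)   # current limit (default ~500 MiB)
```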
Thank you for your response. You mention that `future.globals.maxSize` is checked on each element of `window_list` and that any one of these elements being larger than `future.globals.maxSize` could lead to this error. However, in my case `window_list` is only a few MB, and the large dataset is passed in the variable `large_data`; `window_list` only provides indices to extract from `large_data`. No extracted portion of `large_data` is larger than a few MB either. Please let me know if this changes your assessment. Anyway, I'll try using `future.apply::future_lapply()` and see if that works.
> However, I generally think what you are describing here is expected behavior. `future.globals.maxSize` should be checked on each element of your `window_list`. If any one of those individual elements is larger than `future.globals.maxSize`, you should get an error even on multicore. I imagine this should happen for consistency across all future backends. Does this sound right @HenrikBengtsson?
Yes, the protection against exporting too-large objects should be applied per processed element, i.e. scaled by the number of elements handled per worker (as in https://github.com/HenrikBengtsson/future.apply/blob/develop/R/future_xapply.R#L171-L181).
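In rough terms (a sketch of the idea, not the exact future.apply internals), that scaling looks like this:

```r
# Per-chunk scaling of the export-size guard: the configured limit is
# multiplied by the number of elements in the chunk sent to a worker.
maxsize_per_element <- getOption("future.globals.maxSize", 500 * 1024^2)
chunk <- window_list[1:10]   # hypothetical chunk of 10 elements handled by one worker
options(future.globals.maxSize = length(chunk) * maxsize_per_element)
```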
FYI, it's on my not-too-distant roadmap to create a 'future.{chunks,mapreduce,...}' package that will provide a common API to serve futurized map-reduce packages like future.apply, furrr, doFuture, and so on. That should help harmonize behaviors like this one.
Oh, very nice! I'll finally be working on furrr again in the nearish future as well, so I'll keep that in mind.
Nevertheless, @naglemi, I don't think that changes my answer. For consistency between backends, `large_data` being larger than `future.globals.maxSize` should throw an error (that is really a future question, not a furrr one).
Thank you, @DavisVaughan. In this case it sounds like my best option, if using furrr or future.apply, is to create a list containing the desired overlapping portions of `large_data` and use this as the first (`.x`/`X`) argument. I was hoping to avoid this because of the redundancy that will exist in this data structure due to overlap, but I suppose it may not be easily avoidable.
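Something along these lines, for example (a sketch using the names from this thread; `my_test()` is a hypothetical per-window function, and the subsetting should be adjusted to the actual structure of `large_data`):

```r
# Pre-extract each overlapping window so only the small pieces, not all of
# `large_data`, are exported to the workers.
window_data <- lapply(window_list, function(idx) large_data[idx, ])
results <- furrr::future_map(window_data, my_test)
```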
This memory overload is also happening with the equivalent `future.apply::future_lapply()` call, so I think I need to pre-allocate the data structure either way.
I think I ran into something similar: a huge initial dataset in which each nested item is small. Running `future_map` on each small dataset, I expected the memory requirement to be small, but it quickly used a lot of memory (seemingly multiplying the size of the huge dataset by the number of cores?). Is this expected behavior? If not, I am happy to provide a reprex to clarify the point.
`future.globals.maxSize` now scales according to the chunk size, which was the issue adjacent to this one.
I think you'll need to break up that large object so only the relevant pieces get exported to the workers. Exporting that large object is generally not a good thing to do anyway, because it is going to be extremely slow. That is part of the reason the `future.globals.maxSize` option is in place.
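For reference, that option can be raised explicitly if exporting a large object is truly unavoidable (a sketch; this only lifts the guard, it does not avoid the copy):

```r
# Raise the export-size limit to ~12 GiB (value in bytes). Each worker will
# still receive its own copy of any exported object.
options(future.globals.maxSize = 12 * 1024^3)
```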
Sorry for the trivial question, but now I have a doubt. If I have a tibble with 10 rows and one column, where each element is 100 MB, the tibble is therefore 1 GB. If I do
```r
plan(multisession, workers = 10)

my_tibble |>
  mutate(new_column = furrr::future_map(old_column, my_function))
```
will each worker load 100 MB, or will each worker load 1 GB? Would `.env_globals = environment()` help? Does the environment mainly import packages? Why would I need the local environment if all my variables are locally defined and don't depend on global variables?
First, thank you for making the furrr library available!
I am attempting a parallel task in which a large data object of 11 GB is broken down into pieces and different pieces are analyzed by different cores. I'm working with 24 cores, but not enough memory to make 24 copies of the object. My understanding was that since I specify `plan('multicore')`, the cores should use shared memory rather than copying the objects. Is this expected behavior, or is there something wrong? I apologize if it's the former and I misunderstand how multicore mode is supposed to work. How can I use `future_map` for my task without copying the large object?
I am running CentOS 7. Below is the relevant code and the resulting error, showing that the system is attempting to duplicate the object rather than keep it in shared memory.
To clarify how the above function works, the vector `window_list` is used to break `large_data` down into overlapping windows, which are tested individually.
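(The original code and error output are not reproduced in this thread. Purely as a hypothetical illustration of the setup described, with made-up names and window sizes, the pattern might look like:)

```r
library(future)
library(furrr)
plan(multicore, workers = 24)

# Hypothetical illustration only (not the original code): overlapping windows
# defined by index vectors into `large_data` (assumed row-indexable), each
# window tested separately by a hypothetical test_window() function.
window_size <- 1000L
step        <- 500L   # 50% overlap between consecutive windows
starts      <- seq(1L, nrow(large_data) - window_size + 1L, by = step)
window_list <- lapply(starts, function(s) s:(s + window_size - 1L))

results <- future_map(window_list, ~ test_window(large_data[.x, ]))
```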