future_walk() is essentially a wrapper around future_map() that then just returns NULL. So keep in mind that the internal future_map() call is going to collect all of the results from each worker and return them back to the main process. Since write_tsv() actually returns its input, you are returning every single final_df object back to the main process too (at the very least they can't be garbage collected).

Is it better if you do something like this?
future_walk(ListOfFilePath, ~{
  large_df_2 = read_tsv(.x)
  final_df = inner_join(large_df_1, large_df_2)
  write_tsv(final_df, paste0(.x, "processed"))
  NULL
})
Note to self: if that is the actual issue, I can probably fix it for the walk functions by adjusting this to be more like:
...furrr_fn_wrapper <- function(...) {
  !!expr_seed_update
  !!expr_progress_update
  out <- ...furrr_fn(...)
  !!expr_result
}
Where !!expr_result would be out if we aren't doing walk(), and NULL if we are, to ensure that we only pass a list of NULLs back to the main process. I would have to add an is_walk boolean argument, which should be straightforward.
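A rough sketch of that idea (hypothetical names, not furrr's actual internals; the seed/progress expressions are left out for brevity):

library(rlang)

# Hypothetical helper: build the wrapper expression, discarding results for
# walk variants so that only a list of NULLs travels back to the main process.
build_wrapper_expr <- function(is_walk) {
  expr_result <- if (is_walk) quote(NULL) else quote(out)

  expr(
    fn_wrapper <- function(...) {
      out <- fn(...)
      !!expr_result
    }
  )
}

build_wrapper_expr(is_walk = TRUE)   # wrapper ends in NULL
build_wrapper_expr(is_walk = FALSE)  # wrapper ends in out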
> Since write_tsv() actually returns its input, you are returning every single final_df object back to the main process too (at the very least they can't be garbage collected).
Oh, I didn't know that! This makes a lot of sense to me now. Thank you so much!!!
Actually, I still see RAM build-up in this setting. Any idea why?
library(tidyverse)
library(furrr)

plan(multicore, workers = 8, gc = TRUE)
options(future.globals.maxSize = 10 * 1024 ** 3)

# remove processed rows
future_map(1:10000, ~{
  df = read_tsv("large.tsv")
  df2 = mutate(df, new_col = 1)
  write_tsv(df2, "processed.tsv")
  return(NULL)
})
Can you provide a full reprex that also generates the large tsv file?
Hi, this is an example that reproduces my actual workflow. In real life, I have 10k different files to be read and processed.
library(tidyverse)
library(furrr)

plan(multicore, workers = 8, gc = TRUE)
options(future.globals.maxSize = 10 * 1024 ** 3)

system("wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/100240.gwas.imputed_v3.both_sexes.tsv.bgz -O gwas.tsv.bgz")
system("wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/annotations/variants.tsv.bgz -O variants.tsv.bgz")

df = read_tsv("variants.tsv.bgz", col_types = "ccdccc---d--d------------")

# remove processed rows
future_map(1:10000, ~{
  gwas = read_tsv("gwas.tsv.bgz")
  gwas2 = gwas |>
    inner_join(df, by = "variant")
  write_tsv(gwas2, "test.tsv")
  return(NULL)
})
You can see that the RAM usage quickly doubles and keeps building up.
Is this on Linux? Also, how much RAM do you actually have available on your computer?
Yes, it is. We have 251 GB, but this code will eventually occupy all the RAM and crash the machine (as it did when I ran it overnight last night). I think you can still get similar behavior with a smaller dataset.

Best, Albert
Oh and how many physical cores?
72; here I used 8.
I tried with 2 workers on my 32 GB machine and I can't reproduce.

It is very possible that you are having an issue with the combination of furrr's parallelism and readr/vroom's parallelism.

Are you aware that both read_tsv() and write_tsv() run in parallel, and that by default they use the same number of threads as the number of virtual cores that you have? This means that you are running over 8 cores using furrr, and within that each one of those cores is trying to read/write with 144 threads (72 physical * 2 = 144 virtual; try readr:::readr_threads() to confirm). This is probably overloading your system in some way.

You could try setting num_threads = 1 in both readr functions. That would probably help? We generally do not recommend mixing parallelism like this.
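Applied to the reprex above, that suggestion would look roughly like this (a sketch only, reusing the earlier file names and the in-memory df):

# Force readr to be single-threaded inside each furrr worker, so the only
# parallelism left is the 8 furrr workers themselves.
future_map(1:10000, ~{
  gwas = read_tsv("gwas.tsv.bgz", num_threads = 1)
  gwas2 = inner_join(gwas, df, by = "variant")
  write_tsv(gwas2, "test.tsv", num_threads = 1)
  return(NULL)
})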
Hi Davis, I just tried setting num_threads = 1 in all of the read_* and write_* functions. It seems that the RAM usage still builds up over time like before. Did you also use readr functions in your test?
This problem seems to be resolved by replacing the readr functions with base R functions. Still not sure what is happening under the hood.
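Roughly, the base-R version of the loop body looks like this (a sketch; read.delim()/write.table() are assumed here, since the exact replacements aren't shown, and they rely on the inputs being gzip-compatible):

# Same loop, but with base R readers/writers instead of readr
# (read.delim()/write.table() are assumed replacements).
future_map(1:10000, ~{
  gwas = read.delim("gwas.tsv.bgz", stringsAsFactors = FALSE)
  gwas2 = merge(gwas, df, by = "variant")  # base-R inner join
  write.table(gwas2, "test.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
  return(NULL)
})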
Oh really? That's interesting.

> Did you also use readr functions in your test?

I did, and tried multiple combinations of num_threads, but couldn't really reproduce on my end.
What version of readr are you using? And what version of vroom?
I was using readr 2.0.1 and vroom 1.5.7. Maybe it is related to lazy reading behavior? But I'm pretty happy that base R functions do not have this problem :)
Oh yeah, you have an older version of readr. In readr 2.1.0 we switched back to reading eagerly by default, so maybe that is why I can't reproduce.

Can you try either:
1. Upgrading to the latest readr, or
2. Setting lazy = FALSE explicitly (and num_threads = 1)?

Even if 2 doesn't solve the problem, I'd love to hear whether upgrading readr helps, because there have been a number of fixes between your version and now.
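A quick sketch of option 2 applied to the reprex (lazy and num_threads are both arguments of read_tsv() in readr 2.x; packageVersion() is only there to confirm the installed version):

# Confirm the installed readr version (2.1.0+ reads eagerly by default).
packageVersion("readr")

# Option 2: read eagerly with a single thread.
gwas = read_tsv("gwas.tsv.bgz", lazy = FALSE, num_threads = 1)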
It would also be very useful to know which step has the potential memory leak, i.e. does the memory grow if you do this?
future_map(1:10000, ~{
  gwas = read_tsv("gwas.tsv.bgz")
  return(NULL)
})
And if you replace future_map() with map(), does the memory usage still grow? That would 100% rule out furrr, which I'm already pretty confident we can do since the base R reading functions work.
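For example, a sequential version of the read-only check (a sketch; purrr's map() runs everything in the main R process, so furrr and future are out of the picture entirely):

library(purrr)

# Same loop body, run sequentially in the main process.
map(1:10000, ~{
  gwas = read_tsv("gwas.tsv.bgz")
  return(NULL)
})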
Hi,

I'm using future_walk() for a simple task: joining a list of large tables with a large table that is already in memory.

At the beginning, the RAM usage is around 50 GB, which is acceptable. But over time, the RAM usage gradually increases to ~250 GB and eventually fills all of the RAM. Now I have to manually stop the process after it has run for a while and restart it to avoid a crash. I'm wondering what may cause this and how I can prevent it?

Thank you!