apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Memory usage in R blows up #31179

Closed (asfimport closed this issue 2 years ago)

asfimport commented 2 years ago

Hi,

I'm trying to load a ~10gb arrow file into R (under Windows)

(The file is generated in the 6.0.1 arrow version under Linux).

For whatever reason the memory usage blows up to ~110-120gb (in a fresh and empty R instance).

The weird thing is that after deleting the object and running gc(), the memory usage only goes down to ~90gb. The delta of ~20-30gb is what I would have expected the data frame to use in memory (that's also roughly what was used in total during the load under the old arrow version 0.15.1, and it's what R reports when I print the object size).

The commands I'm running are simply:

options(arrow.use_threads=FALSE);

arrow::set_cpu_count(1); # need this - otherwise it freezes under windows

arrow::read_arrow('file.arrow5')

Is arrow reserving some resources in the background and not giving them up again? Are there some settings I need to change for this?

Is this something that is known and fixed in a newer version?

Note that this doesn't happen in Linux. There all the resources are freed up when calling the gc() function - not sure if it matters but there I also don't need to set the cpu count to 1.

Any help would be appreciated.

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

Note: This issue was originally created as ARROW-15730. Please see the migration documentation for further details.

asfimport commented 2 years ago

Will Jones / @wjones127: Hi Christian,

That's very odd. Could you check these two things to help us identify the issue?

  1. Show us the output of arrow::arrow_info(). I'm particularly interested in checking which allocator/memory pool is being used. You can change that with the ARROW_DEFAULT_MEMORY_POOL environment variable.
  2. Does this issue happen if you pass as_data_frame = FALSE into arrow::read_arrow()? That would help us determine whether it's due to the file read or the conversion to an R data frame. (A minimal sketch of both checks follows.)
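For reference, a minimal sketch of both checks (the environment variable must be set before arrow is loaded, as noted later in this thread; the file name is taken from the report above):

``` r
# Check 1: pick the allocator before loading arrow, then confirm via arrow_info()
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")  # alternatives: "mimalloc", "jemalloc" (if built)
library(arrow)
arrow_info()$memory$backend_name

# Check 2: read as an Arrow Table, skipping the conversion to an R data frame
table <- arrow::read_arrow("file.arrow5", as_data_frame = FALSE)
```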
asfimport commented 2 years ago

Christian: Apologies for the late reply. I just checked it after a full computer restart and it is exactly the same problem. Interestingly, this time the full memory usage went to 90gb and then after deleting + gc() it got stuck at 60gb. So same problem, just slightly lower total numbers. It holds both in RStudio and in an R terminal.

Below the requested outputs. I also added what it shows on a gc() and what windows shows as resource usage.

This happens with as_data_frame=T (the default setting of read_arrow), since I don't need to make any changes to the df when loading it in.

And to reiterate - under Linux it frees up all resources after calling gc().

arrow::arrow_info()
Arrow package version: 6.0.1

Capabilities:

dataset    TRUE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli    FALSE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():

arrow.use_threads FALSE

Memory:

Allocator mimalloc
Current    0 bytes
Max       34.31 Gb

Runtime:

SIMD Level          avx512
Detected SIMD Level avx512

Build:

C++ Library Version     6.0.1
C++ Compiler            GNU
C++ Compiler Version    8.3.0
Git ID                  d132a740e33ec18c07b8718e15f85b4080a292ff

gc()
          used (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells 1792749 95.8    3428368   183.1    2914702   155.7
Vcells 4673226 35.7 2939373019 22425.7 3943230076 30084.5

ls()
character(0)

 

[Attachment: image-2022-02-19-09-05-32-278.png - screenshot of Windows resource usage]

 

 

asfimport commented 2 years ago

Christian: Also if you want me to test the problem with a specific arrow5 file I'm happy to do so.

asfimport commented 2 years ago

Christian: I did some more testing (all the reading is done within R and Arrow 6.0.1). It looks like there are a few things going on here:

1) I read a file that was written in Arrow 5 (the file is ~30gb and was written directly with the C#/C++ interface) - that one increases the memory usage to ~30-38gb. But then on gc() the memory usage only goes down to 8gb, i.e. it doesn't free up everything. I'm not sure why that is, but that's acceptable. The file only has chr/Date/num/int columns. Calling arrow_info yields the following (same result after loading/deleting the df):

Allocator mimalloc
Current    0 bytes
Max        0 bytes

2) Reading the file from last week (~10gb, written in Arrow 6.0.1 from R) yields the same result as last week. Note that here I also have factor/logical types, which arrow seems to store and read.

Allocator mimalloc
Current    4.19 Kb
Max       34.31 Gb

3) As a test I did a write_arrow on the file from 2), but with an unfactor applied to all the factor columns beforehand. Same issue as in 2), so it doesn't look like the factor type is the issue.

4) As a final test I read the file from 1) and did a write_arrow on it from R. The issue comes up again after reading it back in.

Before deletion:

Allocator mimalloc
Current    28.2 Gb
Max        28.2 Gb

After deletion:

Allocator mimalloc
Current    0 bytes
Max        28.2 Gb

 

So the issue seems to be with writing the arrow file from R. All I do is call write_arrow('file.arrow5'). Is there a problem with that?

asfimport commented 2 years ago

Christian: As one final test I wrote the arrow file in 4 different ways:

c("default", "lz4", "uncompressed", "zstd") %>% walk(~{   log_info(.x)   write_feather(     testdf,     glue('C:/Temp/dftest{.x}.arrow'),     compression = .x   ) })

 

It seems that only when writing it uncompressed does the memory issue not appear - then it behaves as expected.
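One way to make that comparison observable in code (a sketch, reusing the file paths from the snippet above):

``` r
# Read each compression variant back and check what stays in Arrow's pool (sketch)
library(arrow)
library(glue)

for (codec in c("default", "lz4", "uncompressed", "zstd")) {
  df <- read_feather(glue("C:/Temp/dftest{codec}.arrow"))
  rm(df)
  invisible(gc())
  cat(codec, "- Arrow pool still holds:", arrow_info()$memory$bytes_allocated, "bytes\n")
}
```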

asfimport commented 2 years ago

Will Jones / @wjones127: Hi Christian,

Did you do any tests with as_data_frame = FALSE?

There are two separate stores of memory: R's memory and Arrow's memory pool (which on Windows defaults to mimalloc). arrow_info() prints the stats for the Arrow memory pool; it looks like that one is freeing things correctly, right? When you call gc(), that reports on R's memory system, and it sounds like R isn't freeing memory there correctly. Testing with as_data_frame = FALSE would help confirm this.

If what I've said above is correct, it seems like there might be a bug either in R for Windows or in Arrow. If you could describe the data a little more (or even share a sample), perhaps I could reproduce it?
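To watch both stores at once, a small helper along these lines can be used (a sketch; Will posts a fuller version of this further down the thread):

``` r
# Inspect both memory stores side by side (sketch)
library(arrow)
arrow_info()$memory$bytes_allocated  # Arrow's memory pool (mimalloc by default on Windows)
gc()["Vcells", 2]                    # R's own heap usage in Mb (second column of gc())
```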

 

 

asfimport commented 2 years ago

Christian: I have not tried that yet but will do so soon.

As said reading a non-compressed version seems to have worked - not sure if that indicates what the issue might be.

Unfortunately I can't share the data set but given that I was able to reproduce it with 2 different data sets I don't think it has something to do with the data itself. If you have a reference arrow file you want me to download and that we both can test I'm very happy to do so.

asfimport commented 2 years ago

Will Jones / @wjones127: Christian, to be more specific, could you share the output of the following:


options(arrow.use_threads=FALSE);

arrow::set_cpu_count(1); # need this - otherwise it freezes under windows

table <- arrow::read_arrow('file.arrow5', as_data_frame = FALSE)

arrow_info()$memory
gc()
table$schema

 

asfimport commented 2 years ago

Christian: Yes that should be fine.

asfimport commented 2 years ago

Christian: Note that this is with a cut-down version. Here the size of the file is ~1gb, written with the "default" compression. It takes about 3-5gb when reading it into R, and the space that doesn't get freed up is ~7gb.

(Deleting table and running another gc() keeps the 7gb allocated.)

 

arrow_info()$memory
$backend_name
[1] "mimalloc"

$bytes_allocated
[1] 5379819648

$max_memory
[1] 5379819648

$available_backends
[1] "mimalloc" "system"

gc()
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 2625778 140.3    4439937 237.2  4439937 237.2
Vcells 7082576  54.1   12255594  93.6  9236142  70.5

table$schema
Schema
: int32 : date32[day] : string : string string date32[day] string string double double double int32 int32 string double string double double string double double string string double string bool date32[day] string string string string string string int32 int32 int32 int32 int32 int32 string int32 string int32 string string string string string string string string string string string string string string string string string double double int32 double date32[day] dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary

asfimport commented 2 years ago

Will Jones / @wjones127: Not able to reproduce yet, but I am noticing the memory usage reporting is a little confusing in Rstudio. To make sure we are on the same page, my test code is below.

  1. Does this align with how you are measuring memory use? Or are you measuring it somewhere else?

  2. Does this example data show the memory-leaking behavior you are seeing?

``` r
library(arrow)

print_memory <- function() {
  print(sprintf("Arrow: %s MB", trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
  print(sprintf("R: %s MB", gc()["Vcells", 2]))
}

# Create example data
size <- 1E8

my_table <- arrow_table(
  x = Array$create(sample(letters, size, replace = TRUE)),
  y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
  z = Array$create(as.Date(1:size, as.Date("2020-01-01"))),
  a = Array$create(1:size, type = int32())
)

arrow::write_arrow(my_table, "file.arrow5")
remove(my_table)

# Note: you may need to wait a few seconds for Arrow memory pool to free memory
print_memory()
#> [1] "Arrow: 0 MB"
#> [1] "R: 14.1 MB"

options(arrow.use_threads = FALSE)

arrow::set_cpu_count(1) # need this - otherwise it freezes under windows

table <- arrow::read_arrow('file.arrow5')
print_memory()
#> [1] "Arrow: 1335 MB"
#> [1] "R: 1158.5 MB"

remove(table)
print_memory()
#> [1] "Arrow: 0 MB"
#> [1] "R: 14.1 MB"
```

asfimport commented 2 years ago

Christian: Yes this is enough to reproduce the issue. I ran it.

asfimport commented 2 years ago

Jameel Alsalam: Hello, I think I have reproduced the issue here. About 1.5 GB appears to still be in use after the remove statement. I am on CRAN arrow 7.0.0. I was interested in this issue because I have been trying to diagnose a different arrow memory issue involving write_dataset. In my investigations, the memory reported internally by gc() or arrow is quite different from what Windows reports via, e.g., Task Manager. I have found a way to get Task Manager-like memory numbers by running system2("tasklist", stdout=TRUE) and then filtering for the right process. Below I ran your script with the additional memory info.

 

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

print_memory <- function() {
  print(sprintf("Arrow: %s MB", trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
  print(sprintf("R: %s MB", gc()["Vcells", 2]))
  print((function(t) t[grep(Sys.getpid(), t)])(system2("tasklist", stdout = TRUE)))
}

# Create example data
size <- 1E8

   print_memory()
#> [1] "Arrow: 0 MB"
#> [1] "R: 9.8 MB"
#> [1] "Rterm.exe                    19096 Console                    2    125,672 K"

my_table <- arrow_table(
  x = Array$create(sample(letters, size, replace = TRUE)),
  y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
  z = Array$create(as.Date(1:size, as.Date("2020-01-01"))),
  a = Array$create(1:size, type=int32())
)

arrow::write_arrow(my_table, "file.arrow5")
#> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
remove(my_table)

# Note: you may need to wait a few seconds for Arrow memory pool to free memory
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 953 MB"
#> [1] "R: 392.6 MB"
#> [1] "Rterm.exe                    19096 Console                    2    563,344 K"

options(arrow.use_threads=FALSE);

arrow::set_cpu_count(1); # need this - otherwise it freezes under windows

table <- arrow::read_arrow('file.arrow5')
#> Warning: Use 'read_ipc_stream' or 'read_feather' instead.
print_memory()
#> [1] "Arrow: 1335 MB"
#> [1] "R: 1156.2 MB"
#> [1] "Rterm.exe                    19096 Console                    2  2,709,252 K"

remove(table)
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 858 MB"
#> [1] "R: 11.8 MB"
#> [1] "Rterm.exe                    19096 Console                    2  1,534,436 K"

Created on 2022-02-22 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64 (build 19042)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2022-02-22
#>  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  ! package     * version date (UTC) lib source
#>    arrow       * 7.0.0   2022-02-10 [1] CRAN (R 4.0.5)
#>  P assertthat    0.2.1   2019-03-21 [?] CRAN (R 4.0.5)
#>  P backports     1.4.1   2021-12-13 [?] CRAN (R 4.0.5)
#>  P bit           4.0.4   2020-08-04 [?] CRAN (R 4.0.5)
#>  P bit64         4.0.5   2020-08-30 [?] CRAN (R 4.0.5)
#>  P cli           3.2.0   2022-02-14 [?] CRAN (R 4.0.5)
#>  P crayon        1.5.0   2022-02-14 [?] CRAN (R 4.0.5)
#>  P digest        0.6.29  2021-12-01 [?] CRAN (R 4.0.5)
#>  P ellipsis      0.3.2   2021-04-29 [?] CRAN (R 4.0.5)
#>  P evaluate      0.14    2019-05-28 [?] CRAN (R 4.0.5)
#>  P fansi         1.0.2   2022-01-14 [?] CRAN (R 4.0.5)
#>  P fastmap       1.1.0   2021-01-25 [?] CRAN (R 4.0.5)
#>  P fs            1.5.2   2021-12-08 [?] CRAN (R 4.0.5)
#>  P glue          1.6.1   2022-01-22 [?] CRAN (R 4.0.5)
#>  P highr         0.9     2021-04-16 [?] CRAN (R 4.0.5)
#>  P htmltools     0.5.2   2021-08-25 [?] CRAN (R 4.0.5)
#>  P knitr         1.37    2021-12-16 [?] CRAN (R 4.0.5)
#>  P lifecycle     1.0.1   2021-09-24 [?] CRAN (R 4.0.5)
#>  P magrittr      2.0.2   2022-01-26 [?] CRAN (R 4.0.5)
#>  P pillar        1.7.0   2022-02-01 [?] CRAN (R 4.0.5)
#>  P pkgconfig     2.0.3   2019-09-22 [?] CRAN (R 4.0.5)
#>  P purrr         0.3.4   2020-04-17 [?] CRAN (R 4.0.5)
#>    R.cache       0.15.0  2021-04-30 [2] CRAN (R 4.0.5)
#>    R.methodsS3   1.8.1   2020-08-26 [2] CRAN (R 4.0.3)
#>    R.oo          1.24.0  2020-08-26 [2] CRAN (R 4.0.3)
#>    R.utils       2.11.0  2021-09-26 [2] CRAN (R 4.0.5)
#>  P R6            2.5.1   2021-08-19 [?] CRAN (R 4.0.5)
#>  P reprex        2.0.1   2021-08-05 [?] CRAN (R 4.0.5)
#>  P rlang         1.0.1   2022-02-03 [?] CRAN (R 4.0.5)
#>  P rmarkdown     2.11    2021-09-14 [?] CRAN (R 4.0.5)
#>  P rstudioapi    0.13    2020-11-12 [?] CRAN (R 4.0.5)
#>  P sessioninfo   1.2.2   2021-12-06 [?] CRAN (R 4.0.5)
#>  P stringi       1.7.6   2021-11-29 [?] CRAN (R 4.0.5)
#>  P stringr       1.4.0   2019-02-10 [?] CRAN (R 4.0.5)
#>    styler        1.6.2   2021-09-23 [2] CRAN (R 4.0.5)
#>  P tibble        3.1.6   2021-11-07 [?] CRAN (R 4.0.5)
#>  P tidyselect    1.1.2   2022-02-21 [?] CRAN (R 4.0.5)
#>  P utf8          1.2.2   2021-07-24 [?] CRAN (R 4.0.5)
#>  P vctrs         0.3.8   2021-04-29 [?] CRAN (R 4.0.5)
#>  P withr         2.4.3   2021-11-30 [?] CRAN (R 4.0.5)
#>  P xfun          0.29    2021-12-14 [?] CRAN (R 4.0.5)
#>  P yaml          2.3.5   2022-02-21 [?] CRAN (R 4.0.5)
#>
#>  [1] C:/Users/jalsal02/R/renv/library/arrow-nightly-d7265b80/R-4.0/x86_64-w64-mingw32
#>  [2] C:/Users/jalsal02/R/dev-library/4.0
#>  [3] C:/Program Files/R/R-4.0.5/library
#>
#>  P – Loaded and on-disk path mismatch.
#>
#> ------------------------------------------------------------------------------
```
asfimport commented 2 years ago

Jameel Alsalam: I'm sorry for the garbled output. I am rendering using the reprex package, but I can't figure out how to make it look nice in JIRA.

asfimport commented 2 years ago

Will Jones / @wjones127: Yes, I think I can reproduce this. Essentially, R and Arrow report freeing that memory, but the OS reports it as still in use. I think this is actually expected behavior for the underlying memory pools; they tend not to release memory very aggressively, with the expectation that they will reuse it.

If you instead use the system allocator, you should see this issue go away (I did when I tested locally):


Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") # You must run this before library(arrow)
library(arrow)
arrow_info()$memory$backend_name
# [1] "system"

However, depending on your application, you probably don't want to do this. While it may appear to use more memory, it's not necessarily the case that the allocator's way of handling memory is worse. See these discussions:

asfimport commented 2 years ago

Christian: Got it - thank you. I will try it with "system".

asfimport commented 2 years ago

Christian: I have a conceptual question though: does the current setup create a "copy" of the file within the arrow memory (meaning, does it read the entire file into arrow and then load it into R)? Because for large data frames the double counting would be an issue.

 

And additionally, even if it isn't fully a memory leak, it seems that once I delete the object and then load another one, the freed space isn't reused at all - Arrow reserves more/incremental memory, so the system just starts running out of space.

asfimport commented 2 years ago

Will Jones / @wjones127:

> I have a conceptual question though: Does the current setup create a "copy" of the file within the arrow memory (meaning does it read the entire file into arrow, and then load it into R)? Because for large data frames the double counting would be an issue?

Yes, that's right. Basically all the Arrow readers will first read the data into Arrow format and then, if you asked for it as an R data frame, "convert" it to a data frame. Now, some of the conversion is really cheap; I think simple vectors like integer and numeric are essentially zero-copy. IIRC there are some, like structs, that require more transformation to become their R analogue.
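As a sketch of that two-step flow (using the file name from the thread; read_arrow with as_data_frame = FALSE returns an Arrow Table, and as.data.frame does the conversion):

``` r
library(arrow)

# Step 1: the file is first materialized as an Arrow Table in Arrow's memory pool
tbl <- read_arrow("file.arrow5", as_data_frame = FALSE)

# Step 2: conversion to an R data frame; simple integer/numeric columns can be
# near zero-copy via altrep, while other types are copied/transformed
df <- as.data.frame(tbl)
```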

asfimport commented 2 years ago

Will Jones / @wjones127:

> And additionally, even if it isn't fully a memory leak, it seems that once I delete the object and then load another one, the space isn't reused at all - Arrow is reserving more/incremental memory. So the system just starts running out of space.

Then that sounds like you might be encountering the mimalloc bug I mentioned, which is out of scope for this Arrow issue, unfortunately. I think I'll create a new issue to look at mimalloc v2, which supposedly doesn't have this problem; according to the author, the only reason not to use it is that some users report performance regressions moving from v1 to v2.

asfimport commented 2 years ago

Christian: Okay, thanks. As said, I will try it with "system" and see if that works better. I'm not sure if I will be able to test it before Sunday, but it would be great if you could leave this issue open in case I find some other problems.

asfimport commented 2 years ago

Christian: Three additional questions:

1. Did the memory model (i.e. keeping a copy within arrow) change after 0.15, or was it introduced afterwards? That was the previous version I was using and I never had these kinds of memory "issues" with it (understood that "issues" is not necessarily the right word). The double counting just seems very punitive.
2. I just tried "system" and it does free it up (as you said), but for a while R is using about 70gb when the actual object size within R is just 30gb. Do you know if R factors (they show up as dictionary in the table schema) are especially punitive compared to strings?
3. Were you able to set Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") within RStudio? I tried a few different ways and it always just shows me mimalloc. It does work in an R console window (which is where I did the above test). I even completely restarted RStudio - whatever I do it stays at mimalloc.

asfimport commented 2 years ago

Will Jones / @wjones127:

> Did the memory model (i.e. keeping a copy within arrow) change after 0.15 or was it introduced afterwards? That was the previous version I was using and I never had these kinds of memory "issues" with it (understood that "issues" is not necessarily the right word). The double counting just seems very punitive.

No, as far as I know it should have gotten better since then, with fewer copies made. This has been done by implementing "altrep" conversions to R vectors, which allows R to use the existing Arrow array memory instead of copying the data. With each new version we've implemented this for additional data types. For example, here's the PR for integer and numeric vectors: https://github.com/apache/arrow/pull/10445.

However, I just noticed that altrep was just implemented for ChunkedArray in 7.0.0, so you might not be getting the full benefit in 6.0.1 (since your 30GB file is most likely made up of multiple chunks). So it is likely worth retrying in 7.0.0.

> I just tried "system" and it does free it up (as you said) but for a while R is using about 70gb when the actual object size within R is just 30gb.

It's hard to do any computation (read, aggregate, write, whatever) without creating some sort of intermediate result. For a 30GB file, that sounds pretty normal. Are you saying you can measure lower peak memory use in Arrow 0.15?

> Do you know if R factors (they show up as dictionary in the table schema) are especially punitive compared to strings?

It's hard to say, and I think it depends on your R version. But in 6.0.1 altrep was implemented for strings, and it won't be implemented for factors until 8.0.0. I think the best thing to do would be to save a file with just a string or just a factor and then test the peak vs. result memory of each.

> Were you able to set Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") within RStudio? I tried a few different ways and it always just shows me mimalloc. It does work in an R console window (which is where I did the above test). I even completely restarted RStudio - whatever I do it stays at mimalloc.

Yes, I tested in RStudio. Make sure to do Session > Restart R before you do this.
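A rough sketch of the string-vs-factor experiment suggested above (hypothetical file names; the column contents and sizes are made up for illustration):

``` r
library(arrow)

# Write one file with a string column and one with a factor column (sketch)
n <- 1e7
write_feather(data.frame(s = sample(letters, n, replace = TRUE)), "strings.arrow")
write_feather(data.frame(f = factor(sample(letters, n, replace = TRUE))), "factors.arrow")

for (path in c("strings.arrow", "factors.arrow")) {
  invisible(gc(reset = TRUE))   # reset R's "max used" counters
  df <- read_feather(path)
  cat(path, "- peak R memory:", gc()["Vcells", 6], "Mb; object size:",
      format(object.size(df), units = "Mb"), "\n")
  rm(df)
}
```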

 

asfimport commented 2 years ago

Christian: Please ignore what's below - it still isn't working in the session, but I just put it into the .Rprofile for now and that actually seems to set it correctly.

 

Thanks.

So I tried pretty much everything in Rstudio but no dice. I'm literally running the following commands right away when starting RStudio - but it always comes back with mimalloc.

Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system")
Sys.getenv('ARROW_DEFAULT_MEMORY_POOL')
library(arrow)
arrow_info()$memory$backend_name

 

Within a normal R console it works fine.
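For reference, the .Rprofile workaround mentioned above amounts to a single line that runs at startup, before any package is loaded (a sketch; assumes the default ~/.Rprofile location):

``` r
# In ~/.Rprofile (hypothetical location): set the allocator before arrow can load
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
```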