Closed by asfimport.
Will Jones / @wjones127: Hi Christian,
That's very odd. Could you check these two things to help us identify the issue?
1. Could you share the output of arrow::arrow_info()? I'm particularly interested in checking which allocator/memory pool is being used. You can change that with the ARROW_DEFAULT_MEMORY_POOL environment variable.
2. Could you try passing as_data_frame = FALSE into arrow::read_arrow()? That could help us determine whether the issue is in the file read or in the conversion to an R data frame.
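For example, a rough sketch of both checks (the file name is taken from your commands):
{code:r}
# 1. Which allocator/memory pool is Arrow using?
arrow::arrow_info()$memory$backend_name

# (The pool can be switched in a fresh session, before arrow is loaded, with:)
# Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")

# 2. Read into an Arrow Table instead of converting to an R data frame
tbl <- arrow::read_arrow("file.arrow5", as_data_frame = FALSE)
{code}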
Christian: Apologies for the late reply. I just checked it after a full computer restart and it is exactly the same problem. Interestingly, this time the full memory usage went to 90gb, and after deleting the object plus gc() it got stuck at 60gb. So it is the same problem, just with somewhat lower total numbers. This holds both in RStudio and in an R terminal.
Below the requested outputs. I also added what it shows on a gc() and what windows shows as resource usage.
This happens with as_data_frame = TRUE (the default setting of read_arrow), since I don't need to make any changes to the data frame when loading it in.
And to reiterate - under Linux it frees up all resources after calling gc().
arrow::arrow_info()
Arrow package version: 6.0.1

Capabilities:
dataset    TRUE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli    FALSE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():
arrow.use_threads  FALSE

Memory:
Allocator  mimalloc
Current    0 bytes
Max        34.31 Gb

Runtime:
SIMD Level           avx512
Detected SIMD Level  avx512

Build:
C++ Library Version   6.0.1
C++ Compiler          GNU
C++ Compiler Version  8.3.0
Git ID                d132a740e33ec18c07b8718e15f85b4080a292ff

gc()
            used  (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   1792749  95.8    3428368   183.1    2914702   155.7
Vcells   4673226  35.7 2939373019 22425.7 3943230076 30084.5

ls()
character(0)
Christian: Also if you want me to test the problem with a specific arrow5 file I'm happy to do so.
Christian: I did some more testing (all the reading is done within R with Arrow 6.0.1). It looks like there are a few things here:
1) I read a file that was written in Arrow 5 (the file is ~30gb and was written directly with the C#/C++ interface). That one increases the memory usage to ~30-38gb, but then on gc() the memory usage goes down to 8gb and doesn't free up everything. I'm not sure why that is, but that's acceptable. The file only has chr/Date/num/int columns. Calling arrow_info() yields the following (same result after loading/deleting the df):
Allocator mimalloc Current 0 bytes Max 0 bytes
2) Reading the file from last week (~10gb, written in Arrow 6.0.1 from R) again yields the same result as last week. Note that here I also have the factor/logical types, which arrow seems to store and read.
Allocator mimalloc Current 4.19 Kb Max 34.31 Gb
3) As a test I did a write_arrow on the file from 2), but I ran unfactor on all the factor columns first. Same issue as in 2), so it doesn't look like the factor type is the issue.
4) As a final test I read the file from 1) and did a write_arrow on it from R. The issue comes up again after reading it back in.
Before deletion:
Allocator mimalloc Current 28.2 Gb Max 28.2 Gb
After deletion:
Allocator mimalloc Current 0 bytes Max 28.2 Gb
So the issue seems to be with writing the arrow file from R. All I do is call write_arrow('file.arrow5'). Is there a problem with that?
Christian: As one final test I wrote the arrow file in 4 different ways:
c("default", "lz4", "uncompressed", "zstd") %>% walk(~{ log_info(.x) write_feather( testdf, glue('C:/Temp/dftest{.x}.arrow'), compression = .x ) })
It seems that only when writing it uncompressed does the memory issue not occur - then it behaves as expected, and the below holds as well:
Allocator mimalloc Current 0 bytes Max 0 bytes
Does this make sense to anyone? Is this a bug, or is this expected behavior on Windows?
As said, in Linux I don't have that issue (even though the max memory usage jumps up to twice the object size during reading, it is freed up again afterwards).
Will Jones / @wjones127: Hi Christian,
Did you do any tests with as_data_frame = FALSE?
There are two separate stores of memory: R's memory and Arrow's memory pool (which on Windows defaults to mimalloc). arrow_info() prints the stats for the Arrow memory pool, and it looks like that pool is freeing things correctly, right? gc(), on the other hand, reports on R's memory system, and it sounds like R isn't freeing memory there correctly. Testing with as_data_frame = FALSE would help confirm this; a quick way to look at both stores is sketched below.
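For example (approximate numbers are fine here):
{code:r}
# Arrow's memory pool (C++ allocations; defaults to mimalloc on Windows)
arrow::arrow_info()$memory$bytes_allocated

# R's own heap, as reported by the garbage collector (Vcells "used", in Mb)
gc()["Vcells", 2]
{code}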
If what I've said above is correct, it seems like either there might be a bug in R for Windows, or there is something wrong in Arrow. If you could describe the data a little more (or even share a sample), perhaps I could reproduce it?
Christian: I have not tried that yet but will do so soon.
As said, reading a non-compressed version seems to have worked - not sure if that indicates what the issue might be.
Unfortunately I can't share the data set, but given that I was able to reproduce it with 2 different data sets I don't think it has anything to do with the data itself. If you have a reference arrow file you want me to download so that we can both test it, I'm very happy to do so.
Will Jones / @wjones127:
[~Klar]
to be more specific, could you share the output of the following:
options(arrow.use_threads=FALSE);
arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
table <- arrow::read_arrow('file.arrow5', as_data_frame = FALSE)
arrow_info()$memory
gc()
table$schema
Christian: Note that this is with a cut down version. Here the size of the file is ~1gb and is written with the "default" compression. It takes about 3-5gb when reading it into R. And the space that doesn't get freed up is ~7gb.
(Deleting table and running another gc() keeps the 7gb allocated.)
arrow_info()$memory
$backend_name
[1] "mimalloc"

$bytes_allocated
[1] 5379819648

$max_memory
[1] 5379819648

$available_backends
[1] "mimalloc" "system"

gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  2625778 140.3    4439937 237.2  4439937 237.2
Vcells  7082576  54.1   12255594  93.6  9236142  70.5

table$schema
Schema (field names not captured in the paste):
int32 date32[day] string string string date32[day] string string double double double int32 int32 string double string double double string double double string string double string bool date32[day] string string string string string string int32 int32 int32 int32 int32 int32 string int32 string int32 string string string string string string string string string string string string string string string string string double double int32 double date32[day] dictionary
dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary dictionary
Will Jones / @wjones127: Not able to reproduce yet, but I am noticing that the memory usage reporting is a little confusing in RStudio. To make sure we are on the same page, my test code is below.
Does this align with how you are measuring memory use? Or are you measuring it somewhere else?
Does this example data show the memory leaking behavior you are seeing?
{code:r}
library(arrow)

print_memory <- function() {
  print(sprintf("Arrow: %s MB", trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
  print(sprintf("R: %s MB", gc()["Vcells", 2]))
}

# Create example data
size <- 1E8
my_table <- arrow_table(
  x = Array$create(sample(letters, size, replace = TRUE)),
  y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
  z = Array$create(as.Date(1:size, as.Date("2020-01-01"))),
  a = Array$create(1:size, type = int32())
)

arrow::write_arrow(my_table, "file.arrow5")
remove(my_table)

# Note: you may need to wait a few seconds for the Arrow memory pool to free memory
print_memory()
[1] "Arrow: 0 MB"
[1] "R: 14.1 MB"

options(arrow.use_threads = FALSE)
arrow::set_cpu_count(1)  # need this - otherwise it freezes under Windows
table <- arrow::read_arrow('file.arrow5')
print_memory()
[1] "Arrow: 1335 MB"
[1] "R: 1158.5 MB"

remove(table)
print_memory()
[1] "Arrow: 0 MB"
[1] "R: 14.1 MB"
{code}
Christian: Yes this is enough to reproduce the issue. I ran it.
When deleting the table object, the leftover memory usage is ~2.5gb - I can't get it back no matter how many gc() calls I do - but I do get it back when either reading the file on a Linux machine or reading an uncompressed arrow file.
I'm generally measuring memory usage in ballpark terms by looking at Process Explorer. It's not exact, but it still gives a good view of whether something isn't being freed up.
Jameel Alsalam:
Hello, I think I have reproduced the issue here. About 1.5 GB appears to still be in use after the remove statement. I am on CRAN arrow 7.0.0. I was interested in this issue because I have tried to diagnose a different arrow memory issue involving write_dataset. In my investigations, the memory reported internally by gc() or arrow is quite different from what is reported by Windows via, e.g., Task Manager. I have found a way to get Task Manager-like memory figures by running system2("tasklist", stdout = TRUE)
and then filtering for the right process. Below I ran your script with the additional memory info.
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
print_memory <- function() {
print(sprintf("Arrow: %s MB", trunc(arrow_info()$memory$bytes_allocated / 1024 / 1024)))
print(sprintf("R: %s MB", gc()["Vcells", 2]))
print((function(t) t[grep(Sys.getpid(), t)])(system2("tasklist", stdout = TRUE)))
}
# Create example data
size <- 1E8
print_memory()
#> [1] "Arrow: 0 MB"
#> [1] "R: 9.8 MB"
#> [1] "Rterm.exe 19096 Console 2 125,672 K"
my_table <- arrow_table(
x = Array$create(sample(letters, size, replace = TRUE)),
y = Array$create(as.factor(sample(letters, size, replace = TRUE))),
z = Array$create(as.Date(1:size, as.Date("2020-01-01"))),
a = Array$create(1:size, type=int32())
)
arrow::write_arrow(my_table, "file.arrow5")
#> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
remove(my_table)
# Note: you may need to wait a few seconds for the Arrow memory pool to free memory
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 953 MB"
#> [1] "R: 392.6 MB"
#> [1] "Rterm.exe 19096 Console 2 563,344 K"
options(arrow.use_threads=FALSE);
arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
table <- arrow::read_arrow('file.arrow5')
#> Warning: Use 'read_ipc_stream' or 'read_feather' instead.
print_memory()
#> [1] "Arrow: 1335 MB"
#> [1] "R: 1156.2 MB"
#> [1] "Rterm.exe 19096 Console 2 2,709,252 K"
remove(table)
Sys.sleep(5)
print_memory()
#> [1] "Arrow: 858 MB"
#> [1] "R: 11.8 MB"
#> [1] "Rterm.exe 19096 Console 2 1,534,436 K"
Created on 2022-02-22 by the reprex package (v2.0.1)
Jameel Alsalam: I'm sorry for the garbled output. I am rendering using the reprex package, but I can't figure out how to make it look nice in JIRA.
Will Jones / @wjones127: Yes, I think I can reproduce this. Essentially, R and Arrow report freeing that memory, but the OS reports that memory as still used. I think this is actually expected behavior for the underlying memory pools; they tend not to release memory very aggressively, with the expectation that it will be reused.
If you instead use the system allocator, you should see this issue go away (I did when I tested locally):
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") # You must run this before library(arrow)
library(arrow)
arrow_info()$memory$backend_name
# [1] "system"
However, depending on your application you probably don't want to do this. While it may appear to use more memory, it's not necessarily the case that the way the allocator is handling memory is worse. See these discussions:
https://issues.apache.org/jira/browse/ARROW-14790?focusedCommentId=17447365
To pull out one quote from one of the maintainers of mimalloc:
However, generally mimalloc will only hold on to virtual memory and will return physical memory to the OS. Now, generally mimalloc flags unused memory as available to the OS and the OS will use that memory when there is memory pressure (MEM_RESET on windows, MADV_FREE on Linux) – however, the OS does not always show that memory as available (even though it is) as it is only reclaimed under memory pressure.
There is some possibility that there is a bug in the mimalloc version we are using (1.7.3), but the next release (2.0.x) is still in alpha: https://github.com/microsoft/mimalloc/issues/383
Christian: I have a conceptual question though: does the current setup create a "copy" of the file within the arrow memory (meaning does it read the entire file into arrow and then load it into R)? Because for large data.frames the double counting would be an issue.
And additionally, even if it isn't fully a memory leak, it seems that once I delete the object and then load another one, that space isn't reused at all - Arrow reserves more memory on top. So the system just starts running out of space.
Will Jones / @wjones127:
I have a conceptual question though: does the current setup create a "copy" of the file within the arrow memory (meaning does it read the entire file into arrow and then load it into R)? Because for large data.frames the double counting would be an issue.

Yes, that's right. Basically all the Arrow readers will first read the data into Arrow format and then, if you asked for it as an R data frame, will "convert" it to a data frame. Now, some of the conversion is really cheap; I think simple vectors like integer and numeric are essentially zero copy. IIRC there are some types, like structs, that require more transformation to become their R analogue.
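As a rough sketch of that two-step path (reading into Arrow memory first, then converting explicitly; the file name is just the one from earlier in this thread):
{code:r}
library(arrow)

# Step 1: read the file into Arrow's memory pool only (no R data frame yet)
tbl <- read_arrow("file.arrow5", as_data_frame = FALSE)  # an Arrow Table

# Step 2: convert to an R data frame; simple integer/double columns can be
# handed to R via altrep without copying, other types are materialized in R's heap
df <- as.data.frame(tbl)

# Arrow's pool still backs the Table (and any altrep-backed columns of df)
arrow_info()$memory$bytes_allocated
{code}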
Will Jones / @wjones127:
And additionally, even if it isn't fully a memory leak, it seems that once I delete the object and then load another one, that space isn't reused at all - Arrow reserves more memory on top. So the system just starts running out of space.

Then that sounds like you might be encountering the mimalloc bug I mentioned, which is unfortunately out of scope for this Arrow issue. I think I'll create a new issue to look at mimalloc v2, which supposedly doesn't have this problem; according to the author, the only reason not to use it is that some users report performance regressions when moving from v1 to v2.
Christian: Okay, thanks. As said, I will try it with "system" and see if it goes better. I'm not sure if I will be able to test it before Sunday, but it would be great if you could leave this issue open in case I find some other problems.
Christian: Three additional questions:
1) Did the memory model (i.e. keeping a copy within arrow) change after 0.15, or was it introduced afterwards? That was the previous version I was using, and I never had these kinds of memory "issues" with it (understood that "issues" is not necessarily the right word). The double counting just seems very punitive. I just tried "system" and it does free it up (as you said), but for a while R is using about 70gb when the actual object size within R is just 30gb.
2) Do you know if R factors (which show up as dictionary in the table schema) are especially punitive compared to strings?
3) Were you able to set Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") within RStudio? I tried a few different ways and it always just shows me mimalloc. It does work in an R console window (which is where I did the above test). I even completely restarted RStudio - whatever I do it stays at mimalloc.
For the last point see below:
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system")
Sys.getenv('ARROW_DEFAULT_MEMORY_POOL')
[1] "system"
library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

arrow_info()$memory$backend_name
[1] "mimalloc"
Will Jones / @wjones127:
Did the memory model (i.e. keeping a copy within arrow) change after 0.15, or was it introduced afterwards? That was the previous version I was using, and I never had these kinds of memory "issues" with it (understood that "issues" is not necessarily the right word). The double counting just seems very punitive.

No, as far as I know it should have gotten better since then, with fewer copies made. This has been done by implementing "altrep" conversions to R vectors, which allow R to use the existing Arrow array memory instead of copying the data. With each new version we have implemented this for additional data types. For example, here's the PR for integer and numeric vectors: https://github.com/apache/arrow/pull/10445.
However, I just noticed that altrep was not implemented for ChunkedArray until 7.0.0, so you might not be getting the full benefit in 6.0.1 (since your 30gb file is most likely made up of multiple chunks). So it is likely worth retrying in 7.0.0.
I just tried "system" and it does free it up (as you said) but for a while R is using about 70gb when the actual object size within R is just 30gb. It's hard to do any computation (read, aggregate, write, whatever) without creating some sort of intermediate result. For a 30GB file, that sounds pretty normal. You saying you can measure lower peak memory use in Arrow 0.15? Do you know if R Factors (show up as dictionary in the table schema) are especially punitive compared to strings? It's hard to say, and I think depends on your R version. But in 6.0.1 altrep was implemented for strings, and it won't be implemented for factors until 8.0.0. I think the best thing to do would be to save a file with just a string or just a factor and then test the peak vs result memory of each. Were you able to set Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system") within Rstudio? I tried a few different ways and it always just shows me mimalloc. It does work in a R console window (which is where I did the above test). I even completely restarted Rstudio - whatever I do it stays at mimalloc.
Yes I tested in Rstudio. Make sure to do Session > Restart R before you do this.
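For instance, something along these lines (column sizes and file names are just placeholders):
{code:r}
library(arrow)

n <- 1e7
vals <- sample(letters, n, replace = TRUE)

# The same data written once as a string column and once as a factor (dictionary) column
write_feather(data.frame(x = vals), "strings.arrow")
write_feather(data.frame(x = factor(vals)), "factors.arrow")

read_and_report <- function(path) {
  gc(reset = TRUE)                       # reset R's "max used" counters
  df <- read_feather(path)
  stats <- gc()                          # collected while df is still held
  c(arrow_mb      = arrow_info()$memory$bytes_allocated / 2^20,
    arrow_peak_mb = arrow_info()$memory$max_memory / 2^20,
    r_mb          = stats["Vcells", 2],  # current R heap (Mb)
    r_peak_mb     = stats["Vcells", 6])  # peak R heap since reset (Mb)
}

read_and_report("strings.arrow")
read_and_report("factors.arrow")
{code}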
Christian: Please ignore the below - setting it interactively in RStudio still isn't working, but I just put it into my .Rprofile for now (roughly as sketched at the end of this comment) and that actually seems to set it correctly.
Thanks.
So I tried pretty much everything in RStudio, but no dice. I'm literally running the following commands right away when starting RStudio - but it always comes back with mimalloc.
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL="system")
Sys.getenv('ARROW_DEFAULT_MEMORY_POOL')
library(arrow)
arrow_info()$memory$backend_name
Within a normal R console it works fine.
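For reference, roughly what that .Rprofile entry looks like (the variable has to be set before arrow is loaded):
{code:r}
# In ~/.Rprofile (runs at R startup, before any packages are attached)
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
{code}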
Hi,
I'm trying to load a ~10gb arrow file into R (under Windows)
(The file is generated in the 6.0.1 arrow version under Linux).
For whatever reason the memory usage blows up to ~110-120gb (in a fresh and empty R instance).
The weird thing is that when I delete the object again and run gc(), the memory usage only goes down to 90gb. The delta of ~20-30gb is what I would have expected the data frame to use in memory (that is also approximately what was used in total during the load when running the old arrow version 0.15.1, and it is what R shows me when just printing the object size).
The commands I'm running are simply:
options(arrow.use_threads=FALSE);
arrow::set_cpu_count(1); # need this - otherwise it freezes under windows
arrow::read_arrow('file.arrow5')
Is arrow reserving some resources in the background and not giving them up again? Are there some settings I need to change for this?
Is this something that is known and fixed in a newer version?
Note that this doesn't happen in Linux; there, all the resources are freed up when calling gc(). Not sure if it matters, but on Linux I also don't need to set the CPU count to 1.
Any help would be appreciated.
Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127
Note: This issue was originally created as ARROW-15730. Please see the migration documentation for further details.