apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

read_feather error with 32 GB file #11665

Open EB80 opened 2 years ago

EB80 commented 2 years ago

arrow v.6.0.0.2 results in the following error when attempting read_feather on a 32 GB feather file:

Error: Invalid: Invalid read (offset = 7140512496, size = -956703880)

jorisvandenbossche commented 2 years ago

@EB80 Thanks for the report!

Would you be able to share some more info? (for example the full traceback, how you created the file, whether you can read it with an older version, and if so what the schema is, ...?) Or if possible provide a reproducible example?

EB80 commented 2 years ago

Thanks for the reply,

Additional information:

- The feather was written with feather::write_feather and not arrow::write_feather
- I receive the same error with older versions of arrow (currently running 6.0.0.2)
- I actually encounter a separate issue attempting to write with arrow::write_feather: 78 MB of the 32 GB file will write, and then the writing hangs indefinitely without continuing or crashing. (I used feather::write_feather for this reason.)
- I am writing to a network drive and not a local disk
- The dataframe is 26M rows and 150 columns

Traceback:

6: stop(e)
5: value[3L]
4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
3: tryCatchList(expr, classes, parentenv, handlers)
2: tryCatch(reader$Read(columns), error = read_compressed_error)
1: arrow::read_feather("Archive/2021-06-23/2021-06-23 - Pre-Processed DECKPLATE - June 2021.feather")

Thank you, Edward

pitrou commented 2 years ago

The feather was written with feather::write_feather and not arrow::write_feather

The standalone feather package is an extremely old package that is not supported anymore, so I would not recommend using it.

78MB of the 32GB file will write, and then the writing hangs indefinitely without continuing or crashing

Is the CPU busy when doing this or idle? Do you see any disk activity?

jorisvandenbossche commented 2 years ago

Another suggestion: if it was written with the feather package, could you try reading the file with that package? And if that works, write the file again using arrow::write_feather.
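For concreteness, a minimal sketch of that round trip (the file paths are placeholders, and the whole data frame has to fit in memory):

# read with the legacy feather package (Feather V1), then rewrite with arrow
df <- feather::read_feather("old_file.feather")
arrow::write_feather(df, "rewritten.feather")  # writes Feather V2 by default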

thisisnic commented 2 years ago

In addition to pitrou's comment on retrying arrow::write_feather() and checking the CPU activity, if you like, you could generate more diagnostic information by running the code with the C++ debugger attached.

If you want to do this, there are some instructions in the dev version of the docs (https://ursalabs.org/arrow-r-nightly/articles/developers/debugging.html) but here's a short version:

  1. Start up R in a terminal with the debugger using either R -d gdb or R -d lldb - hopefully one of these should work with your OS
  2. Type run to run R
  3. Run the code you run which causes the session to hang

At this point, there should either be a load of extra output detailing the cause of the problem or the session will just hang again. If it just hangs, press Ctrl+C to stop the debugger and then type in thread apply all bt - this will generate a tonne of output but it'll be really useful stuff for us!

EB80 commented 2 years ago

Thanks all-

I read the file in with feather::read_feather, and there was not an issue. I'm assuming this was the problem with arrow::read_feather (i.e., the file was written with feather vice arrow), but now we're back to the writing error.

When I attempt to write the dataframe with arrow::write_feather, there is no disk activity once it encounters the indefinite hanging. There is a ton of memory usage, but that's expected with a huge dataframe in the RStudio workspace. I ruled out the possibility of an issue with the dataframe itself by writing it successfully with data.table::fwrite.

Regarding gdb, the MinGW download failed for some reason, and I haven't figured out an alternative way to debug via the Windows Command Prompt. (Help for a simpleton would be appreciated.)

Interestingly, while the arrow::write_feather has hung every time so far, it stopped at 156MB one time and 78MB every other time.

pitrou commented 2 years ago

@EB80 Is there any way you can share the code to produce the file (or a similar file that would fail loading with arrow::read_feather but succeed with feather::read_feather), so that we can try to reproduce?

EB80 commented 2 years ago

I expect that the issue with arrow::read_feather was just because I had used the very old feather::write_feather to write the file.

I have the following code to test arrow::write_feather:

rm(list = ls())

# set the wd to be where this script is saved
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

# set dimensions based on the real file
numRows = 26e6 # 26M rows in the real file
numCols = 150 # 150 columns in the real file

# whip up a fake dataframe
fakeDataframe <- as.data.frame(matrix("fake string", numRows, numCols))

# change the column names for aesthetic purposes, I guess
names(fakeDataframe) <- sprintf("Fake Column %s", 1:150)

# save the fake file with data.table
data.table::fwrite(fakeDataframe, "fakeFile.csv")

# save the fake file with arrow
arrow::write_feather(fakeDataframe, "fakeFile.feather")

The fwrite step took about 10 minutes to write. While the dimensions of the fake file match those of the real file, the size on the disk is much larger (46 GB vice 32 GB). I wrote a while loop to trim off rows from the fake file until it matched the real file, but object.size() was painfully slow. Either way, I figured this would be a suitable test. arrow::write_feather hangs with this fake dataframe just as before.

pitrou commented 2 years ago

cc @wesm

wesm commented 2 years ago

I recall a number of issues writing data frames over 4GB with the old feather package; there were some other issues in the past, and there may be some existing Jira issues, too.

https://github.com/wesm/feather/issues/380 https://github.com/wesm/feather/issues/372

It would be good to get to the bottom of the various issues here, at least from the arrow:: package side. I have no idea why writing a FileOutputStream to a network drive would hang (which is what it sounds like is happening).
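One way to rule the network drive itself in or out would be a plain large binary write to the same location, sketched here with base R only (the path is a placeholder, not one from this thread):

# If this completes quickly, the network share itself is probably not the bottleneck.
con <- file("//server/share/arrow_write_test.bin", "wb")
system.time(writeBin(raw(500 * 1024^2), con))  # ~500 MB of zero bytes
close(con)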

rockhowse commented 2 years ago

Did some extensive testing of this scenario on both Windows 10 (32GB and 128GB RAM machines) as well as a MacOSX (32GB) machine. I tested all using R 4.1.2 and arrow 6.0.1, as those are the latest stable versions of R and the arrow package from CRAN. The most consistent producer of the scenario was the Windows 10 64GB machine using the default Page File configuration.

All testing steps, observations, and screenshots of RStudio plus memory/swap consumption during testing are in this GitHub repo:

https://github.com/rockhowse/apache_arrow_32GB_write_feather

A prime example of the state you can get into, even with 128GB of RAM, with the provided R code is shown in a screenshot in that repo.

Here's what I see occurring:

Interestingly enough... when testing the same versions of R and arrow on MacOSX with a 32GB box, it does in fact complete, but it takes forever due to MacOSX having to constantly resize the swap partition to accommodate the increased requirement for virtual memory. It peaks out at/around 32GB of swap.

I haven't yet re-configured my Windows developer environment on my new rig, but I will set it up and see what the debugger has to say. My gut feeling is that we are in a DOM vs SAX type scenario, where the feather write implementation as it's currently consumed from R uses a lot more memory than might be obvious with smaller data sets. Given that there is a long period of time with no disk IO but lots of CPU and memory use, I believe it's not "streaming" as effectively as it could be.

Hopefully this is helpful. I used this issue as a way for me to get back up to speed with arrow as it's been a couple years. I think I will start by using R debugging as suggested by @thisisnic. I am not well versed in lower level debugging in R, but if I can hook into a debug build of the native arrow lib being used, it should open up a bit more visibility into what's going on under the hood during that call.

thisisnic commented 2 years ago

@romainfrancois Jon suggested I ask you whether there is any batching happening in the R data frame to Arrow Table conversion. If not, that could explain what's happening here.

pitrou commented 2 years ago

@paleolimbot @wjones127 Perhaps one of you would like to take a cursory look at this?

paleolimbot commented 2 years ago

I can confirm that there's no batching happening (1) when converting the data.frame to a Table or (2) when converting an R vector to a ChunkedArray: no matter how big the data frame, it will always (to my reading of the code) be completely converted to a Table whose member chunked arrays consist of a single chunk prior to getting written as Feather.

There is an open Jira (ARROW-15405) to allow write_ipc_stream(), write_feather(), and write_parquet() to accept a RecordBatchReader (write_csv_arrow() already does, thanks to Nic!). In combination with a chunking as_record_batch_reader() method for data.frame, that would almost certainly solve the hang-on-write issue.
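Until something along those lines lands, one possible interim workaround (a sketch only, not arrow's documented approach; write_feather_chunked is a hypothetical helper and the chunk size is a guess) is to drive arrow's lower-level IPC writer in row chunks:

library(arrow)

# Hypothetical helper: write a data frame as an Arrow IPC file (Feather V2)
# one row chunk at a time, instead of materializing a single huge Table.
write_feather_chunked <- function(df, path, chunk_rows = 1e6L) {
  n <- nrow(df)
  starts <- seq(1L, n, by = chunk_rows)

  # Use the first chunk to establish the schema for the file.
  first <- record_batch(df[seq_len(min(chunk_rows, n)), , drop = FALSE])
  sink <- FileOutputStream$create(path)
  writer <- RecordBatchFileWriter$create(sink, first$schema)

  writer$write_batch(first)
  for (start in starts[-1L]) {
    end <- min(start + chunk_rows - 1L, n)
    writer$write_batch(record_batch(df[start:end, , drop = FALSE]))
  }

  writer$close()
  sink$close()
}

The result is an Arrow IPC file readable with arrow::read_feather(), but unlike write_feather() (which compresses with lz4 by default when available) it is uncompressed, and for a 26M-row, mostly-string data frame a smaller chunk size may be needed to keep peak memory down.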

It sounds like the root cause of the hang, though, is that something about the data frame to Feather write operation is using a lot more memory in some cases than anybody can explain. It would be helpful to have a minimal reproducer for that which runs in a reasonable amount of time... I know how to profile R memory usage (bench::mark() will do it using the profmem package), but I don't have a strategy for profiling other allocations beyond inspecting the default memory pool.

Something that crossed my mind as I was writing the reprex below is that in R we have a global string pool, which means that c("string one", "string one", "string one") only stores "string one" once. If we expand that to an Arrow string() array all at once, we copy "string one" a lot of times and potentially use more memory than lobstr::obj_size() might suggest.
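A rough back-of-envelope illustration of that point (the numbers are approximate and not measured from this issue):

x <- rep("string one", 1e6)
lobstr::obj_size(x)
# ~8 MB: 1e6 pointers into R's global string pool plus one shared copy of the string.
# An Arrow string() array stores the bytes for every element, so converting x
# needs roughly 10 bytes * 1e6 (~10 MB) of value buffer plus ~4 MB of 32-bit
# offsets, regardless of what lobstr::obj_size() reports for the R vector.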

Perhaps a starting place:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

big_df <- vctrs::vec_rep(ggplot2::mpg, 1e4)
lobstr::obj_size(big_df)
#> 168.49 MB
tf <- tempfile()

bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 0B

bench::mark(write_feather(big_df, tf))
#> # A tibble: 1 × 6
#>   expression                     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 write_feather(big_df, tf)    224ms    230ms      4.37    3.02MB        0
bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 392MB
bench::as_bench_bytes(file.size(tf))
#> [1] 56.5MB

Created on 2022-08-04 by the reprex package (v2.0.1)