apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] write_parquet() a large data.table with index will make the parquet file unreadable #40050

Open cravetheflame opened 4 months ago

cravetheflame commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Saving a data.frame with a big attribute (like the index commonly used by the data.table package) makes the Parquet file unreadable and produces this error:

Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. 
Is this a 'parquet' file?: 
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit

Bug as understood from the Stack Overflow question: the compact binary storage that Parquet provides for column data is not afforded to R attributes, so a big attribute (like a data.table index) can break the file.

This bug has important reliability implications for the Parquet format.

Reprex:

library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table (y is recycled to 1e7 rows)
dt <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter: data.table's auto-indexing attaches a large "index" attribute to dt
dt[x == 989L]
# Save in parquet format again: the index attribute is embedded in the file metadata
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")
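As a side note, the oversized metadata comes from the auto-index that the filter above attaches to dt. The following sketch (assuming data.table's documented indices() and setindex() helpers) shows how to inspect and drop that attribute before writing:

indices(dt)                       # "x": the auto-index created by dt[x == 989L]
object.size(attr(dt, "index"))    # roughly 40 MB for 1e7 rows
setindex(dt, NULL)                # drop all secondary indices
write_parquet(dt, "example_fixed.parquet")
open_dataset("example_fixed.parquet")   # readable again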

Component(s)

Parquet, R

amoeba commented 4 months ago

Hi @Yannaubineau, thanks for the report. Both the R and Python packages have tests covering this behavior so it's a known issue. Though, as you found out, they will happily write a Parquet file that can't be read in cases like this.

A workaround for now would be to pass an extra option to open_dataset that raises the limit to a high-enough value:

dt_error <- open_dataset("./example_error.parquet", thrift_string_size_limit=1000000000)

I'm not sure we want to increase or remove the default limit, as that might cause other problems. @Yannaubineau do you think a more informative error would be enough of a fix here?

cravetheflame commented 4 months ago

Hi @amoeba, thank you for your answer. Sorry if it is a non-issue.

The main problem I faced was the lack of any indication of the source of the error, and the absence of a warning before the unreadable file was written.

Thank you for the sample code, it works like a charm!

I think there are two aspects to this:

felipecrv commented 4 months ago

> The error output is confusing; Is this a 'parquet' file? doesn't feel right if the error is known to be related to a string size limit parameter. So informing the user of this parameter inside the error message would definitely be an improvement.

It's actually more likely that the user pointed the reader at a random file that happens to start with bytes encoding a huge length value for the metadata string. Note that after asking whether it's a Parquet file, the message says Couldn't deserialize thrift: TProtocolException: Exceeded size limit -- perhaps that message could be improved with a hint on how to increase the Thrift size limit.
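Until the C++ message itself is improved, the kind of hint discussed here can be approximated from R; a purely illustrative sketch (not the proposed fix, which would live in the C++ reader) that wraps open_dataset() and points at thrift_string_size_limit when the Thrift limit is hit:

open_dataset_with_hint <- function(sources, ...) {
  tryCatch(
    open_dataset(sources, ...),
    error = function(e) {
      msg <- conditionMessage(e)
      if (grepl("Exceeded size limit", msg, fixed = TRUE)) {
        # Surface the parameter the current error message does not mention
        stop(msg, "\nHint: the file metadata may exceed the default Thrift limit; ",
             "retry with thrift_string_size_limit = <larger value>.", call. = FALSE)
      }
      stop(e)
    }
  )
}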

amoeba commented 4 months ago

Thanks @Yannaubineau. I do think this is an issue and I think (1) warning when writing such a file and (2) giving the user a better error when reading one are both good improvements here. Would you have any interest in contributing either or both?
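For point (1), a rough R-level illustration (hypothetical helper and threshold; a real check would belong inside arrow's metadata serialization) of warning before writing a file whose R attributes are likely to exceed the reader's default limit:

write_parquet_checked <- function(x, sink, ..., max_attr_bytes = 1e6) {
  # Attributes beyond names/class/row.names get serialized into the
  # Parquet key-value metadata; warn when they are suspiciously large.
  extra <- attributes(x)
  extra <- extra[setdiff(names(extra), c("names", "class", "row.names"))]
  sz <- as.numeric(object.size(extra))
  if (sz > max_attr_bytes) {
    warning("R attributes of ~", format(sz, big.mark = ","), " bytes will be ",
            "embedded in the Parquet metadata and may exceed default reader limits.")
  }
  write_parquet(x, sink, ...)
}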

cravetheflame commented 4 months ago

I would certainly be interested, but I couldn't even find where the error message originates in the code base, so it feels more appropriate for someone else to do it.

felipecrv commented 4 months ago

I believe the message comes from here:

https://github.com/apache/arrow/blob/main/cpp/src/parquet/thrift_internal.h#L444-L463

ThriftException is converted to a ParquetException (not an arrow::Status), so more text could be added depending on what the Thrift exception is about.