Open cravetheflame opened 4 months ago
Hi @Yannaubineau, thanks for the report. Both the R and Python packages have tests covering this behavior, so it's a known issue. Though, as you found out, they will happily write a Parquet file that can't be read in cases like this.
A workaround for now would be to pass an extra option to `open_dataset` that sets the limit to a high enough value:

```r
dt_error <- open_dataset("./example_error.parquet", thrift_string_size_limit = 1000000000)
```
I'm not sure we want to increase or remove the default limit, as that might cause other problems. @Yannaubineau do you think a more informative error would be enough of a fix here?
Hi @amoeba, thank you for your answer. Sorry if this is a non-issue.
The main problem I faced was the lack of any indication of the source of the error, and the absence of a warning before "creating" the error.
Thank you for the sample code, it works like a charm!
I think there are two aspects to this:

1. A warning when writing (with `write_parquet` or `write_dataset`) from a data.frame containing attributes, simply because of how massive the Parquet file can get compared to the same data.frame without any attributes. Users should be aware in some way that attributes in their data are impeding the efficiency of the binary data storage.
2. The error output is confusing. "Is this a 'parquet' file?" doesn't feel right if the error is known to be related to a string size limit parameter. So informing the user of this parameter inside the error message would definitely be an improvement.
It's actually more likely that the user pointed the reader at a random file that starts with bytes encoding a huge length value for the metadata string. Note that after asking if it's a Parquet file, it says `Couldn't deserialize thrift: TProtocolException: Exceeded size limit` -- perhaps that message could be improved with a hint on how to increase the Thrift size limit.
Thanks @Yannaubineau. I do think this is an issue and I think (1) warning when writing such a file and (2) giving the user a better error when reading one are both good improvements here. Would you have any interest in contributing either or both?
I would sure be interested, but I failed to even find where the text message originates in the code base, so it feels more right for someone else to do it.
I believe the message comes from here:
https://github.com/apache/arrow/blob/main/cpp/src/parquet/thrift_internal.h#L444-L463
The `ThriftException` is converted to a ~~`arrow::Status`~~ `ParquetException`, so more text could be added depending on what the Thrift exception is about.
Describe the bug, including details regarding any error messages, version, and platform.
Saving a data.frame with a big attribute (like an index commonly used in the `data.table` package) will make the Parquet file unreadable and produce an error.

Bug as understood from the StackOverflow issue: the normal efficiency of binary data storage in Parquet files is not afforded to R attributes, so a big attribute (like `data.table` indexes) will break the format.

This bug has important reliability implications for the Parquet format.
Reprex:
Component(s)
Parquet, R