apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Platform-dependent hashes of parquet files? #40202

Open emmamendelsohn opened 8 months ago

emmamendelsohn commented 8 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Moving this from the rOpenSci Slack. Our team has Mac, Linux, and Windows users, and we have found that we get three different hashes when saving the same data as Parquet files.

arrow::write_parquet(mtcars, "mtcars.parquet")
digest::digest("mtcars.parquet", file = TRUE)

Mac: "05be83226acb5d2a673d922ff9f69414"
Linux: "8bddf47bdbede54d87ec3c4cbec280da"
Windows: "bef251d299843f07348248416572edab"

When uncompressed, we get the same hashes for Linux and Windows, different for Mac.

arrow::write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed" )
digest::digest("mtcars.parquet", file = TRUE)

Mac: "58ec2e7a6d614db15fc2123455a83a7e"
Linux: "4f3f049ffebdb395c489864e90d5e36b"
Windows: "4f3f049ffebdb395c489864e90d5e36b"

arrow_info() for our three systems:

Mac

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator mimalloc
Current   0 bytes
Max       50.62 Kb

Runtime:

SIMD Level          none
Detected SIMD Level none

Build:

C++ Library Version  14.0.0
C++ Compiler         AppleClang
C++ Compiler Version 15.0.0.15000040
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
```

Linux

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator jemalloc
Current   0 bytes
Max       0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  14.0.0
C++ Compiler         GNU
C++ Compiler Version 11.4.0
```

Windows

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():

arrow.use_threads FALSE

Memory:

Allocator mimalloc
Current   0 bytes
Max       0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  14.0.0
C++ Compiler         GNU
C++ Compiler Version 10.3.0
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
```

Component(s)

Parquet, R

jonkeane commented 8 months ago

Could you try using parquet-tools or the Parquet CLI to inspect the different files and see whether there are any differences? If you can, posting the output for each file here would be helpful.

I suspect the different hashes come from differences in compression or in the default file layout on each platform.

emmamendelsohn commented 8 months ago

Got identical results for the three, other than a difference in one space_saved value.

Mac

```
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823

############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)
```
Linux

```
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823

############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)
```
Windows

```
############ file meta data ############
created_by: parquet-cpp-arrow version 14.0.0
num_columns: 11
num_rows: 32
num_row_groups: 1
format_version: 2.6
serialized_size: 2823

############ Columns ############
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb

############ Column(mpg) ############
name: mpg
path: mpg
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 22%)

############ Column(cyl) ############
name: cyl
path: cyl
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(disp) ############
name: disp
path: disp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(hp) ############
name: hp
path: hp
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 20%)

############ Column(drat) ############
name: drat
path: drat
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 9%)

############ Column(wt) ############
name: wt
path: wt
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 12%)

############ Column(qsec) ############
name: qsec
path: qsec
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 13%)

############ Column(vs) ############
name: vs
path: vs
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(am) ############
name: am
path: am
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -4%)

############ Column(gear) ############
name: gear
path: gear
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 0%)

############ Column(carb) ############
name: carb
path: carb
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 8%)
```

amoeba commented 8 months ago

Thanks for the help here, @emmamendelsohn. Could you zip up all three Parquet files and attach them here?

amoeba commented 8 months ago

I managed to reproduce getting different checksums for files written using macOS and Linux and am attaching them here in case anyone wants to take a look: mtcars-parquet.zip. Both were written with arrow::write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed") using arrow R 14.0.0.2.

When I run parquet-tools inspect on each file with --detail, I get two differences in the output. The first is an unlabeled number that is either 262658 or 262914 depending on the file (a difference of 256, which is a bit conspicuous). The second is in the KeyValue metadata for the ARROW:schema key. I wonder if the two differences are related.

emmamendelsohn commented 8 months ago

Here are the three files for the compressed example (arrow::write_parquet(mtcars, "mtcars.parquet")). With --detail I see there are differences in file and page offsets.

snappy-mtcars-parquet.zip

pitrou commented 8 months ago

I am not surprised by differences in the compressed output: it depends on the exact version of the compression library (Snappy), which in turn depends on the platform and the Arrow version.
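As a rough illustration of that point, here is a minimal sketch using Python's standard-library `zlib` as a stand-in for Snappy (an assumption made purely so the example is self-contained; the payload is made up). Two compression levels play the role of two slightly different producers: the compressed bytes differ, but both decode to identical data.

```python
import hashlib
import zlib

# Stand-in payload; any byte string would do.
payload = b"mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb\n" * 50

# Two "producers": the same codec run with different settings.
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)

# The compressed bytes, and therefore their hashes, differ ...
bytes_differ = hashlib.sha256(fast).hexdigest() != hashlib.sha256(best).hexdigest()
# ... yet both streams decode to exactly the same data.
same_data = zlib.decompress(fast) == payload and zlib.decompress(best) == payload

print(bytes_differ, same_data)
```

The same reasoning would apply to a Parquet file whose pages were compressed by different Snappy builds: a file-level hash sees the encoding, not the decoded values.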

pitrou commented 8 months ago

Ok, the uncompressed difference is in the R-specific metadata that's stored with Arrow tables. Either @nealrichardson @jonkeane or @paleolimbot would probably be able to explain what it's about, and why it may vary from platform to platform.

pitrou commented 8 months ago

And, yeah, the format of the "r" metadata is very similar to the example shown in http://richfitz.github.io/redux/reference/object_to_string.html

Under PyArrow:

>>> import pyarrow.parquet as pq
>>> a = pq.read_table("/home/antoine/arrow/data/mtcars-linux-uncompressed.parquet")
>>> b = pq.read_table("/home/antoine/arrow/data/mtcars-macos-uncompressed.parquet")
>>> a.schema.metadata
{b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> b.schema.metadata
{b'r': b'A\n3\n262914\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> a.schema.metadata == b.schema.metadata
False
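To see where the two "r" values diverge, one could compare them field by field. The sketch below is a hypothetical helper (not part of PyArrow) that treats each value as newline-separated fields; the two inputs are only the leading fields of the values printed above, truncated for brevity.

```python
def differing_fields(a: bytes, b: bytes):
    """Return (index, field_a, field_b) for each newline-separated field that differs.

    Fields are compared pairwise; if one input has more fields than the
    other, the extra trailing fields are ignored.
    """
    fields_a = a.decode("ascii").split("\n")
    fields_b = b.decode("ascii").split("\n")
    return [
        (i, fa, fb)
        for i, (fa, fb) in enumerate(zip(fields_a, fields_b))
        if fa != fb
    ]


# Leading fields of the two "r" values shown above (truncated).
linux_r = b"A\n3\n262658\n197888\n5\nUTF-8\n531\n"
macos_r = b"A\n3\n262914\n197888\n5\nUTF-8\n531\n"

print(differing_fields(linux_r, macos_r))
```

With the tables loaded as above, the same helper could be applied to `a.schema.metadata[b"r"]` and `b.schema.metadata[b"r"]`.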
pitrou commented 8 months ago

By the way, 262658 is 0x40202 while 262914 is 0x40302, so this might very well be dependent on the R version you generated those files with (4.2.2 vs. 4.3.2?). Probably easy to verify.
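The hexadecimal reading can be checked with a few lines of Python. This is a minimal sketch assuming the number packs a version as major * 65536 + minor * 256 + patch, which is what the hex digits suggest; `decode_r_version` is a hypothetical helper name.

```python
def decode_r_version(packed: int):
    """Unpack a version stored as major * 65536 + minor * 256 + patch."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


for value in (262658, 262914):
    major, minor, patch = decode_r_version(value)
    print(f"{value} = {hex(value)} -> R {major}.{minor}.{patch}")
```

Under that assumption the two values decode to 4.2.2 and 4.3.2. The next field in the metadata, 197888, decodes the same way to 3.5.0; reading it as a minimum reader version is an assumption here, not something the thread confirms.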

emmamendelsohn commented 8 months ago

All files from my example with R 4.3.2.

pitrou commented 8 months ago

@emmamendelsohn Ah, I was talking about the uncompressed example from @amoeba. As I said above, differences in compressed files should not be a surprise. Do you still see differences if you generate uncompressed files?

emmamendelsohn commented 8 months ago

I see. Yes, for uncompressed files we found that Linux and Windows had the same hash, while macOS was different, all on R 4.3.2. Let me know if you'd like me to share those files.

pitrou commented 8 months ago

Thank you! Yes, you can share the Linux and macOS files for example.

(I suspect the final reason will be similar: slightly different R metadata serialized, for which I'll let R-Arrow experts answer :-))

amoeba commented 8 months ago

Thanks for looking at this @pitrou, the R version and metadata causing the issue makes sense. I'll look into what we're doing in that regard next.

emmamendelsohn commented 8 months ago

Actually, I was mistaken: all three systems have different hashes when uncompressed. This matches @amoeba's example above.

uncompressed-mtcars-parquet.zip

pitrou commented 8 months ago

Thanks @emmamendelsohn. After taking a quick look:

1) All three files differ only in the Parquet metadata, not the actual data.
2) Once deserialized, the Arrow schema is the same, except for the R metadata (depending on the R version, perhaps: it might have been 4.3.3 on Linux vs. 4.3.2 on Windows and Mac?).
3) Hence, most of the difference seems to be in the way the Arrow schema is serialized by Flatbuffers.

This is certainly harmless as long as the data is the same once deserialized.
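For anyone wanting to repeat the first check, here is a minimal sketch in Python that splits a Parquet file into its body and its footer metadata and hashes each part separately. It relies on the standard Parquet layout (a leading `PAR1` magic, then the data, then the footer, a 4-byte little-endian footer length, and a trailing `PAR1`). The helper names are hypothetical, and encrypted files, which use a different magic, are not handled.

```python
import hashlib
import struct

MAGIC = b"PAR1"


def split_parquet(raw: bytes):
    """Split raw Parquet bytes into (body, footer metadata)."""
    if len(raw) < 12 or raw[:4] != MAGIC or raw[-4:] != MAGIC:
        raise ValueError("not a plain (unencrypted) Parquet file")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", raw[-8:-4])
    footer_start = len(raw) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return raw[4:footer_start], raw[footer_start:-8]


def section_hashes(raw: bytes):
    """Return (body hash, footer hash) as hex digests."""
    body, footer = split_parquet(raw)
    return hashlib.sha256(body).hexdigest(), hashlib.sha256(footer).hexdigest()
```

Calling `section_hashes(open("mtcars.parquet", "rb").read())` on each uncompressed file should then give matching body hashes and differing footer hashes if the difference really is confined to the metadata.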

Is there a particular reason you were wondering about these files being different?

pitrou commented 8 months ago

This is an interesting flatbuffers commit message as we do have a similar piece of code. And binary inspection of the serialized Flatbuffers metadata seems to match this interpretation.

emmamendelsohn commented 8 months ago

@pitrou the different hashes became an issue for our team using a collaborative R targets workflow. In short, we use a shared S3 bucket for object storage so that each user can easily access the same versioned objects. This is especially useful for things like model objects that take a long time to produce. However, for large raw data files, we've found that the cost of transferring to/from AWS is too high, so each user saves the files locally as parquets. The targets version tracking system needs to register that these local files have the expected hash to be able to run downstream endpoints. When the file hashes differ across systems, targets detects a change and invalidates subsequent endpoints.

Anyway, we're rethinking some aspects of this approach, and so this may not be relevant in the future. Appreciate you looking into it nonetheless!

pitrou commented 8 months ago

Yes, I think you should probably reconsider, because it is not realistic to expect a sophisticated compression-based format like Parquet to always generate the same bitwise data using slightly different producers.
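One alternative, sketched here under stated assumptions rather than as a recommendation from the Arrow or targets maintainers, is to hash the decoded data instead of the file bytes. The hypothetical helper below takes a mapping of column names to value lists and hashes a canonical JSON rendering of it, so the result does not depend on compression or footer metadata. It assumes all values are JSON-serializable (numbers, strings, booleans, nulls).

```python
import hashlib
import json


def content_hash(columns: dict) -> str:
    """Hash column names and values, ignoring how a file encodes them."""
    # Compact separators keep the rendering canonical; column order is
    # preserved, so reordering columns counts as a change.
    canonical = json.dumps(columns, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two toy "reads" of the same data, built independently.
first = {"mpg": [21.0, 22.8], "cyl": [6.0, 4.0]}
second = {"mpg": [21.0, 22.8], "cyl": [6.0, 4.0]}
changed = {"mpg": [21.0, 22.9], "cyl": [6.0, 4.0]}

print(content_hash(first) == content_hash(second))
print(content_hash(first) == content_hash(changed))
```

With PyArrow, the input could come from `pq.read_table("mtcars.parquet").to_pydict()`. Note that this reads the whole file into memory, which may be too slow for the large raw files described above.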

emmamendelsohn commented 8 months ago

Makes sense!

noamross commented 8 months ago

Would @nealrichardson @jonkeane or @paleolimbot be able to explain the R-specific metadata that is generated, and maybe point to the code in the package where this occurs? From a quick inspection it looks like a summary of the data frame schema in R's ASCII serialization format.
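That reading can be explored with a small Python sketch that pulls the strings out of the serialized text. It assumes, based only on the dump shown earlier in the thread, that each string appears as three consecutive lines: the marker 262153, the string's length, and the text itself. `extract_strings` is a hypothetical helper, and the sample is an abridged copy of that dump (most column names omitted).

```python
def extract_strings(serialized: bytes, marker: str = "262153"):
    """Collect strings stored as marker / length / text line triples."""
    fields = serialized.decode("utf-8").split("\n")
    found = []
    i = 0
    while i + 2 < len(fields):
        length, text = fields[i + 1], fields[i + 2]
        is_string = (
            fields[i] == marker
            and length.isdigit()
            and len(text.encode("utf-8")) == int(length)
        )
        if is_string:
            found.append(text)
            i += 3  # skip past the whole triple
        else:
            i += 1
    return found


# Abridged copy of the "r" value from the thread (most columns omitted).
sample = (
    b"1026\n1\n262153\n5\nnames\n16\n11\n"
    b"262153\n3\nmpg\n262153\n3\ncyl\n"
    b"254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n"
)

print(extract_strings(sample))
```

On the full value, this would list the attribute name `names`, the eleven column names, and `columns`, which is consistent with the metadata being a record of the data frame's attributes.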

amoeba commented 8 months ago

@noamross it looks like we do that here https://github.com/apache/arrow/blob/9ca7d787402c715ee84c1bb21cfca0e54ae2f12d/r/R/metadata.R#L19-L33

(calling into serialize as you guessed)

nealrichardson commented 8 months ago

@noamross IIRC the purpose of this is so that object attributes, including R class names, are preserved so that you can round-trip the data to Parquet or Arrow files and get the same R types back. If you had a bare data.frame and only vanilla R vector types, I would expect the metadata to be empty.