emmamendelsohn opened this issue 8 months ago
Could you try using parquet-tools or the Parquet CLI to inspect the different files and see if there are any differences? (If you can, posting the output here for each would be helpful.)
I suspect there are differences due to compression, or to differences in default layouts, that would cause files like these to hash differently.
Got identical results for the three, other than a difference in the space-saved value.
Thanks for the help here, @emmamendelsohn. Could you zip up all three Parquet files and attach them here?
I managed to reproduce getting different checksums for files written on macOS and Linux and am attaching them here in case anyone wants to take a look: mtcars-parquet.zip. Both were written with arrow::write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed") using arrow R 14.0.0.2.
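For anyone else who wants to check locally, a minimal sketch of the write plus a file-level hash (using base R's tools::md5sum; the file name is arbitrary):

```
library(arrow)

# Write mtcars without compression so that compression-library differences
# (Snappy version, etc.) are ruled out
write_parquet(mtcars, "mtcars.parquet", compression = "uncompressed")

# File-level MD5 hash of the result; this is what differs across platforms here
tools::md5sum("mtcars.parquet")
```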
When I run parquet-tools inspect on each file with --detail, I get two differences in the output. The first is an unlabeled number that is either 262658 or 262914 (a difference of 256, which is a bit conspicuous) depending on the file, and the second is in the KeyValue metadata for the ARROW:schema key. I wonder if the two differences are related.
Here are the three files for the compressed example (arrow::write_parquet(mtcars, "mtcars.parquet")). With --detail I see there are differences in file and page offsets.
I am not surprised by differences in compressed output: they depend on the exact version of the compression library (Snappy), which in turn depends on the platform and the Arrow version.
Ok, the uncompressed difference is in the R-specific metadata that's stored with Arrow tables. @nealrichardson, @jonkeane, or @paleolimbot would probably be able to explain what it's about, and why it may vary from platform to platform.
And, yeah, the format of the "r" metadata is very similar to the example shown in http://richfitz.github.io/redux/reference/object_to_string.html
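For illustration, base R's serialize() with ascii = TRUE produces the same kind of output (a small sketch; the exact header numbers depend on the R version that runs it):

```
# The "A\n3\n<writer R version>\n<min reader version>\n..." header matches what
# shows up under the "r" key in the Parquet metadata
cat(rawToChar(serialize(list(names = c("mpg", "cyl")), connection = NULL, ascii = TRUE)))
```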
Under PyArrow:

```
>>> import pyarrow.parquet as pq
>>> a = pq.read_table("/home/antoine/arrow/data/mtcars-linux-uncompressed.parquet")
>>> b = pq.read_table("/home/antoine/arrow/data/mtcars-macos-uncompressed.parquet")
>>> a.schema.metadata
{b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> b.schema.metadata
{b'r': b'A\n3\n262914\n197888\n5\nUTF-8\n531\n1\n531\n11\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n254\n1026\n1\n262153\n5\nnames\n16\n11\n262153\n3\nmpg\n262153\n3\ncyl\n262153\n4\ndisp\n262153\n2\nhp\n262153\n4\ndrat\n262153\n2\nwt\n262153\n4\nqsec\n262153\n2\nvs\n262153\n2\nam\n262153\n4\ngear\n262153\n4\ncarb\n254\n1026\n511\n16\n1\n262153\n7\ncolumns\n254\n'}
>>> a.schema.metadata == b.schema.metadata
False
```
By the way, 262658 is 0x40202 while 262914 is 0x40302, so this might very well depend on the R version you generated those files with (4.2.2 vs. 4.3.2?). Probably easy to verify.
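Concretely, R packs its version number as 65536 * major + 256 * minor + patch, so the two values decode like this (decode_r_version is just an illustrative helper):

```
decode_r_version <- function(x) {
  sprintf("%d.%d.%d", x %/% 65536, (x %/% 256) %% 256, x %% 256)
}

decode_r_version(262658)  # "4.2.2"
decode_r_version(262914)  # "4.3.2"
```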
All files from my example with R 4.3.2.
@emmamendelsohn Ah, I was talking about the uncompressed example from @amoeba. As I said above, differences in compressed files should not be a surprise. Do you still see differences if you generate uncompressed files?
I see. Yes, for uncompressed files we found that Linux and Windows had the same hash while macOS was different, all on 4.3.2. Let me know if you'd like me to share those files.
Thank you! Yes, you can share the Linux and macOS files for example.
(I suspect the final reason will be similar: slightly different R metadata serialized, for which I'll let R-Arrow experts answer :-))
Thanks for looking at this @pitrou, the R version and metadata causing the issue makes sense. I'll look into what we're doing in that regard next.
Actually, I was mistaken: all three systems have different hashes when uncompressed. This matches @amoeba's example above. uncompressed-mtcars-parquet.zip
Thanks @emmamendelsohn. After taking a quick look:
1. All three files differ only in the Parquet metadata, not the actual data.
2. Once deserialized, the Arrow schema is the same except for the R metadata (perhaps depending on the R version: it might have been 4.3.3 on Linux vs. 4.3.2 on Windows and Mac?).
3. Hence, most of the difference seems to be in how the Arrow schema is serialized by Flatbuffers.

This is certainly harmless as long as the data is the same once deserialized.
Is there a particular reason you were wondering about these files being different?
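A quick way to confirm from R that only the metadata differs (a minimal sketch using the attached file names):

```
library(arrow)

a <- read_parquet("mtcars-linux-uncompressed.parquet")
b <- read_parquet("mtcars-macos-uncompressed.parquet")

# The deserialized data are the same...
identical(as.data.frame(a), as.data.frame(b))  # expected TRUE

# ...even though the files themselves hash differently
tools::md5sum(c("mtcars-linux-uncompressed.parquet",
                "mtcars-macos-uncompressed.parquet"))
```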
This is an interesting flatbuffers commit message as we do have a similar piece of code. And binary inspection of the serialized Flatbuffers metadata seems to match this interpretation.
@pitrou the different hashes became an issue for our team in a collaborative R targets workflow. In short, we use a shared S3 bucket for object storage so that each user can easily access the same versioned objects. This is especially useful for things like model objects that take a long time to produce. However, for large raw data files we've found that the cost of transferring to/from AWS is too high, so each user saves the files locally as Parquet. The targets version-tracking system needs to register that these local files have the expected hash before it can run downstream endpoints. When the file hashes differ across systems, targets detects a change and invalidates subsequent endpoints.
Anyway, we're rethinking some aspects of this approach, and so this may not be relevant in the future. Appreciate you looking into it nonetheless!
Yes, I think you should probably reconsider, because it is not realistic to expect a sophisticated, compression-based format like Parquet to always produce bitwise-identical files from slightly different producers.
Makes sense!
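As an aside, one way to sidestep bitwise file differences in a hash-based workflow is to hash the deserialized data rather than the raw file bytes. A minimal sketch, assuming the digest package is available (data_hash is just an illustrative helper, not anything targets provides):

```
library(arrow)

data_hash <- function(path) {
  # Hash the data itself, so differences confined to file metadata or
  # compression details do not change the result
  digest::digest(as.data.frame(read_parquet(path)))
}

data_hash("mtcars-linux-uncompressed.parquet") ==
  data_hash("mtcars-macos-uncompressed.parquet")
```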
Would @nealrichardson, @jonkeane, or @paleolimbot be able to explain the R-specific metadata that's generated, and maybe point to the code in the package where this occurs? From a quick inspection it looks like a summary of the data frame schema in R's ASCII serialization format.
@noamross it looks like we do that here: https://github.com/apache/arrow/blob/9ca7d787402c715ee84c1bb21cfca0e54ae2f12d/r/R/metadata.R#L19-L33 (calling into serialize, as you guessed).
@noamross IIRC the purpose of this is to preserve object attributes, including R class names, so that you can round-trip the data to Parquet or Arrow files and get the same R types back. If you had a bare data.frame with only vanilla R vector types, I would expect the metadata to be empty.
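For example, a minimal sketch of that round trip (the "units" attribute is arbitrary, just something plain Parquet has no place for):

```
library(arrow)

df <- data.frame(x = 1:3)
attr(df$x, "units") <- "kg"   # a custom attribute on a column

write_parquet(df, "with-attrs.parquet")

tbl <- read_parquet("with-attrs.parquet", as_data_frame = FALSE)
names(tbl$schema$metadata)    # should include an "r" key carrying the attributes

attributes(read_parquet("with-attrs.parquet")$x)  # "units" restored on read
```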
Describe the bug, including details regarding any error messages, version, and platform.
Moving this over from the rOpenSci Slack. Our team has Mac, Linux, and Windows users, and we have found that we get three different hashes when saving Parquet files.
Mac "05be83226acb5d2a673d922ff9f69414" Linux "8bddf47bdbede54d87ec3c4cbec280da" Windows "bef251d299843f07348248416572edab"
When uncompressed, we get the same hashes for Linux and Windows, different for Mac.
Mac "58ec2e7a6d614db15fc2123455a83a7e" Linux "4f3f049ffebdb395c489864e90d5e36b" Windows "4f3f049ffebdb395c489864e90d5e36b"
arrow_info() for our three systems:

Mac

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator mimalloc
Current   0 bytes
Max       50.62 Kb

Runtime:

SIMD Level          none
Detected SIMD Level none

Build:

C++ Library Version  14.0.0
C++ Compiler         AppleClang
C++ Compiler Version 15.0.0.15000040
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
```

Linux

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:

Allocator jemalloc
Current   0 bytes
Max       0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  14.0.0
C++ Compiler         GNU
C++ Compiler Version 11.4.0
```

Windows

```
Arrow package version: 14.0.0.2

Capabilities:

acero      TRUE
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc  FALSE
mimalloc   TRUE

Arrow options():

arrow.use_threads FALSE

Memory:

Allocator mimalloc
Current   0 bytes
Max       0 bytes

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  14.0.0
C++ Compiler         GNU
C++ Compiler Version 10.3.0
Git ID               2dcee3f82c6cf54b53a64729fd81840efa583244
```

Component(s)
Parquet, R