apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.63k stars 3.56k forks source link

[C++][Parquet] Footer parsing is broken on big-endian machines #44769

Closed QuLogic closed 1 day ago

QuLogic commented 3 days ago

Describe the bug, including details regarding any error messages, version, and platform.

On a big-endian machine, several tests fail like this:

[----------] 2 tests from TestBooleanRLE
[ RUN      ] TestBooleanRLE.TestBooleanScanner
unknown file: Failure
C++ exception with description "Invalid: Parquet file size is 192 bytes, smaller than the size reported by footer's (1862270976bytes)" thrown in SetUp().

[  FAILED  ] TestBooleanRLE.TestBooleanScanner (0 ms)
[ RUN      ] TestBooleanRLE.TestBatchRead
unknown file: Failure
C++ exception with description "Invalid: Parquet file size is 192 bytes, smaller than the size reported by footer's (1862270976bytes)" thrown in SetUp().

[  FAILED  ] TestBooleanRLE.TestBatchRead (0 ms)
[----------] 2 tests from TestBooleanRLE (1 ms total)

This occurs for several tests, which I did not copy out, but they are all similar. This is pretty indicative of incorrect endianness, as byteswapping 1862270976 produces 111, which is smaller than the file size of 192 bytes.

Component(s)

Parquet

QuLogic commented 3 days ago

From the Parquest file format, it appears the file metadata length should always be little endian. With this patch:

diff --git a/cpp/src/parquet/file_reader.cc b/cpp/src/parquet/file_reader.cc
index 3e9eeea6c..7585afcc0 100644
--- a/cpp/src/parquet/file_reader.cc
+++ b/cpp/src/parquet/file_reader.cc
@@ -497,9 +497,10 @@ class SerializedFile : public ParquetFileReader::Contents {
           "is not a parquet file.");
     }
     // Both encrypted/unencrypted footers have the same footer length check.
-    uint32_t metadata_len = ::arrow::util::SafeLoadAs<uint32_t>(
-        reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
-        kFooterSize);
+    uint32_t metadata_len = ::arrow::bit_util::FromLittleEndian(
+        ::arrow::util::SafeLoadAs<uint32_t>(
+            reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
+            kFooterSize));
     if (metadata_len > source_size_ - kFooterSize) {
       throw ParquetInvalidOrCorruptedFileException(
           "Parquet file size is ", source_size_,
diff --git a/cpp/src/parquet/file_writer.cc b/cpp/src/parquet/file_writer.cc
index baa9e00da..695347d8c 100644
--- a/cpp/src/parquet/file_writer.cc
+++ b/cpp/src/parquet/file_writer.cc
@@ -539,6 +539,7 @@ void WriteFileMetaData(const FileMetaData& file_metadata, ArrowOutputStream* sin
   metadata_len = static_cast<uint32_t>(position) - metadata_len;

   // Write Footer
+  metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
   PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
   PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
 }
@@ -562,6 +563,7 @@ void WriteEncryptedFileMetadata(const FileMetaData& file_metadata,
     PARQUET_ASSIGN_OR_THROW(position, sink->Tell());
     metadata_len = static_cast<uint32_t>(position) - metadata_len;

+    metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
     PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
     PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
   }

I can fix most of these occurrences, except for tests for encryption, and I don't know where I've missed for that.

mapleFU commented 2 days ago

Good catch!

mapleFU commented 1 day ago

Issue resolved by pull request 44787 https://github.com/apache/arrow/pull/44787