Closed QuLogic closed 1 day ago
From the Parquest file format, it appears the file metadata length should always be little endian. With this patch:
diff --git a/cpp/src/parquet/file_reader.cc b/cpp/src/parquet/file_reader.cc
index 3e9eeea6c..7585afcc0 100644
--- a/cpp/src/parquet/file_reader.cc
+++ b/cpp/src/parquet/file_reader.cc
@@ -497,9 +497,10 @@ class SerializedFile : public ParquetFileReader::Contents {
"is not a parquet file.");
}
// Both encrypted/unencrypted footers have the same footer length check.
- uint32_t metadata_len = ::arrow::util::SafeLoadAs<uint32_t>(
- reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
- kFooterSize);
+ uint32_t metadata_len = ::arrow::bit_util::FromLittleEndian(
+ ::arrow::util::SafeLoadAs<uint32_t>(
+ reinterpret_cast<const uint8_t*>(footer_buffer->data()) + footer_read_size -
+ kFooterSize));
if (metadata_len > source_size_ - kFooterSize) {
throw ParquetInvalidOrCorruptedFileException(
"Parquet file size is ", source_size_,
diff --git a/cpp/src/parquet/file_writer.cc b/cpp/src/parquet/file_writer.cc
index baa9e00da..695347d8c 100644
--- a/cpp/src/parquet/file_writer.cc
+++ b/cpp/src/parquet/file_writer.cc
@@ -539,6 +539,7 @@ void WriteFileMetaData(const FileMetaData& file_metadata, ArrowOutputStream* sin
metadata_len = static_cast<uint32_t>(position) - metadata_len;
// Write Footer
+ metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
}
@@ -562,6 +563,7 @@ void WriteEncryptedFileMetadata(const FileMetaData& file_metadata,
PARQUET_ASSIGN_OR_THROW(position, sink->Tell());
metadata_len = static_cast<uint32_t>(position) - metadata_len;
+ metadata_len = ::arrow::bit_util::ToLittleEndian(metadata_len);
PARQUET_THROW_NOT_OK(sink->Write(reinterpret_cast<uint8_t*>(&metadata_len), 4));
PARQUET_THROW_NOT_OK(sink->Write(kParquetMagic, 4));
}
I can fix most of these occurrences, except for tests for encryption, and I don't know where I've missed for that.
Good catch!
Issue resolved by pull request 44787 https://github.com/apache/arrow/pull/44787
Describe the bug, including details regarding any error messages, version, and platform.
On a big-endian machine, several tests fail like this:
This occurs for several tests, which I did not copy out, but they are all similar. This is pretty indicative of incorrect endianness, as byteswapping 1862270976 produces 111, which is smaller than the file size of 192 bytes.
Component(s)
Parquet