apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.46k stars 3.52k forks source link

[Rust] [Parquet] Panic when reading Parquet file produced with parquet-cpp #26519

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

See attached Parquet file that was created with parquet-cpp.

The file metadata is:

 

creator: parquet-cpp version 1.5.1-SNAPSHOT

file schema: schema

sys_isSystemRelocated: OPTIONAL INT64 R:0 D:1 __sys_schemaId: OPTIONAL INT64 R:0 D:1 sys_invOffsetLSID: OPTIONAL INT64 R:0 D:1 sys_invOffsetGroupIdx: OPTIONAL INT64 R:0 D:1 sys_invOffsetRecordIdx: OPTIONAL INT64 R:0 D:1 _rid: OPTIONAL BINARY L:STRING R:0 D:1 sys_sequenceNumber: OPTIONAL INT64 R:0 D:1 sys_recordIndex: OPTIONAL INT64 R:0 D:1 __sys_isTombstone: OPTIONAL INT64 R:0 D:1 _ts: OPTIONAL INT64 R:0 D:1 partitionKey: OPTIONAL BINARY L:STRING R:0 D:1 entityType: OPTIONAL BINARY L:STRING R:0 D:1 ttl: OPTIONAL INT64 R:0 D:1 tripId: OPTIONAL INT32 R:0 D:1 vin: OPTIONAL BINARY L:STRING R:0 D:1 state: OPTIONAL BINARY L:STRING R:0 D:1 region: OPTIONAL INT32 R:0 D:1 outsideTemperature: OPTIONAL INT64 R:0 D:1 engineTemperature: OPTIONAL INT64 R:0 D:1 speed: OPTIONAL INT64 R:0 D:1 fuel: OPTIONAL INT64 R:0 D:1 fuelRate: OPTIONAL DOUBLE R:0 D:1 engineoil: OPTIONAL INT64 R:0 D:1 tirepressure: OPTIONAL INT64 R:0 D:1 odometer: OPTIONAL DOUBLE R:0 D:1 accelerator_pedal_position: OPTIONAL INT64 R:0 D:1 parking_brake_status: OPTIONAL BOOLEAN R:0 D:1 brake_pedal_status: OPTIONAL BOOLEAN R:0 D:1 headlamp_status: OPTIONAL BOOLEAN R:0 D:1 transmission_gear_position: OPTIONAL INT64 R:0 D:1 ignition_status: OPTIONAL BOOLEAN R:0 D:1 windshield_wiper_status: OPTIONAL BOOLEAN R:0 D:1 abs: OPTIONAL BOOLEAN R:0 D:1 refrigerationUnitKw: OPTIONAL DOUBLE R:0 D:1 refrigerationUnitTemp: OPTIONAL DOUBLE R:0 D:1 timestamp: OPTIONAL BINARY L:STRING R:0 D:1 id: OPTIONAL BINARY L:STRING R:0 D:1 _etag: OPTIONAL BINARY L:STRING R:0 D:1 __sys_value: OPTIONAL BINARY L:STRING R:0 D:1

row group 1: RC:27150 TS:2481123 OFFSET:4

sys_isSystemRelocated: INT64 SNAPPY DO:4 FPO:28 SZ:102/98/0.96 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 0, num_nulls: 0] __sys_schemaId: INT64 SNAPPY DO:205 FPO:220 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined] sys_invOffsetLSID: INT64 SNAPPY DO:308 FPO:323 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined] sys_invOffsetGroupIdx: INT64 SNAPPY DO:416 FPO:431 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined] sys_invOffsetRecordIdx: INT64 SNAPPY DO:528 FPO:543 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined] _rid: BINARY SNAPPY DO:641 FPO:137000 SZ:187417/811272/4.33 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: o9dcAMA1y14+AAAAAAAABA==, max: o9dcAMA1y17zaQAAAAAABA==, num_nulls: 0] sys_sequenceNumber: INT64 SNAPPY DO:188156 FPO:296856 SZ:159746/268260/1.68 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 3, max: 27152, num_nulls: 0] sys_recordIndex: INT64 SNAPPY DO:348005 FPO:456699 SZ:159740/268260/1.68 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 27149, num_nulls: 0] __sys_isTombstone: INT64 SNAPPY DO:507845 FPO:507860 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined] _ts: INT64 SNAPPY DO:507954 FPO:510167 SZ:3974/6137/1.54 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 1597365315, max: 1597365859, num_nulls: 0] partitionKey: BINARY SNAPPY DO:512012 FPO:512256 SZ:13967/14026/1.00 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0A4SMSAGR5CA4LAY6-2020-08, max: YKO1Q8RX7Z20BVBG0-2020-08, num_nulls: 0] entityType: BINARY SNAPPY DO:526088 FPO:526124 SZ:110/106/0.96 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: VehicleTelemetry, max: VehicleTelemetry, num_nulls: 0] ttl: INT64 SNAPPY DO:526285 FPO:526309 SZ:102/98/0.96 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 5184000, max: 5184000, num_nulls: 0] tripId: INT32 SNAPPY DO:526471 FPO:526491 SZ:56/52/0.93 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[no stats for this column] vin: BINARY SNAPPY DO:526568 FPO:526787 SZ:13926/13930/1.00 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0A4SMSAGR5CA4LAY6, max: YKO1Q8RX7Z20BVBG0, num_nulls: 0] state: BINARY SNAPPY DO:540578 FPO:540647 SZ:13746/13748/1.00 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: AL, max: WI, num_nulls: 0] region: INT32 SNAPPY DO:554380 FPO:554400 SZ:56/52/0.93 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[no stats for this column] outsideTemperature: INT64 SNAPPY DO:554477 FPO:554544 SZ:13776/13801/1.00 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 2, max: 100, num_nulls: 0] engineTemperature: INT64 SNAPPY DO:568354 FPO:570383 SZ:32711/34701/1.06 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 500, num_nulls: 0] speed: INT64 SNAPPY DO:601165 FPO:601597 SZ:24326/24713/1.02 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 100, num_nulls: 0] fuel: INT64 SNAPPY DO:625579 FPO:625766 SZ:20687/20830/1.01 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 39, num_nulls: 0] fuelRate: DOUBLE SNAPPY DO:646353 FPO:648774 SZ:36497/38895/1.07 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 8.0, max: 14.0, num_nulls: 0] engineoil: INT64 SNAPPY DO:682941 FPO:683172 SZ:20731/20918/1.01 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 50, num_nulls: 0] tirepressure: INT64 SNAPPY DO:703764 FPO:703995 SZ:20731/20918/1.01 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 50, num_nulls: 0] odometer: DOUBLE SNAPPY DO:724590 FPO:762499 SZ:85561/121114/1.42 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 36306.0, max: 209363.94, num_nulls: 0] accelerator_pedal_position: INT64 SNAPPY DO:810242 FPO:810670 SZ:24322/24705/1.02 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 99, num_nulls: 0] parking_brake_status: BOOLEAN SNAPPY DO:834673 FPO:834673 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] brake_pedal_status: BOOLEAN SNAPPY DO:838189 FPO:838189 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] headlamp_status: BOOLEAN SNAPPY DO:841703 FPO:841703 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] transmission_gear_position: INT64 SNAPPY DO:845214 FPO:845268 SZ:10371/10382/1.00 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 7, num_nulls: 0] ignition_status: BOOLEAN SNAPPY DO:855694 FPO:855694 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] windshield_wiper_status: BOOLEAN SNAPPY DO:859205 FPO:859205 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] abs: BOOLEAN SNAPPY DO:862724 FPO:862724 SZ:3444/3439/1.00 VC:27150 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0] refrigerationUnitKw: DOUBLE SNAPPY DO:866223 FPO:870944 SZ:42191/46450/1.10 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 13.07, max: 73.69, num_nulls: 0] refrigerationUnitTemp: DOUBLE SNAPPY DO:908516 FPO:912705 SZ:38265/42143/1.10 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 15.76, max: 58.86, num_nulls: 0] timestamp: BINARY SNAPPY DO:946885 FPO:1206643 SZ:310824/916935/2.95 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 2020-08-14T00:35:14.7359902Z, max: 2020-08-14T00:44:19.8774902Z, num_nulls: 0] id: BINARY SNAPPY DO:1257823 FPO:2203836 SZ:1029368/1135450/1.10 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: 00015f80-038f-48a3-b95f-74af977b1c50, max: fffffc48-18e2-43e9-ab51-2f166 7286ef2, num_nulls: 0] _etag: BINARY SNAPPY DO:2287317 FPO:2425082 SZ:196850/1187415/6.03 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[min: "380100df-0000-0700-0000-5f35dc440000", max: "3901ffc6-0000-0700-0000-5f3 5de5b0000", num_nulls: 0] __sys_value: BINARY SNAPPY DO:2484300 FPO:2484315 SZ:51/48/0.94 VC:27150 ENC:PLAIN,PLAIN_DICTIONARY,RLE ST:[num_nulls: 27150, min/max not defined]

 

I'm using the following Rust code for reading the Parquet file metadata:

https://github.com/spektom/parquet-rs-test/blob/master/src/main.rs

 

Here's the exception I'm getting:

 

thread 'main' panicked at 'Error when parsing Parquet file: Parquet error: Could not parse metadata: bad data', src\main.rs:20:19 stack backtrace: 0: backtrace::backtrace::trace_unsynchronized at C:\Users\VssAdministrator.cargo\registry\src\github.com-1ecc6299db9ec823\backtrace-0.3.46\src\backtrace\mod.rs:66 1: std::sys_common::backtrace::_print_fmt at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\sys_common\backtrace.rs:78 2: std::sys_common::backtrace::_print::impl::fmt at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\sys_common\backtrace.rs:59 3: core::fmt::write at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libcore\fmt\mod.rs:1069 4: std::io::Write::write_fmt at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\io\mod.rs:1532 5: std::sys_common::backtrace::_print at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\sys_common\backtrace.rs:62 6: std::sys_common::backtrace::print at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\sys_common\backtrace.rs:49 7: std::panicking::default_hook::closure at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:198 8: std::panicking::default_hook at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:218 9: std::panicking::rust_panic_with_hook at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:477 10: std::panicking::begin_panic_handler at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:385 11: std::panicking::begin_panic_fmt at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:339 12: parquet_rs_test::main at .\src\main.rs:20 13: std::rt::lang_start::closure<()> at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\src\libstd\rt.rs:67 14: std::rt::lang_start_internal::closure at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\rt.rs:52 15: std::panicking::try::do_call at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:297 16: std::panicking::try at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panicking.rs:274 17: std::panic::catch_unwind at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\panic.rs:394 18: std::rt::lang_start_internal at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\/src\libstd\rt.rs:51 19: std::rt::lang_start<()> at /rustc/65b448273dd280401cd440a6740a7cd891525ba3\src\libstd\rt.rs:67 20: main 21: invoke_main at D:\agent_work\9\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78 22: __scrt_common_main_seh at D:\agent_work\9\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288 23: BaseThreadInitThunk 24: RtlUserThreadStart note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

Environment: Windows 10 x86_64 Cargo nightly Reporter: Michael Spector

Related issues:

Note: This issue was originally created as ARROW-10553. Please see the migration documentation for further details.

asfimport commented 3 years ago

Rémi Dettai / @rdettai: Hi Michael! Thanks for reporting this. Have you tried writing a file with only a subset of the columns to identify where the problem might be? When printing the schema with pyarrow, I see that some logical types are not recognized, could something be going wrong there?


  optional int64 field_id=13 ttl;
  optional int32 field_id=14 tripId (UNKNOWN);
  optional binary field_id=15 vin (String);
  optional binary field_id=16 state (String);
  optional int32 field_id=17 region (UNKNOWN);
  optional int64 field_id=18 outsideTemperature;
  optional int64 field_id=19 engineTemperature;
asfimport commented 3 years ago

Michael Spector: Hi Remi,

I don't have an access to the code that produced this Parquet file, I am working on the consuming part, which is Rust based.

Can you tell what doss it mean if Pyarrow is unable to recognize some logical types? Is it that Pyarrow is state of the art Parquet reader, and if it's unable to see some logical field types then they are wrong?

Also, inability to recognize logical types shouldn't crash Parquet readers, and physical types should be used instead in such cases. Is my understanding correct?

asfimport commented 3 years ago

Jörn Horstmann / @jhorstmann: I added some debug output in the code and got some more details about the error:


ProtocolError {
    kind: InvalidData,
    message: "cannot convert enum constant 24 to ConvertedType",
}

The corresponding field name seems to be "tripId".

Seems to correspond to this type in the thrift definitions: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L48 The enum values in that file only go to 21

asfimport commented 3 years ago

Michael Spector: Thanks Jörn! Shouldn't inability to recognize logical type (either because it's custom logical type or because the Parquet reader uses older Thrift definition) fallback to using physical types?

asfimport commented 3 years ago

Rémi Dettai / @rdettai:

Is it that Pyarrow is state of the art Parquet reader, and if it's unable to see some logical field types then they are wrong? pyarrow uses parquet-cpp, so I was just checking that it could at least read itself :) Shouldn't inability to recognize logical type fallback to using physical types In my opinion it would be good for forward compatibility yes, but I am not sure that custom logical types are very legal in parquet (I don't see any mention of customization in [1])

We are back to the issue that the code for the thrift definition is in an external crate [2], so modifying this is a bit painful.

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md [2] https://issues.apache.org/jira/browse/ARROW-6256

asfimport commented 3 years ago

Rémi Dettai / @rdettai: I looked a little bit into thrift and the c++ vs the rust generator. The thing is that the Rust implem decided that:

A Thrift enum is represented as a Rust enum, and each variant is transcribed 1:1.  That leaves little room for the serialization of unknown value. In C++, enums are more loosely constrained, and I guess that this is how pyarrow managed to parse your file.

Either I have missed out something, or the Rust implem of thrift that we are using is not forward compatible... I asked the question on the thrift Jira (https://issues.apache.org/jira/browse/THRIFT-5314), we'll see what they say!

asfimport commented 3 years ago

Rémi Dettai / @rdettai: [~spektom] It is confirmed that forward compability is an issue with the Rust implem of Thrift. I guess that until they sort out this problem, it is going to be hard to provide a fix that would make your usecase work (again, not sure custom logical types are really legal...).

Following steps on arrows side (no immediate action):

asfimport commented 3 years ago

Neville Dipale / @nevi-me: Part of the problem here seems to be that the C++ implementation's ConvertedType enum has additional values that are not in the spec (26 values vs 21). I'm not sure when parquet-cpp 1.5.1 was released, or if this issue was resolved. It looks like a bug to me.

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: Yes I agree there is a bug in the C++ implementation.  The extra types in the enum are meant to be sentinels, but it seems like they are leaking into writing.

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: Specifically ConvertedType::NA is somehow getting written out (there is a conversion that subtracts one).

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: Opened https://issues.apache.org/jira/browse/PARQUET-1990 for C++.

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: At the moment I'm not immediately seeing how a file like this could be generated other than a potential (not very well documented) misuse of the API or with very old code arrow cod (https://github.com/apache/arrow/commit/d54425de19b7dbb2764a40355d76d1c785cf64ec#diff-96f0dce6d1156ca4fc908d699f4600e5811dcf482e9601695dc32d8910709c58 makes a change which should have prevented this).

 

Is this from a third-party service of some sort?  Or was it generated with 1+ year old version of Arrow?

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: @rdettai  in regards to invalid enums, the C++ implementation has put some effort into not producing UBSan when enums are bad.  See https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/cpp/src/parquet/thrift_internal.h#L158 however in this case C++ sees it as a enum that it converted.  Is rust using the fuzzing files (or would it like to be added to fuzz tests in general) to ensure panics don't occur?

asfimport commented 3 years ago

Neville Dipale / @nevi-me: Thanks for looking @emkornfield. The version is at worst from Sep 2018 (https://github.com/apache/parquet-cpp/commit/642da055adf009652689b20e68a198cffb857651) as that's when 1.5.1-SNAPSHOT was bumped. So, before parquet-cpp was moved into arrow.

It doesn't seem likely that the ConvertedType would ever be increased or modified, as it's considered legacy. So, would it not be better to treat it as 21 values, without the sentinels? I'm of course not familiar enough with C++ and how enums work, so I could be wrong on whether such treatment could work.

asfimport commented 3 years ago

Neville Dipale / @nevi-me: @emkornfield The Parquet version hasn't been updated from 1.5.1-SNAPSHOT. A pyarrow==3.0.0 file still shows the old version metadata. So, the file could have been written anytime. https://issues.apache.org/jira/browse/ARROW-7830

asfimport commented 3 years ago

Micah Kornfield / @emkornfield:

So, would it not be better to treat it as 21 values, without the sentinels? I'm of course not familiar enough with C++ and how enums work, so I could be wrong on whether such treatment could work.

 

[~nevi_me] so the sentinel in question at one point was used for distinguishing null column types (this isn't compliant with the parquet standard).  As I noted this isn't written out any more as far as I can tell and might not be necessary.   the Undefined sentinel is specifically to avoid undefined behavior when trying to load illegal enum values.  So other then adding additional guards in C++ there isn't much that can be done.  To try to be clearer I think there are potentially 3 issues:

1.  Ensure C++ never writes out NA.

2.  How rust should handle invalid files (i interpret panics as crashes but if they are just exceptions then maybe more work doesn't have to be done).

3.  How to handle compatibility with files that did receive NA (I opened PARQUET-1991 for this).

asfimport commented 3 years ago

Neville Dipale / @nevi-me: I'm closing this as the core issue is that we're trying to read a converted type that doesn't exist in the format. There's some compatibility concerns as mentioned in the comments, but those will depend on THRIFT-5314.

PARQUET-1990 and PARQUET-1991 will address the cause of the problem in parquet-cpp.