Add Parquet -> JSON Converter

KIwabuchi commented 1 year ago

This PR adds a function (read_parquet_as_json) that converts Parquet data to a Boost.JSON object.
This PR also modifies the Parquet parser (arrow_parquet_parser) to return Parquet physical types rather than Parquet logical types.
- Physical type vs logical type: https://www.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html
- I minimized the API change — I didn't have to make any changes to .cpp files that use the parser.
The current converter implementation does not support the FIXED_LEN_BYTE_ARRAY type because it requires fundamental changes in our parquet reader.
- Apparently, we need to specify the length at compile-time to read FIXED_LEN_BYTE_ARRAY data from parquet::StreamReader.
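
To illustrate the compile-time length constraint mentioned above, here is a minimal sketch. It is not the ygm or Arrow code: `read_fixed` is a hypothetical stand-in for a reader whose buffer size is a template parameter, and `read_fixed_runtime` is a hypothetical bridge from the length reported by a schema at run time to a small, pre-chosen set of compile-time lengths.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in for a reader whose output buffer size is a
// template parameter. This is NOT the parquet::StreamReader API.
template <std::size_t N>
std::string read_fixed(const char* src) {
  std::array<char, N> buf{};
  for (std::size_t i = 0; i < N; ++i) buf[i] = src[i];
  return std::string(buf.data(), N);
}

// Bridge a length known only at run time to the compile-time template
// by enumerating the lengths we choose to support (an assumption).
std::optional<std::string> read_fixed_runtime(const char* src,
                                              std::size_t len) {
  switch (len) {
    case 1: return read_fixed<1>(src);
    case 2: return read_fixed<2>(src);
    case 4: return read_fixed<4>(src);
    case 8: return read_fixed<8>(src);
    default: return std::nullopt;  // unsupported length
  }
}
```

With this shape, a 4-byte column such as Country_char[4] would go through `read_fixed<4>`, while any length outside the enumerated set is reported as unsupported rather than read incorrectly.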

Questions

What is the oldest Arrow version we need to support?
- So far, I have tested with only the latest version (v13).
The parser class and converter function are designed only for Parquet plain encoding (primitive types) and a simple column structure (no nested or hierarchical columns). Do we need to support more complex Parquet files soon?
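
For a flat row of primitive values, the conversion can be pictured with the sketch below. The actual read_parquet_as_json targets Boost.JSON; this version uses only the standard library and builds a JSON string so that it stays self-contained. The names `cell`, `row`, `quote`, and `row_to_json` are hypothetical and do not come from ygm.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Value kinds loosely mirroring Parquet physical types
// (BOOLEAN, INT64, DOUBLE, BYTE_ARRAY); the aliases are illustrative.
using cell = std::variant<bool, std::int64_t, double, std::string>;
using row = std::vector<std::pair<std::string, cell>>;

// Quote a string, escaping only '"' and '\\' (enough for this sketch).
std::string quote(const std::string& s) {
  std::string out = "\"";
  for (char c : s) {
    if (c == '"' || c == '\\') out += '\\';
    out += c;
  }
  return out + "\"";
}

// Serialize one flat row as a JSON object, one key per column.
std::string row_to_json(const row& r) {
  std::ostringstream os;
  os << '{';
  for (std::size_t i = 0; i < r.size(); ++i) {
    if (i > 0) os << ',';
    os << quote(r[i].first) << ':';
    std::visit(
        [&os](const auto& v) {
          using T = std::decay_t<decltype(v)>;
          if constexpr (std::is_same_v<T, bool>) {
            os << (v ? "true" : "false");
          } else if constexpr (std::is_same_v<T, std::string>) {
            os << quote(v);
          } else {
            os << v;  // integers and doubles print as JSON numbers
          }
        },
        r[i].second);
  }
  os << '}';
  return os.str();
}
```

String cells should be passed as `std::string` rather than as string literals, so that the variant does not pick a different alternative. Nested or hierarchical columns would need a recursive value type, which this sketch deliberately leaves out.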

KIwabuchi commented 1 year ago

FYI, I opened this PR so that we can discuss the Parquet reader. I don't expect it to be merged soon :)

steiltre commented 1 year ago

Thanks for this, Keita. I think this looks good, and I see the utility of being able to treat Parquet files and JSON in a uniform manner.

The change regarding switching to physical types from logical doesn't affect anything other than the output if you wanted to print the schema, correct? Looking at the table in your link, the Parquet physical types have more info than the Parquet logical types, so I like this change. Just want to make sure there aren't any implications I missed.

I believe all versions starting with Apache Arrow 8.0 were compatible with Tahsin's original implementation. I experimented with more extensive testing against Arrow last week, but Apache Arrow doesn't make it easy to install an old version. I was able to get our CI to build old versions from source, but the GitHub Actions jobs timed out because the build takes too long.

For now, I say we can have our CI jobs test the most recent version available through apt. We can set up some additional tests through GitLab using Spack if we need to support specific versions of Apache Arrow. I'm going to hold off on doing that for the time being.

I don't have any comments about more complex Parquet files. I'll leave that to @rogerpearce if he has any comments.

KIwabuchi commented 1 year ago

Hey Trevor,

> The change regarding switching to physical types from logical doesn't affect anything other than the output if you wanted to print the schema, correct?

Yes, that's correct. You can still get logical information from schema_to_string().

> For now, I say we can have our CI jobs test the most recent version available through apt. We can set up some additional tests through GitLab using Spack if we need to support specific versions of Apache Arrow. I'm going to hold off on doing that for the time being.

Sounds good to me! I'll try to find a workaround for the FIXED_LEN_BYTE_ARRAY type and modify this PR if it goes well. I'll test with older versions.

Here are the outputs of the old and the new versions of ygm/examples/io/arrow_parquet_stream_reader.cpp.

Only the fifth line of each output differs; it is produced by the schema() function I changed in this PR. Lines 6–12 are the output of schema_to_string(). As you can see, you can still print the logical type information using this function.

Old ver:

Arrow Parquet file parser example
4 files in ../test/data/parquet_files/
#Fields: 5
Schema:
String:Make_and_model_string, None:Country_char[4], Int(bitWidth=64, isSigned=false):Top_speed_uint64, None:0-60_time_double, None:EV_or_not_boolean,
required group field_id=-1 schema {
  optional binary field_id=-1 Make_and_model_string (String);
  optional fixed_len_byte_array(4) field_id=-1 Country_char[4];
  optional int64 field_id=-1 Top_speed_uint64 (Int(bitWidth=64, isSigned=false));
  optional double field_id=-1 0-60_time_double;
  optional boolean field_id=-1 EV_or_not_boolean;
}

#Rows: 12
(Make_and_model_string) (Country_char[4]) (Top_speed_uint64) (0-60_time_double) (EV_or_not_boolean)
0: Koenigsegg Jesko Absolut, SWE, 330, 2.3, false
0: Hennessey Venom F5, USA, 311, 2.4, false
0: Bugatti Bolide, GER, 310, 2.17, false
1: Bugatti Chiron Super Sport 300+, GER, 304, 2.3, false
1: SSC Tuatara, USA, 283, 2.5, false
1: Rimac Nevera, USA, 258, 1.97, true

New ver:

Arrow Parquet file parser example
4 files in ../test/data/parquet_files/
#Fields: 5
Schema:
BYTE_ARRAY:Make_and_model_string, FIXED_LEN_BYTE_ARRAY:Country_char[4], INT64:Top_speed_uint64, DOUBLE:0-60_time_double, BOOLEAN:EV_or_not_boolean,
required group field_id=-1 schema {
  optional binary field_id=-1 Make_and_model_string (String);
  optional fixed_len_byte_array(4) field_id=-1 Country_char[4];
  optional int64 field_id=-1 Top_speed_uint64 (Int(bitWidth=64, isSigned=false));
  optional double field_id=-1 0-60_time_double;
  optional boolean field_id=-1 EV_or_not_boolean;
}

#Rows: 12
(Make_and_model_string) (Country_char[4]) (Top_speed_uint64) (0-60_time_double) (EV_or_not_boolean)
0: Koenigsegg Jesko Absolut, SWE, 330, 2.3, false
0: Hennessey Venom F5, USA, 311, 2.4, false
0: Bugatti Bolide, GER, 310, 2.17, false
1: Bugatti Chiron Super Sport 300+, GER, 304, 2.3, false
1: SSC Tuatara, USA, 283, 2.5, false
1: Rimac Nevera, USA, 258, 1.97, true

KIwabuchi commented 1 year ago

We decided not to support FIXED_LEN_BYTE_ARRAY for now. @steiltre This PR is ready for review. Let me know if you have any comments!

steiltre commented 1 year ago

This looks good. Merging into develop.

LLNL / ygm

Add Parquet -> JSON Converter #181
