Closed KIwabuchi closed 1 year ago
FYI, I opened this PR so that we can discuss the Parquet reader. I don't expect it to be merged soon :)
Thanks for this, Keita. I think this looks good, and I see the utility of being able to treat Parquet files and JSON in a uniform manner.
The change regarding switching to physical types from logical doesn't affect anything other than the output if you wanted to print the schema, correct? Looking at the table in your link, the Parquet physical types have more info than the Parquet logical types, so I like this change. Just want to make sure there aren't any implications I missed.
I believe all versions starting with Apache Arrow 8.0 were compatible with Tahsin's original implementation. I played around with more extensive testing of Arrow last week, but Apache Arrow doesn't seem to make it easy to install an old version. I was able to get our CI building old versions from source, but the GitHub Actions jobs were timing out because building from source takes too long.
For now, I say we have our CI jobs test the most recent version available through apt. We can set up some additional tests through GitLab using Spack if we need to support specific versions of Apache Arrow. I'm going to hold off on doing that for the time being.
I don't have any comments about more complex Parquet files. I'll leave that to @rogerpearce if he has any comments.
Hey Trevor,
> The change regarding switching to physical types from logical doesn't affect anything other than the output if you wanted to print the schema, correct?
Yes, that's correct. You can still get logical information from schema_to_string().
> For now, I say we can have our CI jobs testing the most recent version available through apt. We can set up some additional tests through Gitlab using Spack if we need to support specific versions of Apache Arrow. I'm going to hold off on doing that for the time being.
Sounds good to me!
I'll try to find a workaround for the FIXED_LEN_BYTE_ARRAY type and modify this PR if it goes well. I'll test with older versions.
Here are the outputs of the old and the new versions of ygm/examples/io/arrow_parquet_stream_reader.cpp.
Only the 5th lines are different, which use the schema() function I changed in this PR. Lines 6–12 are the outputs of schema_to_string(). As you can see, you can still print out the logical type information using this function.
Old ver:
Arrow Parquet file parser example
4 files in ../test/data/parquet_files/
#Fields: 5
Schema:
String:Make_and_model_string, None:Country_char[4], Int(bitWidth=64, isSigned=false):Top_speed_uint64, None:0-60_time_double, None:EV_or_not_boolean,
required group field_id=-1 schema {
optional binary field_id=-1 Make_and_model_string (String);
optional fixed_len_byte_array(4) field_id=-1 Country_char[4];
optional int64 field_id=-1 Top_speed_uint64 (Int(bitWidth=64, isSigned=false));
optional double field_id=-1 0-60_time_double;
optional boolean field_id=-1 EV_or_not_boolean;
}
#Rows: 12
(Make_and_model_string) (Country_char[4]) (Top_speed_uint64) (0-60_time_double) (EV_or_not_boolean)
0: Koenigsegg Jesko Absolut, SWE, 330, 2.3, false
0: Hennessey Venom F5, USA, 311, 2.4, false
0: Bugatti Bolide, GER, 310, 2.17, false
1: Bugatti Chiron Super Sport 300+, GER, 304, 2.3, false
1: SSC Tuatara, USA, 283, 2.5, false
1: Rimac Nevera, USA, 258, 1.97, true
New ver:
Arrow Parquet file parser example
4 files in ../test/data/parquet_files/
#Fields: 5
Schema:
BYTE_ARRAY:Make_and_model_string, FIXED_LEN_BYTE_ARRAY:Country_char[4], INT64:Top_speed_uint64, DOUBLE:0-60_time_double, BOOLEAN:EV_or_not_boolean,
required group field_id=-1 schema {
optional binary field_id=-1 Make_and_model_string (String);
optional fixed_len_byte_array(4) field_id=-1 Country_char[4];
optional int64 field_id=-1 Top_speed_uint64 (Int(bitWidth=64, isSigned=false));
optional double field_id=-1 0-60_time_double;
optional boolean field_id=-1 EV_or_not_boolean;
}
#Rows: 12
(Make_and_model_string) (Country_char[4]) (Top_speed_uint64) (0-60_time_double) (EV_or_not_boolean)
0: Koenigsegg Jesko Absolut, SWE, 330, 2.3, false
0: Hennessey Venom F5, USA, 311, 2.4, false
0: Bugatti Bolide, GER, 310, 2.17, false
1: Bugatti Chiron Super Sport 300+, GER, 304, 2.3, false
1: SSC Tuatara, USA, 283, 2.5, false
1: Rimac Nevera, USA, 258, 1.97, true
We decided not to support FIXED_LEN_BYTE_ARRAY for now. @steiltre This PR is ready for review. Let me know if you have any comments!
This looks good. Merging in to develop.
… (read_parquet_as_json) that converts Parquet data to Boost.JSON object.
… (arrow_parquet_parser) to return Parquet physical types rather than Parquet logical types.
… FIXED_LEN_BYTE_ARRAY type because it requires fundamental changes in our parquet reader.
… FIXED_LEN_BYTE_ARRAY data from parquet::StreamReader.
Questions