Open FlechazoW opened 1 month ago
I am not sure if this is a bug or an issue caused by improper usage. If it is a bug, please let me know, and I can help fix it, thanks.
@FlechazoW do you have a reproducible example where this happens?
This is the metadata of the Parquet file:
Schema:
message schema {
  optional boolean col_boolean;
  optional int32 col_tinyint (INTEGER(8,true));
  optional int32 col_smallint (INTEGER(16,true));
  optional int32 col_int;
  optional int64 col_bigint;
  optional float col_float;
  optional double col_double;
  optional fixed_len_byte_array(16) col_decimal (DECIMAL(38,18));
  optional binary col_string (STRING);
  optional binary col_varchar (STRING);
  optional binary col_binary;
  optional int64 col_timestamp (TIMESTAMP(MICROS,true));
  optional int64 col_datetime (TIMESTAMP(MICROS,true));
  optional group col_array (LIST) {
    repeated group list {
      optional group element (MAP) {
        repeated group key_value {
          required int64 key;
          optional int64 value;
        }
      }
    }
  }
  optional group col_array_int (LIST) {
    repeated group list {
      optional int64 element;
    }
  }
  optional group col_array_double (LIST) {
    repeated group list {
      optional double element;
    }
  }
  optional group col_array_string (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group col_map (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional group value (LIST) {
        repeated group list {
          optional int64 element;
        }
      }
    }
  }
  optional group col_struct {
    optional binary s1 (STRING);
    optional int64 s2;
  }
  optional int64 col_map_bigint;
  optional group col_map_int (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional int32 value;
    }
  }
  optional int32 col_date (DATE);
  optional binary col_json (STRING);
}
Row group 0: count: 2 1.129 kB records start: 4 total(compressed): 2.258 kB total(uncompressed):1.823 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
col_boolean BOOLEAN Z _ 2 19.50 B 0 "true" / "true"
col_tinyint INT32 Z _ R 2 39.00 B 0 "1" / "2"
col_smallint INT32 Z _ R 2 39.00 B 0 "2" / "3"
col_int INT32 Z _ R 2 39.00 B 0 "3" / "4"
col_bigint INT64 Z _ R 2 43.00 B 0 "4" / "5"
col_float FLOAT Z _ R 2 39.00 B 0 "5.0" / "6.0"
col_double DOUBLE Z _ R 2 43.00 B 0 "6.0" / "7.0"
col_decimal FIXED[16] Z _ R 2 48.50 B 0 "7.123000000000000000" / "8.122999999999999000"
col_string BINARY Z _ R 2 44.50 B 0 "字符串示例" / "字符串示例"
col_varchar BINARY Z _ R 2 43.50 B 0 "varchar示例" / "varchar示例"
col_binary BINARY Z _ R 2 50.00 B 0 "0x5B42403636353138353863" / "0x5B42403765616330393937"
col_timestamp INT64 Z _ R 2 39.00 B 0 "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_datetime INT64 Z _ R 2 39.00 B 0 "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_array.list.element.key_value.key INT64 Z _ R 4 23.25 B
col_array.list.element.key_value.value INT64 Z _ R 4 23.25 B
col_array_int.list.element INT64 Z _ R 6 16.17 B
col_array_double.list.element DOUBLE Z _ R 4 23.00 B
col_array_string.list.element BINARY Z _ R 6 16.33 B
col_map.key_value.key BINARY Z _ R 4 23.00 B
col_map.key_value.value.list.element INT64 Z _ R 8 12.38 B
col_struct.s1 BINARY Z _ R 2 41.00 B 0 "s1的值" / "s1的值"
col_struct.s2 INT64 Z _ R 2 39.00 B 0 "1" / "1"
col_map_bigint INT64 Z _ R 2 39.00 B 0 "8" / "8"
col_map_int.key_value.key BINARY Z _ R 4 23.00 B
col_map_int.key_value.value INT32 Z _ R 4 21.00 B
col_date INT32 Z _ R 2 36.50 B 0 "2017-11-11" / "2017-11-11"
col_json BINARY Z _ R 2 54.50 B 0 "123" / "{"id":11,"name":"Lakehouse"}"
@nastra Do you need any additional information?
I see a similar issue with struct columns.
First, the code checks whether the file schema (the Parquet file's schema) has field IDs. If it does not, it assigns an ID to each top-level column, starting from ordinal 1. However, the fields inside a struct column are not assigned IDs. This discrepancy causes problems when building the record reader for a struct column.
Additionally, for a struct column, the method recursively calls visitFields with the struct column's type. During this recursion it cannot find IDs for the fields inside the struct, which produces a null record reader. Consequently, the struct column is read as null.
I will share a unit test case soon.
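To make the ID assignment described in this comment concrete, here is a minimal, library-free sketch in Python. It is only a model of the behavior as described here, not Iceberg code: the `Field` class and the helpers `assign_top_level_ids` and `fields_without_ids` are hypothetical names invented for illustration. The schema reuses the `col_int` and `col_struct` columns from the file metadata in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a Parquet schema field (not an Iceberg class)."""
    name: str
    children: List["Field"] = field(default_factory=list)
    field_id: Optional[int] = None


def assign_top_level_ids(columns: List[Field]) -> None:
    """Number only the top-level columns, starting from 1 (assumed behavior)."""
    for ordinal, column in enumerate(columns, start=1):
        column.field_id = ordinal


def fields_without_ids(columns: List[Field], prefix: str = "") -> List[str]:
    """Return the dotted path of every field that still has no ID."""
    missing = []
    for f in columns:
        path = prefix + f.name
        if f.field_id is None:
            missing.append(path)
        # Recurse into nested fields, as a schema visitor would.
        missing.extend(fields_without_ids(f.children, path + "."))
    return missing


schema = [
    Field("col_int"),
    Field("col_struct", children=[Field("s1"), Field("s2")]),
]
assign_top_level_ids(schema)
missing = fields_without_ids(schema)
print(missing)  # ['col_struct.s1', 'col_struct.s2']
```

In this model the top-level columns receive IDs 1 and 2, while `s1` and `s2` stay without IDs, which is the gap a recursive visit of the struct would then run into.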
Apache Iceberg version
main (development)
Query engine
None
Please describe the bug 🐞
For nested struct types, when group.field.getId returns null, iField is null, and consequently the ParquetValueReader is also null, so the data in the struct type cannot be read.
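The null propagation described above can be illustrated with a simplified Python model. This is a sketch under stated assumptions rather than the actual Iceberg reader code: `build_field_reader`, `read_struct`, and the ID-to-name mapping are hypothetical, and a "reader" is reduced to a function that picks one value out of a row dictionary.

```python
from typing import Callable, Dict, Optional

# A "reader" here is just a function that pulls one value out of a row dict.
Reader = Callable[[dict], object]


def build_field_reader(name: str, field_id: Optional[int],
                       expected_by_id: Dict[int, str]) -> Optional[Reader]:
    """Return a reader when the file field's ID matches an expected field."""
    if field_id is None or field_id not in expected_by_id:
        return None  # models the null reader described in the report
    return lambda row: row.get(name)


def read_struct(row: dict, file_fields: Dict[str, Optional[int]],
                expected_by_id: Dict[int, str]) -> Optional[dict]:
    """Read a struct; returns None when no child field produced a reader."""
    readers = {
        name: build_field_reader(name, fid, expected_by_id)
        for name, fid in file_fields.items()
    }
    usable = {n: r for n, r in readers.items() if r is not None}
    if not usable:
        return None
    return {n: r(row) for n, r in usable.items()}


expected = {1: "s1", 2: "s2"}            # hypothetical expected-schema IDs
row = {"s1": "s1的值", "s2": 1}           # values taken from the file metadata

with_ids = read_struct(row, {"s1": 1, "s2": 2}, expected)
without_ids = read_struct(row, {"s1": None, "s2": None}, expected)
print(with_ids)     # {'s1': 's1的值', 's2': 1}
print(without_ids)  # None
```

The data in `row` is identical in both calls; only the missing IDs on the struct's fields make the second read come back empty.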