apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

[Parquet] When reading struct-type data without an id in iceberg-parquet, it returns null values. #11214

Open FlechazoW opened 1 month ago

FlechazoW commented 1 month ago

Apache Iceberg version

main (development)

Query engine

None

Please describe the bug 🐞

For nested struct types, when group.field.getId returns null, iField is also null. As a result, the ParquetValueReader for that field is null as well, and the data in the struct cannot be read.
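
The failure mode described above can be illustrated with a small, self-contained sketch. It is not Iceberg's code: the `FileField` class, the `resolve` method, and the expected IDs 1 and 2 are hypothetical and only model the idea that a field is matched by ID, so a missing ID produces a null reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are NOT Iceberg's actual classes.
final class StructReadSketch {
  // A file-side field whose ID may be null when the writer did not record field IDs.
  static final class FileField {
    final String name;
    final Integer id;

    FileField(String name, Integer id) {
      this.name = name;
      this.id = id;
    }
  }

  // Matches each file field to an expected field by ID only.
  // A null ID (or an unknown ID) yields a null "reader" placeholder.
  static List<String> resolve(List<FileField> fileFields, Map<Integer, String> expectedById) {
    List<String> readers = new ArrayList<>();
    for (FileField field : fileFields) {
      String expected = field.id == null ? null : expectedById.get(field.id);
      readers.add(expected == null ? null : "reader(" + expected + ")");
    }
    return readers;
  }

  public static void main(String[] args) {
    // Assumed table-side IDs for the fields of col_struct.
    Map<Integer, String> expectedById = new HashMap<>();
    expectedById.put(1, "s1");
    expectedById.put(2, "s2");

    List<FileField> withoutIds =
        Arrays.asList(new FileField("s1", null), new FileField("s2", null));
    List<FileField> withIds =
        Arrays.asList(new FileField("s1", 1), new FileField("s2", 2));

    System.out.println(resolve(withoutIds, expectedById)); // [null, null]
    System.out.println(resolve(withIds, expectedById));    // [reader(s1), reader(s2)]
  }
}
```

In this model, the nested fields without IDs never reach a reader, which matches the symptom that the struct column comes back empty.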

FlechazoW commented 1 month ago

I am not sure whether this is a bug or the result of improper usage. If it is a bug, please let me know and I can help fix it. Thanks.

nastra commented 1 month ago

@FlechazoW do you have a reproducible example where this happens?

FlechazoW commented 1 month ago

This is the metadata of the Parquet file:

Schema:
message schema {
  optional boolean col_boolean;
  optional int32 col_tinyint (INTEGER(8,true));
  optional int32 col_smallint (INTEGER(16,true));
  optional int32 col_int;
  optional int64 col_bigint;
  optional float col_float;
  optional double col_double;
  optional fixed_len_byte_array(16) col_decimal (DECIMAL(38,18));
  optional binary col_string (STRING);
  optional binary col_varchar (STRING);
  optional binary col_binary;
  optional int64 col_timestamp (TIMESTAMP(MICROS,true));
  optional int64 col_datetime (TIMESTAMP(MICROS,true));
  optional group col_array (LIST) {
    repeated group list {
      optional group element (MAP) {
        repeated group key_value {
          required int64 key;
          optional int64 value;
        }
      }
    }
  }
  optional group col_array_int (LIST) {
    repeated group list {
      optional int64 element;
    }
  }
  optional group col_array_double (LIST) {
    repeated group list {
      optional double element;
    }
  }
  optional group col_array_string (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group col_map (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional group value (LIST) {
        repeated group list {
          optional int64 element;
        }
      }
    }
  }
  optional group col_struct {
    optional binary s1 (STRING);
    optional int64 s2;
  }
  optional int64 col_map_bigint;
  optional group col_map_int (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional int32 value;
    }
  }
  optional int32 col_date (DATE);
  optional binary col_json (STRING);
}

Row group 0:  count: 2  1.129 kB records  start: 4  total(compressed): 2.258 kB total(uncompressed):1.823 kB 
--------------------------------------------------------------------------------
                                        type      encodings count     avg size   nulls   min / max
col_boolean                             BOOLEAN   Z   _     2         19.50 B    0       "true" / "true"
col_tinyint                             INT32     Z _ R     2         39.00 B    0       "1" / "2"
col_smallint                            INT32     Z _ R     2         39.00 B    0       "2" / "3"
col_int                                 INT32     Z _ R     2         39.00 B    0       "3" / "4"
col_bigint                              INT64     Z _ R     2         43.00 B    0       "4" / "5"
col_float                               FLOAT     Z _ R     2         39.00 B    0       "5.0" / "6.0"
col_double                              DOUBLE    Z _ R     2         43.00 B    0       "6.0" / "7.0"
col_decimal                             FIXED[16] Z _ R     2         48.50 B    0       "7.123000000000000000" / "8.122999999999999000"
col_string                              BINARY    Z _ R     2         44.50 B    0       "字符串示例" / "字符串示例"
col_varchar                             BINARY    Z _ R     2         43.50 B    0       "varchar示例" / "varchar示例"
col_binary                              BINARY    Z _ R     2         50.00 B    0       "0x5B42403636353138353863" / "0x5B42403765616330393937"
col_timestamp                           INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_datetime                            INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_array.list.element.key_value.key    INT64     Z _ R     4         23.25 B            
col_array.list.element.key_value.value  INT64     Z _ R     4         23.25 B            
col_array_int.list.element              INT64     Z _ R     6         16.17 B            
col_array_double.list.element           DOUBLE    Z _ R     4         23.00 B            
col_array_string.list.element           BINARY    Z _ R     6         16.33 B            
col_map.key_value.key                   BINARY    Z _ R     4         23.00 B            
col_map.key_value.value.list.element    INT64     Z _ R     8         12.38 B            
col_struct.s1                           BINARY    Z _ R     2         41.00 B    0       "s1的值" / "s1的值"
col_struct.s2                           INT64     Z _ R     2         39.00 B    0       "1" / "1"
col_map_bigint                          INT64     Z _ R     2         39.00 B    0       "8" / "8"
col_map_int.key_value.key               BINARY    Z _ R     4         23.00 B            
col_map_int.key_value.value             INT32     Z _ R     4         21.00 B            
col_date                                INT32     Z _ R     2         36.50 B    0       "2017-11-11" / "2017-11-11"
col_json                                BINARY    Z _ R     2         54.50 B    0       "123" / "{"id":11,"name":"Lakehouse"}"

FlechazoW commented 1 month ago

@nastra Do you need any additional information?

ashokvengala1990 commented 1 month ago

I see a similar issue with struct columns.

First, the code checks whether the file schema (the schema of the Parquet file) has IDs. If it does not, it assigns an ID to each column, starting from ordinal 1. However, the fields inside a struct column are not assigned IDs. This discrepancy causes problems when building a record reader for a struct column.

Additionally, for a struct column, the method recursively calls visitFields with the struct column's type. During this call, it cannot find IDs for the fields inside the struct, which leads to a null record reader object. Consequently, the struct column is returned as null.
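
To make the ID discrepancy concrete, here is a minimal sketch under stated assumptions. The `Node` class and the `assignTopLevelIds` and `fieldsWithoutIds` methods are hypothetical names invented for illustration, not Iceberg's API. The sketch only models the behaviour described above: ordinal IDs are given to top-level columns, and nested struct fields are left without one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified schema model; these are NOT Iceberg's actual classes.
final class FallbackIdSketch {
  static final class Node {
    final String name;
    final List<Node> children;
    Integer id; // null means the field carries no ID

    Node(String name, Node... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }
  }

  // Assigns ordinal IDs, starting at 1, to top-level columns only.
  static void assignTopLevelIds(List<Node> columns) {
    int next = 1;
    for (Node column : columns) {
      column.id = next++;
    }
  }

  // Walks the whole tree and returns the dotted paths of fields that still have no ID.
  static List<String> fieldsWithoutIds(List<Node> nodes, String prefix) {
    List<String> missing = new ArrayList<>();
    for (Node node : nodes) {
      String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
      if (node.id == null) {
        missing.add(path);
      }
      missing.addAll(fieldsWithoutIds(node.children, path));
    }
    return missing;
  }

  public static void main(String[] args) {
    // Two columns taken from the schema posted earlier in this thread.
    List<Node> columns = Arrays.asList(
        new Node("col_int"),
        new Node("col_struct", new Node("s1"), new Node("s2")));

    assignTopLevelIds(columns);
    System.out.println(fieldsWithoutIds(columns, "")); // [col_struct.s1, col_struct.s2]
  }
}
```

In this model, col_struct itself receives an ID, but s1 and s2 do not, so any later step that resolves nested fields by ID has nothing to match against.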

I will share a unit test case soon.