apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Return an empty dict if nan values is not provided by the catalog #1575

Closed summermousa-vendia closed 1 week ago

summermousa-vendia commented 1 week ago

Fixes: https://github.com/apache/iceberg-python/issues/1574
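For readers skimming the thread, the behavior named in the title can be sketched roughly as follows. This is an illustration only, not the actual patch: the function name `readable_nan_counts` and its parameters are hypothetical and do not come from the PyIceberg code base. The idea is that a missing `nan_value_counts` map is treated as an empty dict, so per-column lookups yield null instead of raising.

```python
from typing import Dict, Iterable, Optional


def readable_nan_counts(
    nan_value_counts: Optional[Dict[int, int]],
    field_ids: Iterable[int],
) -> Dict[int, Optional[int]]:
    """Hypothetical helper: map each field id to its NaN count, or None.

    If the catalog did not provide nan_value_counts at all (None),
    fall back to an empty dict so every lookup returns None.
    """
    counts = nan_value_counts or {}
    return {field_id: counts.get(field_id) for field_id in field_ids}


# No NaN counts provided by the catalog: every column reports None.
missing = readable_nan_counts(None, [1, 2, 3])

# Counts provided only for the double columns (field ids 1 and 3).
partial = readable_nan_counts({1: 0, 3: 0}, [1, 2, 3])
```

The second call mirrors the demo below, where the string column (field id 2) has no NaN count and shows up as null in `readable_metrics`.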

Demo (redacted):

>>> table.inspect.entries()
pyarrow.Table
status: int8 not null
snapshot_id: int64 not null
sequence_number: int64 not null
file_sequence_number: int64 not null
data_file: struct<content: int8 not null, file_path: string not null, file_format: string not null, partition: struct<> not null, record_count: int64 not null, file_size_in_bytes: int64 not null, column_sizes: map<int32, int64>, value_counts: map<int32, int64>, null_value_counts: map<int32, int64>, nan_value_counts: map<int32, int64>, lower_bounds: map<int32, binary>, upper_bounds: map<int32, binary>, key_metadata: binary, split_offsets: list<item: int64>, equality_ids: list<item: int32>, sort_order_id: int32> not null
  child 0, content: int8 not null
  child 1, file_path: string not null
  child 2, file_format: string not null
  child 3, partition: struct<> not null
  child 4, record_count: int64 not null
  child 5, file_size_in_bytes: int64 not null
  child 6, column_sizes: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 7, value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 8, null_value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 9, nan_value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 10, lower_bounds: map<int32, binary>
      child 0, entries: struct<key: int32 not null, value: binary> not null
          child 0, key: int32 not null
          child 1, value: binary
  child 11, upper_bounds: map<int32, binary>
      child 0, entries: struct<key: int32 not null, value: binary> not null
          child 0, key: int32 not null
          child 1, value: binary
  child 12, key_metadata: binary
  child 13, split_offsets: list<item: int64>
      child 0, item: int64
  child 14, equality_ids: list<item: int32>
      child 0, item: int32
  child 15, sort_order_id: int32
readable_metrics: struct<age: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null, name: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string> not null, weight: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null>
  child 0, age: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: double
      child 5, upper_bound: double
  child 1, name: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: large_string
      child 5, upper_bound: large_string
  child 2, weight: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: double
      child 5, upper_bound: double
----
status: [[1,1,1,1,1,1,1,1,1]]
snapshot_id: [[1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061]]
sequence_number: [[2,2,2,2,2,2,2,2,2]]
file_sequence_number: [[2,2,2,2,2,2,2,2,2]]
data_file: [
  -- is_valid: all not null
  -- child 0 type: int8
[0,0,0,0,0,0,0,0,0]
  -- child 1 type: string
["s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***"]
  -- child 2 type: string
["PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET"]
  -- child 3 type: struct<>
    -- is_valid: all not null
  -- child 4 type: int64
[1,1,1,1,1,1,1,1,1]
  -- child 5 type: int64
[991,992,985,963,984,971,957,978,992]
  -- child 6 type: map<int32, int64>
[keys:[1,2,3]values:[46,55,45],keys:[1,2,3]values:[46,55,46],keys:[1,2,3]values:[46,54,46],keys:[1,2,3]values:[46,50,46],keys:[1,2,3]values:[45,54,46],keys:[1,2,3]values:[46,52,46],keys:[1,2,3]values:[46,50,46],keys:[1,2,3]values:[46,53,46],keys:[1,2,3]values:[46,55,46]]
  -- child 7 type: map<int32, int64>
[keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1]]
  -- child 8 type: map<int32, int64>
[keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0]]
  -- child 9 type: map<int32, int64>
[keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0]]
  -- child 10 type: map<int32, binary>
[keys:[1,2,3]values:[0000000000003640,436861726C6965204461766973,0000000000406540],keys:[1,2,3]values:[0000000000804640,42696C6C792042757463686572,0000000000006940],keys:[1,2,3]values:[0000000000804040,48616E6E616820477265656E,0000000000406040],keys:[1,2,3]values:[0000000000804140,426F622042726F776E,0000000000006940],keys:[1,2,3]values:[0000000000004440,47656F72676520426C61636B,0000000000406A40],keys:[1,2,3]values:[0000000000003940,4A616E6520536D697468,0000000000806140],keys:[1,2,3]values:[0000000000003E40,4A6F686E20446F65,0000000000806640],keys:[1,2,3]values:[0000000000003B40,456D696C79205768697465,0000000000006440],keys:[1,2,3]values:[0000000000003C40,416C696365204A6F686E736F6E,0000000000C06240]]
  -- child 11 type: map<int32, binary>
[keys:[1,2,3]values:[0000000000003640,436861726C6965204461766973,0000000000406540],keys:[1,2,3]values:[0000000000804640,42696C6C792042757463686572,0000000000006940],keys:[1,2,3]values:[0000000000804040,48616E6E616820477265656E,0000000000406040],keys:[1,2,3]values:[0000000000804140,426F622042726F776E,0000000000006940],keys:[1,2,3]values:[0000000000004440,47656F72676520426C61636B,0000000000406A40],keys:[1,2,3]values:[0000000000003940,4A616E6520536D697468,0000000000806140],keys:[1,2,3]values:[0000000000003E40,4A6F686E20446F65,0000000000806640],keys:[1,2,3]values:[0000000000003B40,456D696C79205768697465,0000000000006440],keys:[1,2,3]values:[0000000000003C40,416C696365204A6F686E736F6E,0000000000C06240]]
  -- child 12 type: binary
[null,null,null,null,null,null,null,null,null]
  -- child 13 type: list<item: int64>
[[4],[4],...,[4],[4]]
  -- child 14 type: list<item: int32>
[null,null,...,null,null]
  -- child 15 type: int32
[0,0,0,0,0,0,0,0,0]]
readable_metrics: [
  -- is_valid: all not null
  -- child 0 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double>
    -- is_valid: all not null
    -- child 0 type: int64
[46,46,46,46,45,46,46,46,46]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 4 type: double
[22,45,33,35,40,25,30,27,28]
    -- child 5 type: double
[22,45,33,35,40,25,30,27,28]
  -- child 1 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string>
    -- is_valid: all not null
    -- child 0 type: int64
[55,55,54,50,54,52,50,53,55]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[null,null,null,null,null,null,null,null,null]
    -- child 4 type: large_string
["Charlie Davis","Billy Butcher","Hannah Green","Bob Brown","George Black","Jane Smith","John Doe","Emily White","Alice Johnson"]
    -- child 5 type: large_string
["Charlie Davis","Billy Butcher","Hannah Green","Bob Brown","George Black","Jane Smith","John Doe","Emily White","Alice Johnson"]
  -- child 2 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double>
    -- is_valid: all not null
    -- child 0 type: int64
[45,46,46,46,46,46,46,46,46]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 4 type: double
[170,200,130,200,210,140,180,160,150]
    -- child 5 type: double
[170,200,130,200,210,140,180,160,150]]
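
The raw `lower_bounds` / `upper_bounds` values in `data_file` are the same numbers and strings shown in `readable_metrics`, just in serialized form. A minimal sketch of decoding them by hand, assuming the Iceberg single-value binary encoding (8-byte little-endian IEEE 754 for doubles, UTF-8 for strings); the helper names `decode_double` and `decode_string` are made up for this example and are not PyIceberg functions:

```python
import struct


def decode_double(hex_value: str) -> float:
    """Decode an 8-byte little-endian IEEE 754 double from its hex form."""
    return struct.unpack("<d", bytes.fromhex(hex_value))[0]


def decode_string(hex_value: str) -> str:
    """Decode a UTF-8 string from its hex form."""
    return bytes.fromhex(hex_value).decode("utf-8")


# Bounds of the first data file in the output above (field ids 1, 2, 3).
age = decode_double("0000000000003640")
name = decode_string("436861726C6965204461766973")
weight = decode_double("0000000000406540")

print(age, name, weight)  # 22.0 Charlie Davis 170.0
```

These match the first entries of `lower_bound` for `age`, `name`, and `weight` in `readable_metrics`.
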
summermousa-vendia commented 1 week ago

Thank you for the quick turnaround on the review. Do you know when this might be released?

kevinjqliu commented 1 week ago

Hi @summermousa-vendia, this will be part of the next release (0.9.0). I don't have a timeline yet, but it should be soon. There's a community sync tomorrow, and I'll bring this up there.