apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Unexpected Behaviour of nullable fields #36060

Open christiangiessleraracom opened 1 year ago

christiangiessleraracom commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

I have the following problem: if an optional field is not present in the JSON but is declared in the schema, it is still created in the PyArrow table, including all nested fields declared in the schema (filled with null values). Is this the intended behaviour, or is there a setting so that fields missing from the input are also left out of the table?

Here is a data example. Our production data schema is of course much more complex and more deeply nested, but this illustrates what I am doing:

Schema (all fields are nullable):

field1: struct<subfield1: double, subfield2: double>
field2: timestamp[ms]
field3: double

JSON file:

{
  "field3": 123.4
}
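
For reference, a minimal sketch (mine, not part of the original report) of how this schema could be declared in PyArrow; the field names come from the example above, and pa_schema is the variable used in the code below:

import pyarrow as pa

# Sketch of the example schema; PyArrow fields are nullable by default.
pa_schema = pa.schema([
    pa.field("field1", pa.struct([
        pa.field("subfield1", pa.float64()),
        pa.field("subfield2", pa.float64()),
    ])),
    pa.field("field2", pa.timestamp("ms")),
    pa.field("field3", pa.float64()),
])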

Python code handling the data:

import pyarrow.json as pajson
import pyarrow.parquet as pq

read_options = pajson.ReadOptions(block_size=1600000000)

parse_options = pajson.ParseOptions(
    explicit_schema=pa_schema,
    unexpected_field_behavior="ignore"
)
table = pajson.read_json(
    tmp_file_name, read_options=read_options, parse_options=parse_options
)

pq.write_to_dataset(
    table=table,
    root_path=dataset_path,
    basename_template=hashvalue + ".parquet",
    existing_data_behavior="overwrite_or_ignore",
    schema=pa_schema
)

Table debug output from evaluation in PyCharm (the three bracketed blocks below are the columns field1, field2, and field3, in order):

column_names: ['field1', 'field2', 'field3']
columns:
[
  -- is_valid:
    [
      false
    ]
  -- child 0 type: double
    [
      null
    ]
  -- child 1 type: double
    [
      null
    ]
]
[
  [
    null
  ]
]
[
  [
    123.4
  ]
]

Component(s)

Python

js8544 commented 1 year ago

Hi, this is expected because all columns in a table object must have the same length.
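
To illustrate the constraint (a minimal sketch of mine, not from the thread): PyArrow refuses to build a table from columns of unequal length, so a field that is absent from the input can only show up as nulls, never as a shorter or missing column.

import pyarrow as pa

# Columns of different lengths are rejected outright, which is why a
# missing field has to be materialized as nulls rather than omitted.
try:
    pa.table({"a": [1, 2], "b": [1]})
except pa.ArrowInvalid as exc:
    print(exc)  # e.g. "Column 1 named b expected length 2 but got length 1"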

christiangiessleraracom commented 1 year ago

> Hi, this is expected because all columns in a table object must have the same length.

Thanks for your reply.

Yes, the equal-length requirement is clear. But is there a way to keep columns that contain no values at all out of the table? If a field is not there, it has no length to begin with.

js8544 commented 1 year ago

If you don't pass an explicit_schema to read_json, the resulting table won't contain any null columns. If you must set a schema, you can use this snippet to drop all-null columns after the table is created:

# Collect the names of columns that consist entirely of nulls, then drop them.
all_null_columns = []
for col, name in zip(table.columns, table.column_names):
    if col.null_count == col.length():
        all_null_columns.append(name)
table = table.drop_columns(all_null_columns)

Skipping all-null columns at parsing time just doesn't feel right: if an explicit_schema is set, a table with exactly that schema should be expected, right?
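
Applied to the one-row example above, this should drop field1 and field2 (both entirely null) and keep only field3; a sketch of the expected result:

# Hypothetical outcome for the example table after dropping all-null columns:
print(table.column_names)  # expected: ['field3']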

christiangiessleraracom commented 1 year ago

> If you don't pass an explicit_schema to read_json, the resulting table won't contain any null columns. If you must set a schema, you can use this snippet to drop all-null columns after the table is created:
>
> all_null_columns = []
> for col, name in zip(table.columns, table.column_names):
>     if col.null_count == col.length():
>         all_null_columns.append(name)
> table = table.drop_columns(all_null_columns)
>
> Skipping all-null columns at parsing time just doesn't feel right: if an explicit_schema is set, a table with exactly that schema should be expected, right?

Thanks. Yes, I have to pass a schema; I'll try your code.