apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
321 stars 64 forks source link

Selecting struct field within field produces unexpected results #715

Closed timsaucer closed 1 month ago

timsaucer commented 1 month ago

Describe the bug When you have a column that is a struct of struct and you attempt to index into the lowest level, if there is a null at the first level of the struct you get an unexpected result. In the dataframe below I have an outer_1 stuct that if it is null and we try to access an inner member, we would expect to also get a null.

I have exported this dataframe to parquet and tested on the rust side and the problem does not exist there, so I think it is something in this repo.

To Reproduce

ctx = SessionContext()

batch = pa.RecordBatch.from_arrays(
    [pa.array([
        {"outer_1": {"inner_1": 1, "inner_2": 2}},
        {"outer_1": {"inner_1": 1, "inner_2": None}},
        {"outer_1": None},
    ])],
    names=["a"],
)

df = ctx.create_dataframe([[batch]])

df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet")

df.select(col("a")).show()

df.select(col("a")["outer_1"]).show()

df.select(col("a")["outer_1"]["inner_2"]).show()

Produces:

03:20 PM (<1s)
ctx = SessionContext()

batch = pa.RecordBatch.from_arrays(
    [pa.array([
        {"outer_1": {"inner_1": 1, "inner_2": 2}},
        {"outer_1": {"inner_1": 1, "inner_2": None}},
        {"outer_1": None},
    ])],
    names=["a"],
)

df = ctx.create_dataframe([[batch]])

df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet")

df.select(col("a")).show()

df.select(col("a")["outer_1"]).show()

df.select(col("a")["outer_1"]["inner_2"]).show()
DataFrame()
+-------------------------------------+
| a                                   |
+-------------------------------------+
| {outer_1: {inner_1: 1, inner_2: 2}} |
| {outer_1: {inner_1: 1, inner_2: }}  |
| {outer_1: }                         |
+-------------------------------------+
DataFrame()
+----------------------------------------------+
| cc251bd408f114ca2a4354b6976d91339.a[outer_1] |
+----------------------------------------------+
| {inner_1: 1, inner_2: 2}                     |
| {inner_1: 1, inner_2: }                      |
|                                              |
+----------------------------------------------+
DataFrame()
+-------------------------------------------------------+
| cc251bd408f114ca2a4354b6976d91339.a[outer_1][inner_2] |
+-------------------------------------------------------+
| 2                                                     |
|                                                       |
| 0                                                     |
+-------------------------------------------------------+

Expected behavior

Accessing a subfield of a null entry should also return null.

timsaucer commented 1 month ago

My statement above about testing on rust side is likely incorrect. I ran the same test above but loading the dataframe from a parquet file instead of creating in memory and the expected behavior is reproduced.

If you amend these lines to the bottom of the minimal example

df.write_parquet("save_out.parquet")

df_reread = ctx.read_parquet("save_out.parquet")

df_reread.show()
df_reread.select(col("a")["outer_1"]["inner_2"]).show()

You get the expected result

DataFrame()
+-------------------------------------+
| a                                   |
+-------------------------------------+
| {outer_1: {inner_1: 1, inner_2: 2}} |
| {outer_1: {inner_1: 1, inner_2: }}  |
| {outer_1: }                         |
+-------------------------------------+
DataFrame()
+-----------------------------+
| ?table?.a[outer_1][inner_2] |
+-----------------------------+
| 2                           |
|                             |
|                             |
+-----------------------------+

It also shows the original table is reproduced. I'll continue digging but I no longer am convinced this is a python binding issue.

timsaucer commented 1 month ago

Further testing on the rust side makes me think it is something about how the batch record is created in pyarrow. I created the same dataframe using StructBuilder in the below gist and cannot reproduce the problem.

https://gist.github.com/timsaucer/7527c0851b379d4e9c466d8972d49a01

timsaucer commented 1 month ago

I think I know what's going on.

Even if outer is null, we still have data within inner_1 and inner_2. When pyarrow creates the record batch, it sets these to the default value rather than null even though the outer struct is null. Then on the datafusion side we index into these and get those default values.

I think the right place to resolve this is in pyarrow setting null when all outer values are null. But maybe there is additional validity checks we should have. I'm going to think a little more about this issue before moving it to the most appropriate repo.

timsaucer commented 1 month ago

In my gist above, I went back an inserted values into the subfields inner_1 and inner_2 even though outer was null and I am able to reproduce the problem above, so I definitely think this is not a datafusion-python problem.

timsaucer commented 1 month ago

Closing in favor of https://github.com/apache/arrow/issues/41833