Closed timsaucer closed 1 month ago
My statement above about testing on the Rust side is likely incorrect. I ran the same test as above, but loaded the dataframe from a parquet file instead of creating it in memory, and the expected behavior is reproduced.
If you append these lines to the bottom of the minimal example:
df.write_parquet("save_out.parquet")
df_reread = ctx.read_parquet("save_out.parquet")
df_reread.show()
df_reread.select(col("a")["outer_1"]["inner_2"]).show()
You get the expected result:
DataFrame()
+-------------------------------------+
| a |
+-------------------------------------+
| {outer_1: {inner_1: 1, inner_2: 2}} |
| {outer_1: {inner_1: 1, inner_2: }} |
| {outer_1: } |
+-------------------------------------+
DataFrame()
+-----------------------------+
| ?table?.a[outer_1][inner_2] |
+-----------------------------+
| 2 |
| |
| |
+-----------------------------+
It also shows the original table is reproduced. I'll continue digging, but I am no longer convinced this is a Python binding issue.
Further testing on the Rust side makes me think it is something about how the record batch is created in pyarrow. I created the same dataframe using StructBuilder in the gist below and cannot reproduce the problem.
https://gist.github.com/timsaucer/7527c0851b379d4e9c466d8972d49a01
I think I know what's going on.
Even if outer is null, we still have data within inner_1 and inner_2. When pyarrow creates the record batch, it sets these to the default value rather than null even though the outer struct is null. Then on the datafusion side we index into these and get those default values.
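To make the mechanism concrete, here is a minimal pure-Python sketch of it. It is not pyarrow or datafusion code: IntColumn, StructColumn, and both read functions are hypothetical stand-ins, and the sample data assumes that the child slot under the null parent is marked valid and holds the default value 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IntColumn:
    """Hypothetical stand-in for an integer array: values plus validity flags."""
    values: List[int]
    validity: List[bool]  # True means "this slot is not null"


@dataclass
class StructColumn:
    """Hypothetical stand-in for a struct array with its own validity flags."""
    children: Dict[str, IntColumn]
    validity: List[bool]


def read_child_only(struct: StructColumn, name: str) -> List[Optional[int]]:
    # Looks only at the child's validity, so a slot under a null parent
    # leaks whatever value the child buffer happens to hold.
    child = struct.children[name]
    return [v if ok else None for v, ok in zip(child.values, child.validity)]


def read_with_parent(struct: StructColumn, name: str) -> List[Optional[int]]:
    # Treats a slot as null if either the parent or the child is null.
    child = struct.children[name]
    return [
        v if (parent_ok and child_ok) else None
        for v, parent_ok, child_ok in zip(
            child.values, struct.validity, child.validity
        )
    ]


# Assumed layout: the third row has a null parent, but its child slot is
# marked valid and holds the default value 0.
outer = StructColumn(
    children={"inner_2": IntColumn(values=[2, 0, 0], validity=[True, False, True])},
    validity=[True, True, False],
)

print(read_child_only(outer, "inner_2"))   # [2, None, 0]
print(read_with_parent(outer, "inner_2"))  # [2, None, None]
```

The first read mirrors the unexpected result described above; the second combines parent and child validity, which is the kind of extra check being considered here.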
I think the right place to resolve this is in pyarrow setting null when all outer values are null. But maybe there are additional validity checks we should have. I'm going to think a little more about this issue before moving it to the most appropriate repo.
In my gist above, I went back and inserted values into the subfields inner_1 and inner_2 even though outer was null, and I am able to reproduce the problem above, so I definitely think this is not a datafusion-python problem.
Closing in favor of https://github.com/apache/arrow/issues/41833
Describe the bug
When you have a column that is a struct of structs and you attempt to index into the lowest level, you get an unexpected result if there is a null at the first level of the struct. In the dataframe below, outer_1 is a struct; when it is null and we try to access an inner member, we would expect to also get a null.
I have exported this dataframe to parquet and tested on the Rust side and the problem does not exist there, so I think it is something in this repo.
To Reproduce
Produces:
Expected behavior
Accessing a subfield of a null entry should also return null.