apache / arrow-go

Official Go implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
15 stars 5 forks source link

structs: Support casting nullable fields to non-nullable if there are no null values #133

Open NickCrews opened 1 month ago

NickCrews commented 1 month ago

This solves https://github.com/apache/arrow/issues/33592

This is a sister PR to https://github.com/apache/arrow/pull/43782, I was told to open a PR here to make this change on the Go side.

I need a few pointers though if you want me to do the implementation. Could you give me a high-level summary of what needs to happen? I did an initial exploration and had a few notes/questions:

zeroshade commented 4 weeks ago

More philosophically, is this behavior actually in the arrow spec? Or is the spec just about the in-memory format, and implementations are free to define their compute/casting behavior as desired? It seems funny/non-ideal to me that it is possible for the go implementation of this casting to drift away from the c++ implementation. I would think it would be better if there was some test that verified they both had the same behavior?

This is the case, the spec just defines the in-memory format. Anything implementing compute and casting behavior is outside the scope of the spec itself as different compute engines are free to have different behaviors. For a spec attempting to create standardized behavior on compute functionality, there is https://substrait.io

That said, I agree the ideal would be for the go implementation to follow the C++ implementation where possible for these things.

I went looking for an equivalent nullable to non-nullable cast test for non-struct arrays, but couldn't find any. Is this possibly not tested? Or do I misunderstand the arrow spec, and all datatypes EXCEPT struct fields are nullable, so there is no eg non-nullable uint16?

By the Arrow spec, "nullability" is a feature of "fields" as it is simply a piece of metadata to inform consumers of data. So you can only mark something as non-nullable by way of a schema (i.e. top-level columns in a record batch can be marked non-nullable via their fields in the schema) or by virtue of it being a nested field of some other type (i.e. struct, list, map, etc. type) which contains fields to describe its children.

So you can have a non-nullable uint16, if the schema/parent struct/parent list type defines the column field as non-nullable. Since casting doesn't accept record batches, only Array or ChunkedArray, the only cases where you might run into this nullability issue are in the child fields of these nested types. Since we do provide factories for constructing lists with non-nullable element fields, we probably should add tests to verify we handle nullability/non-nullability for lists/maps/etc.