apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.6k stars 796 forks source link

Casting to and from unions #6247

Closed samuelcolvin closed 1 month ago

samuelcolvin commented 3 months ago

Continuing from https://github.com/apache/arrow-rs/pull/6218#pullrequestreview-2236585058 — I thought it worth creating a dedicated issue to discuss this before writing any more code.

Well pyarrow doesn't help much (or maybe it helps a lot by giving us flexibility!)

All four cases fail:

Unsupported cast to sparse_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast to dense_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast from sparse_union<0: int32=0, 1: string=1> to int32 using function cast_int32
Unsupported cast from dense_union<0: int64=0, 1: bool=1> to int32 using function cast_int32
Python Code ```py import pyarrow as pa int_array = pa.array([1, 2, 3, 4, 5], type=pa.int32()) union_fields = [ pa.field('int_field', pa.int32()), pa.field('string_field', pa.string()) ] try: print(int_array.cast(pa.union(union_fields, mode='sparse'))) except Exception as e: print(e) else: print('success') try: print(int_array.cast(pa.union(union_fields, mode='dense'))) except Exception as e: print(e) else: print('success') sparse_indices = pa.array([0, 1, 0, 1, 0], type=pa.int8()) sparse_children = [ pa.array([1, None, 3, None, None], type=pa.int32()), pa.array([None, 'a', None, 'b', None], type=pa.string()), ] sparse_union_array = pa.UnionArray.from_sparse(sparse_indices, sparse_children) # print(sparse_union_array) try: print(sparse_union_array.cast(pa.int32())) except Exception as e: print(e) else: print('success') dense_types = pa.array([0, 1, 1, 0, 0], type=pa.int8()) dense_offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32()) dense_children = [ pa.array([5, 6, 7]), pa.array([False, True]), ] dense_union_array = pa.UnionArray.from_dense(dense_types, dense_offsets, dense_children) # print(dense_union_array) try: print(dense_union_array.cast(pa.int32())) except Exception as e: print(e) else: print('success') ```

Here's my proposal for what we support and don't support (yet):

Casting to sparse and dense union

We choose the most appropriate child to cast to using the current logic - choose the exact matching type, otherwise the first type you can cast to, left to right.

I think this is fairly simple, uncontroversial and already implemented in #6218.

Casting from sparse and dense unions

I think we can support both sparse and dense using either zip, interleave or take — any suggestion on which will be fastest much appreciated.

We can do this, either:

  1. requiring one or more fields to be castable to the output type, and just casting those children, leaving values associated with other children null
  2. or, requiring all fields to be castable

I think @alamb suggested he'd prefer 2., I started implementing 1. in #6218 — this is so we can use this union cast logic for datafusion-functions-json, to match postgres behaviour.

When the user queries:

select count(*) from foo where (thing->'field')::int=4

The value returned from thing->'field' is a JsonUnion, hence I need that to be cast to an int even though that union includes stuff like string, object and array that can't be cast to an int.

(I'm trying to roughly match PostgreSQL where select ('{"foo": 123}'::jsonb->'foo')::int is valid)

If we go with route 2. above, this expression would raise an error.

Note: for the above case of (thing->'field')::int, we already do an optimisation pass where we convert json_get_union(thing, 'field')::int to json_get_int(thing, 'field') and therefore avoid this problem. My reason for implementing casting from unions in the first place was to support expression where JsonUnion is compared to values, but the optimization won't or can't work, e.g. if thing->'field' is in a CTE, then used later.

I guess if we decide that route 2. is correct, I have a few options:

samuelcolvin commented 3 months ago

Much as route 1. (very lax casting of unions) would simplify my use case, in writing this up I realised that probably doesn't make much sense in general.

alamb commented 3 months ago

Much as route 1. (very lax casting of unions) would simplify my use case, in writing this up I realised that probably doesn't make much sense in general.

Yeah -- I think of a Union as the way in arrow to represent a dynamically typed value: each row of a union can be one of a set of different types

So I guess if we are casting from a union array to another type, I would as a user expect that each row of the union (regardless of what variant it was) would be cast to the target type

gstvg commented 2 months ago

Hey @samuelcolvin, if I get it right, your first option looks like union_extract from DuckDB. Example usage~%0AINSERT-INTO-tbl1-values-(1)%2C-('two')%2C-(union_value(str-%3A%3D-'three'))~,SELECT-union_extract(u%2C-'str')-AS-str%0AFROM-tbl1~)

I'm trying to implement it at https://github.com/apache/datafusion/pull/12116 in case it helps

DuckDB union cast docs may also be of interest

samuelcolvin commented 2 months ago

interesting, I don't know what @alamb things, but I'd say it would be best to implement it in this repo rather than datafusion.

alamb commented 2 months ago

I think implementing a union_extract kernel in arrow-rs makes sense to me. Starting it in datafusion and then porting it to arrow-rs also would make sense

gstvg commented 2 months ago

I can port the PR here, it will take a few days. Hopefully this would avoid duplicate review work, especially since most of the tests should be rewritten from sqllogictests to unit tests. Do you agree?