Open asfimport opened 6 years ago
Antoine Pitrou / @pitrou: I'm not sure it's a good idea to do this by default, since it would hide problems with unsanitized input. Intuitively, I don't think real "union"-type data is frequent in the real world (as opposed to unsanitized data).
Also, we would have to choose a union kind (dense or sparse).
Uwe Korn / @xhochy:
This is definitely something users would like to see, but I would also like to see it hidden behind a flag. Dealing with unsanitized input is a typical pandas use case in exploratory data analysis, but once you use this as part of a production pipeline, you would rather have it raise an error.
Wes McKinney / @wesm:
I agree that disabling by default is a good idea. Maybe we can have an allow_unions=True flag.
Wes McKinney / @wesm: Unions will be really helpful when we get to working on CSV reading – handling messy CSV files with columns having non-standard strings or other markers in numeric columns has been a major issue for pandas over the years.
Antoine Pitrou / @pitrou:
I'm still not convinced this is a good idea. Consider pa.array([1, 2.3]). Should it return a union<int64, float64>?
cc @amol- for advice.
Joris Van den Bossche / @jorisvandenbossche: Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.
In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")
In [53]: pa.array([1, 2.3], type=typ)
...
ArrowNotImplementedError: sparse_union
../src/arrow/util/converter.h:265 VisitTypeInline(*visitor.type, &visitor)
../src/arrow/python/python_to_arrow.cc:1015 (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))
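Until that conversion exists, a sparse union can be assembled by hand. The following is a minimal sketch, not the library's conversion path: it assumes pa.UnionArray.from_sparse is available in the installed pyarrow, the helper names split_sparse and to_sparse_union are hypothetical, and dispatching on the Python type (int vs. float) is only one possible rule.

```python
def split_sparse(values):
    """Split a mixed int/float list into sparse-union parts (pure Python)."""
    type_ids, ints, floats = [], [], []
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"unsupported value: {v!r}")
        is_float = isinstance(v, float)
        type_ids.append(1 if is_float else 0)
        # Sparse children all have the union's length; unused slots are null.
        ints.append(None if is_float else v)
        floats.append(v if is_float else None)
    return type_ids, ints, floats


def to_sparse_union(values):
    """Hypothetical helper: build the sparse union array from the parts."""
    import pyarrow as pa  # imported lazily; assumes pyarrow is installed

    type_ids, ints, floats = split_sparse(values)
    return pa.UnionArray.from_sparse(
        pa.array(type_ids, type=pa.int8()),
        [pa.array(ints, type=pa.int64()), pa.array(floats, type=pa.float64())],
        field_names=["int", "float"],
    )
```

Calling to_sparse_union([1, 2.3]) should then give an array of the sparse union type from the session above, with each value stored in the child that matches its Python type.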
I would also like pyarrow.array to automatically convert Python values when a sparse union or dense union type is explicitly specified. I frequently use dense union types to represent data that originated in protocol buffers with oneof fields. It is inconvenient to have to implement special handling of this case when the target Arrow schema is known.
Also, I would like to politely observe that example code snippets in previous comments are misleading, because they do not distinguish between child fields that happen to have the same data type.
As an example, consider the following union type:
>>> string_predicate_type = pa.dense_union([
... pa.field("equals", pa.string(), False),
... pa.field("regexp", pa.string(), False),
... pa.field("is_null", pa.null()),
... ])
>>> string_predicate_type
DenseUnionType(dense_union<equals: string not null=0, regexp: string not null=1, is_null: null=2>)
Both equals and regexp are string, but they are semantically distinct. For pyarrow.array to convert Python values to the correct child field type, the values ought to be tagged:
pa.array([{"equals": "foo"}, {"regexp": "[0-9a-f]{16}"}, {"is_null": None}], type=string_predicate_type)
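The "special handling" needed today for such tagged values could look roughly like the sketch below. It assumes pa.UnionArray.from_dense is available in the installed pyarrow; the helper names split_tagged and to_dense_union are hypothetical, and the single-key-dict tagging convention is the one proposed above, not something pyarrow currently understands.

```python
def split_tagged(values, field_names):
    """Turn single-key dicts into dense-union type ids, offsets and child values."""
    index = {name: i for i, name in enumerate(field_names)}
    children = [[] for _ in field_names]
    type_ids, offsets = [], []
    for item in values:
        if len(item) != 1:
            raise ValueError(f"expected exactly one tag, got {item!r}")
        (tag, value), = item.items()
        child = index[tag]  # raises KeyError for an unknown tag
        type_ids.append(child)
        offsets.append(len(children[child]))  # position within that child
        children[child].append(value)
    return type_ids, offsets, children


def to_dense_union(values, fields):
    """Hypothetical helper: fields is the list of pa.Field objects of the union."""
    import pyarrow as pa  # imported lazily; assumes pyarrow is installed

    names = [f.name for f in fields]
    type_ids, offsets, children = split_tagged(values, names)
    # Caveat: passing only field_names may not carry over field nullability.
    return pa.UnionArray.from_dense(
        pa.array(type_ids, type=pa.int8()),
        pa.array(offsets, type=pa.int32()),
        [pa.array(vals, type=f.type) for vals, f in zip(children, fields)],
        field_names=names,
    )
```

The tag decides which child receives each value, so two string children such as equals and regexp stay distinct even though their data types are identical.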
Hi all, I'm running into this today. See #43857
I don't think we necessarily need to do automatic type inference here, i.e. make pyarrow smart and opinionated enough to infer these flexible schemas. However, supporting user-specified pa.union schemas would be a great help and seems non-controversial, since the user opts into it by specifying the schema to begin with.
Curious on thoughts here.
CC @assignUser
It would be useful to be able to generate unions during type inference:
Reporter: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-2774. Please see the migration documentation for further details.