apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.34k stars 3.48k forks source link

[Python] Implement pa.array() with type=union type #19157

Open asfimport opened 6 years ago

asfimport commented 6 years ago

It would be useful to be able to generate unions during type inference:


In [11]: pa.array([{'a': 1, 'b': 'string'}, {'b': 2}])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-11-c554b698271b> in <module>()
----> 1 pa.array([{'a': 1, 'b': 'string'}, {'b': 2}])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    179         if mask is not None:
    180             raise ValueError("Masks only supported with ndarray-like inputs")
--> 181         return _sequence_to_array(obj, size, type, pool)
    182 
    183 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     24         if size is None:
     25             with nogil:
---> 26                 check_status(ConvertPySequence(sequence, pool, &out))
     27         else:
     28             c_size = size

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/builtin_convert.cc:794 code: AppendPySequence(seq, size, real_type, builder.get())
../src/arrow/python/iterators.h:60 code: func(value)
../src/arrow/python/builtin_convert.cc:619 code: value_converters_[i]->AppendSingle(valueobj ? valueobj : Py_None)
../src/arrow/python/builtin_convert.cc:414 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:259 code: CheckPyError()
an integer is required (got type str)

In [12]: pa.array([5, 'str', False])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-12-9e3343f08351> in <module>()
----> 1 pa.array([5, 'str', False])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    179         if mask is not None:
    180             raise ValueError("Masks only supported with ndarray-like inputs")
--> 181         return _sequence_to_array(obj, size, type, pool)
    182 
    183 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     24         if size is None:
     25             with nogil:
---> 26                 check_status(ConvertPySequence(sequence, pool, &out))
     27         else:
     28             c_size = size

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/builtin_convert.cc:794 code: AppendPySequence(seq, size, real_type, builder.get())
../src/arrow/python/iterators.h:60 code: func(value)
../src/arrow/python/builtin_convert.cc:414 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:259 code: CheckPyError()
an integer is required (got type str)

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-2774. Please see the migration documentation for further details.

asfimport commented 6 years ago

Antoine Pitrou / @pitrou: I'm not sure it's a good idea to do this by default, since it would hide problems with unsanitized input. Intuitively, I don't think real "union"-type data is frequent in the real world (as opposed to unsanitized data).

Also, we would have to choose a union kind (dense or sparse).

asfimport commented 6 years ago

Uwe Korn / @xhochy: This is definitely something users would like to see but I would also like to see this hidden behind a flag. Being able to deal with unsanitized input is often a typical pandas use case in exploratory data analysis but once you use this as part of a production pipeline, you rather want to have it error.

asfimport commented 6 years ago

Wes McKinney / @wesm: I agree that disabling by default is a good idea. Maybe we can have an allow_unions=True flag

asfimport commented 6 years ago

Wes McKinney / @wesm: Unions will be really helpful when we get to working on CSV reading – handling messy CSV files with columns having non-standard strings or other markers in numeric columns has been a major issue for pandas over the years.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I'm still not convinced this is a good idea. Consider pa.array([1, 2.3]). Should it return a union<int64, float64>?

cc @amol- for advice.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.


In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")

In [53]: pa.array([1, 2.3], type=typ)
...
ArrowNotImplementedError: sparse_union
../src/arrow/util/converter.h:265  VisitTypeInline(*visitor.type, &visitor)
../src/arrow/python/python_to_arrow.cc:1015  (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))
chairmank commented 1 year ago

I would also like pyarrow.array to automatically convert Python values when a sparse union or dense union type is explicitly specified. I frequently use dense union types to represent data that originated in protocol buffers with oneof fields. It is inconvenient to have to implement special handling of this case when the target Arrow schema is known.

Also, I would like to politely observe that example code snippets in previous comments are misleading, because they do not distinguish between child fields that happen to have the same data type.

Antoine Pitrou / @pitrou: I'm still not convinced this is a good idea. Consider pa.array([1, 2.3]). Should it return a union<int64, float64>?

cc @amol- for advice.

Joris Van den Bossche / @jorisvandenbossche: Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.

In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")

In [53]: pa.array([1, 2.3], type=typ)
...
ArrowNotImplementedError: sparse_union
../src/arrow/util/converter.h:265  VisitTypeInline(*visitor.type, &visitor)
../src/arrow/python/python_to_arrow.cc:1015  (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))

As an example, consider the following union type:

>>> string_predicate_type = pa.dense_union([
...     pa.field("equals", pa.string(), False),
...     pa.field("regexp", pa.string(), False),
...     pa.field("is_null", pa.null()),
... ])
>>> string_predicate_type
DenseUnionType(dense_union<regexp: string not null=0, regexp: string not null=1, is_null: null=2>)

Both equals and regexp are string, but they are semantically distinct. For pyarrow.array to convert Python values to the correct child field type, the values ought to be tagged:

pa.array([{"equals": "foo"}, {"regexp": "[0-9a-f]{16}"}, {"is_null": None}], type=string_predicate_type)
nishkakar commented 1 month ago

Hi all, I'm running into this today. See #43857

I don't think we necessarily need to do automatic type inference here i.e. make pyarrow smart and opinionated enough to infer these flexible schemas. However, being able to support user-specified pa.union Schemas would be a great help and seems non-controversial since the user is opting into it by specifying the schema to begin with.

Curious on thoughts here.

CC @assignUser