apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] support conversion to decimal type from floats? #16805

Open asfimport opened 5 years ago

asfimport commented 5 years ago

We currently allow constructing a decimal array from decimal.Decimal objects or from ints:


In [14]: pa.array([1, 0], type=pa.decimal128(2))
Out[14]: 
<pyarrow.lib.Decimal128Array object at 0x7f51fa2da818>
[
  1,
  0
]

In [31]: pa.array([decimal.Decimal('0.1'), decimal.Decimal('0.2')], pa.decimal128(2, 1))
Out[31]: 
<pyarrow.lib.Decimal128Array object at 0x7fce671172b0>
[
  0.1,
  0.2
]

but not from floats (or strings):


In [18]: pa.array([0.1, 0.2], pa.decimal128(2))
...
ArrowTypeError: int or Decimal object expected, got float

Is this something we would like to support?

You will certainly run into precision issues, but if the decimal type is fully specified, it seems clear what the user wants. In general, since decimal objects in pandas are not that easy to work with, many people might have plain float columns that they want to convert to decimal.
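To illustrate the precision issue mentioned above, here is a minimal sketch using only Python's standard `decimal` module (not pyarrow). The rounding mode is an assumption chosen for illustration; the issue does not say which rounding a float-to-decimal conversion in Arrow would use.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# A binary float cannot represent 0.1 exactly, so converting the float
# directly exposes its full binary expansion instead of "0.1".
exact = Decimal(0.1)
print(exact)

# With a fully specified target type (here one digit after the decimal
# point, as in decimal128(2, 1)), rounding to that scale recovers the
# value the user most likely meant.
one_digit = Decimal("0.1")  # quantize() takes its exponent from this value
rounded = exact.quantize(one_digit, rounding=ROUND_HALF_EVEN)  # assumed rounding mode
print(rounded)
```

The point is that the scale of the requested decimal type removes the ambiguity: the long expansion and the rounded value differ, but only the rounded one fits the type that was asked for.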

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-5905. Please see the migration documentation for further details.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: ARROW-7011 will allow doing this by calling the Array.cast() method. Is that enough for the use case?

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: That certainly solves the immediate need/functionality for being able to convert floats to decimal type in Arrow.

I would personally say that it would still be nice to be able to do this directly when converting to Arrow with pa.array (which would also ensure it works when converting, e.g., a pandas DataFrame with a float column to a pyarrow Table with a given pyarrow schema). But I suppose that once Decimal128::FromReal is added, it should also be possible to use it in python_to_arrow.cc? (Meaning we could leave this issue open as a possible future enhancement, if we want this.)
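Until something like that exists, one possible workaround is to convert the floats to decimal.Decimal objects on the Python side first. The sketch below uses only the standard library; `floats_to_decimals` is a hypothetical helper name, and the precision check is an assumption about what a user would want, not a description of Arrow's behavior.

```python
from decimal import Decimal


def floats_to_decimals(values, precision, scale):
    """Hypothetical helper: round floats to `scale` digits and check `precision`."""
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=1 gives Decimal("0.1")
    out = []
    for v in values:
        if v is None:
            out.append(None)  # keep missing values as-is
            continue
        d = Decimal(v).quantize(quantum)  # uses the default context rounding
        # Reject values with more significant digits than the target type allows.
        if len(d.as_tuple().digits) > precision:
            raise ValueError(f"{v!r} does not fit in decimal({precision}, {scale})")
        out.append(d)
    return out


print(floats_to_decimals([0.1, 0.2], precision=2, scale=1))
```

The resulting Decimal objects are the kind of input that pa.array is shown accepting at the top of this issue, e.g. together with pa.decimal128(2, 1). NaN and infinite floats are not handled in this sketch.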

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Actually, strings are not accepted either (even though, internally, Python decimal objects are first converted to strings in order to convert them to the decimal type):


In [12]: pa.array(["0.1", "0.2"], pa.decimal128(2, 1))
...
ArrowTypeError: int or Decimal object expected, got str

(Casting strings to decimal doesn't work yet either; that's probably worth another JIRA?)

So it's maybe a more general question: what types of values do we want to accept when constructing a decimal array? Right now we accept Python decimal.Decimal objects, but also ints, so why not floats or strings? After ARROW-7011, I think it would be a relatively easy addition to also accept those types in DecimalFromPyObject (https://github.com/apache/arrow/blob/bcbb3e2c350b3889c19b3c3fdbb0a88d5c8f1cbd/cpp/src/arrow/python/decimal.cc#L148-L164).
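As a rough illustration of what a more permissive per-object conversion could look like, here is a pure-Python sketch. It is not the C++ DecimalFromPyObject code; `to_decimal` is a hypothetical name, and converting floats via their repr() is just one possible policy, assumed here for the example.

```python
from decimal import Decimal


def to_decimal(value):
    """Hypothetical Python-level analogue of a converter accepting more input types."""
    if isinstance(value, Decimal):
        return value
    if isinstance(value, int):
        return Decimal(value)
    if isinstance(value, str):
        return Decimal(value)  # "0.1" parses to exactly 0.1
    if isinstance(value, float):
        # Assumed policy: go through the shortest round-tripping repr,
        # so 0.1 becomes Decimal("0.1") rather than its binary expansion.
        return Decimal(repr(value))
    raise TypeError(
        f"int, float, str or Decimal object expected, got {type(value).__name__}"
    )


print([to_decimal(v) for v in (1, "0.2", 0.1, Decimal("0.3"))])
```

A real implementation would still have to decide how floats are rounded to the requested scale and what happens when a value exceeds the requested precision.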

One disadvantage might be that the object-by-object conversion in the DecimalConverter (which goes through Python objects) could be less efficient than a cast when the input is already a typed float array.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: That sounds reasonable to me, yes.