apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Python] Unable to cast `date{32,64}` to `date{32,64}` #43183

Closed by Fokko 3 months ago

Fokko commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

It looks like I'm able to cast ints and strings:

> import pyarrow as pa

> n_legs = pa.array([2, 2, 4, 4, 5, 100])
> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
> names = ["n_legs", "animals"]

> batch = pa.RecordBatch.from_arrays([n_legs, animals], names=names)
> batch

pyarrow.RecordBatch
n_legs: int64
animals: string
----
n_legs: [2,2,4,4,5,100]
animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]

> schema = pa.schema([
>     ('n_legs', pa.int64()),
>     ('animals', pa.string()),
> ])
> pa.RecordBatchReader.from_batches(
>     schema,
>     [batch]
> ).cast(schema).read_all()

pyarrow.Table
n_legs: int64
animals: string
----
n_legs: [[2,2,4,4,5,100]]
animals: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

But it seems to fail with a date32:

> import pyarrow as pa
> from datetime import date
> birthday = [date(1990, 3, 1)]
> names = ["Fokko"]
> batch = pa.RecordBatch.from_arrays([birthday, names], names=['birthday', 'name'])
> batch
pyarrow.RecordBatch
birthday: date32[day]
name: string
----
birthday: [1990-03-01]
name: ["Fokko"]

> schema = pa.schema([
>     ('birthday', pa.date32()),
>     ('name', pa.string()),
> ])

> pa.RecordBatchReader.from_batches(
>     schema,
>     [batch]
> ).cast(schema).read_all()

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[6], line 9
      1 schema = pa.schema([
      2     ('birthday', pa.date32()),
      3     ('name', pa.string()),
      4 ])
      6 pa.RecordBatchReader.from_batches(
      7     schema,
      8     [batch]
----> 9 ).cast(schema).read_all()

File /opt/homebrew/lib/python3.10/site-packages/pyarrow/ipc.pxi:800, in pyarrow.lib.RecordBatchReader.cast()
File /opt/homebrew/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/homebrew/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowTypeError: Field 0 cannot be cast from date32[day] to date32[day]

Same for date64:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[42], line 15
      4 schema = pa.schema([
      5     # ('date32', pa.date32()),
      6     ('date64', pa.date64()),
      7 ])
      9 batch = pa.RecordBatch.from_arrays(data, schema=schema)
     12 table = pa.RecordBatchReader.from_batches(
     13     schema,
     14     [batch]
---> 15 ).cast(schema).read_all()
     17 assert table['date32'][0].as_py() == dt
     18 assert table['date64'][0].as_py() == dt

File /opt/homebrew/lib/python3.10/site-packages/pyarrow/ipc.pxi:800, in pyarrow.lib.RecordBatchReader.cast()
File /opt/homebrew/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/homebrew/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowTypeError: Field 0 cannot be cast from date64[ms] to date64[ms]

This looks like a valid cast operation to me. Please advise. I'm happy to create a PR; since I'm not familiar with the codebase, it would be very helpful if someone could point out where I should add the test :)

> pa.__version__
'16.1.0'

Component(s)

C++

pitrou commented 3 months ago

Issue resolved by pull request 43192 https://github.com/apache/arrow/pull/43192