dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Error for incremental loading with pyarrow backend and datetime.date initial value #1737

Closed VioletM closed 1 month ago

VioletM commented 1 month ago

dlt version

0.5.3

Describe the problem

When loading data from a SQL database using the pyarrow backend, I specified the incremental hint as follows:

sql_table_resource.apply_hints(incremental=dlt.sources.incremental(
    cursor_path="my_column",
    initial_value=datetime.date(),
    ...
))

During loading, I got the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/extract/pipe_iterator.py", line 225, in __next__
    next_item = step(item, meta=pipe_item.meta)  # type: ignore
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/extract/incremental/__init__.py", line 649, in __call__
    return self._incremental(item, meta)
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/extract/incremental/__init__.py", line 478, in __call__
    rows = self._transform_item(transformer, rows)
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/extract/incremental/__init__.py", line 329, in _transform_item
    row, self.start_out_of_range, self.end_out_of_range = transformer(row)
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/extract/incremental/transform.py", line 304, in __call__
    start_value_scalar = to_arrow_scalar(self.start_value, cursor_data_type)
  File "/opt/anaconda3/envs/elfmk/lib/python3.10/site-packages/dlt/common/libs/pyarrow.py", line 402, in to_arrow_scalar
    return pyarrow.scalar(value, type=arrow_type)
  File "pyarrow/scalar.pxi", line 1150, in pyarrow.lib.scalar
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: object of type <class 'datetime.date'> cannot be converted to int

Expected behavior

I expect incremental loading to work with datetime.date initial values.

Steps to reproduce

I believe the root cause is in dlt/extract/incremental/transform.py: from_arrow_scalar and to_arrow_scalar appear to have a side effect for datetime.date values.
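A possible workaround sketch, assuming the cursor column is a timestamp type (the report above does not state the column type, so this is unconfirmed): promote the date to a datetime before passing it as initial_value. The helper name as_cursor_datetime and the example date are hypothetical and not part of dlt.

```python
import datetime


def as_cursor_datetime(value, tz=datetime.timezone.utc):
    """Promote a plain date to a midnight datetime; pass datetimes through.

    Hypothetical helper for illustration, not part of dlt.
    """
    # datetime.datetime is a subclass of datetime.date, so check it first.
    if isinstance(value, datetime.datetime):
        return value
    if isinstance(value, datetime.date):
        # Midnight of that day, in the given timezone (None gives a naive datetime).
        return datetime.datetime.combine(value, datetime.time.min, tzinfo=tz)
    raise TypeError(f"expected date or datetime, got {type(value).__name__}")


# Example date chosen for illustration only.
initial_value = as_cursor_datetime(datetime.date(2024, 1, 1))
print(initial_value.isoformat())
```

The result could then be passed as initial_value in the apply_hints call shown earlier. For a timezone-naive timestamp column, tz=None may be the better fit; whether this avoids the ArrowTypeError depends on the actual column type.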

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

No response