dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Support incremental load with Arrow when cursor_column is not nullable #1790

Closed willi-mueller closed 2 months ago

willi-mueller commented 2 months ago

dlt version

unreleased devel branch

Describe the problem

When the cursor column (`cursor_column`) of an Arrow table is declared as not nullable, incremental loading fails with the following error:

E           <class 'dlt.extract.exceptions.ResourceExtractionError'>
E           In processing pipe some_data: extraction of resource some_data in transform IncrementalResourceWrapper caused an exception: cannot access local variable 'tbl_without_null' where it is not associated with a value
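
The wording of the message ("cannot access local variable ... where it is not associated with a value") is what Python 3.11 prints for an `UnboundLocalError`, which typically means a local variable was assigned in only one branch and then read on a path that skipped that branch. The sketch below reproduces that general pattern and one possible fix. It is not dlt's code: `drop_nulls_buggy`, `drop_nulls_fixed`, and the `column_is_nullable` flag are hypothetical names chosen for illustration, and the actual cause inside `IncrementalResourceWrapper` may differ.

```python
from typing import List, Optional


def drop_nulls_buggy(
    values: List[Optional[int]], column_is_nullable: bool
) -> List[Optional[int]]:
    """Hypothetical helper, NOT dlt's code: the result is bound in one branch only."""
    if column_is_nullable:
        without_null = [v for v in values if v is not None]
    # If column_is_nullable is False, `without_null` was never assigned,
    # so reading it here raises UnboundLocalError.
    return without_null


def drop_nulls_fixed(
    values: List[Optional[int]], column_is_nullable: bool
) -> List[Optional[int]]:
    """Hypothetical fix: bind the result on every path before reading it."""
    without_null = values  # default: nothing to filter for a non-nullable column
    if column_is_nullable:
        without_null = [v for v in values if v is not None]
    return without_null


# Non-nullable column: the buggy variant fails, the fixed one passes values through.
try:
    drop_nulls_buggy([1, 2, 2], column_is_nullable=False)
except UnboundLocalError as exc:
    error_message = str(exc)

fixed_result = drop_nulls_fixed([1, 2, 2], column_is_nullable=False)
```

Binding the variable before the conditional keeps the non-nullable path valid without changing what happens to nullable columns.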

Expected behavior

Incremental loading of an Arrow table should succeed when the cursor column is not nullable.

Steps to reproduce

    import dlt
    import pyarrow as pa
    from dlt.common.utils import uniq_id

    data = [
        {"id": 1, "created_at": 1},
        {"id": 2, "created_at": 2},
        {"id": 3, "created_at": 2},
    ]
    # Both columns are declared non-nullable, including the cursor column.
    schema = pa.schema([
        pa.field("id", pa.int32(), nullable=False),
        pa.field("created_at", pa.int32(), nullable=False),
    ])
    id_array = pa.array([item["id"] for item in data], type=pa.int32())
    created_at_array = pa.array([item["created_at"] for item in data], type=pa.int32())
    source_items = pa.Table.from_arrays([id_array, created_at_array], schema=schema)

    @dlt.resource
    def some_data(
        created_at=dlt.sources.incremental("created_at", on_cursor_value_missing="include")
    ):
        yield source_items

    # Running the pipeline raises the ResourceExtractionError shown above.
    p = dlt.pipeline(pipeline_name=uniq_id())
    p.run(some_data(), destination="duckdb")

Operating system

macOS

Runtime environment

Local

Python version

3.11
