dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source
Apache License 2.0
72 stars 50 forks source link

SQL database source helpers has no fallback for `tz` in `row_tuples_to_arrow` #492

Closed khoadaniel closed 5 months ago

khoadaniel commented 5 months ago

dlt version

0.4.13a0

Source name

sql_database postgres

Describe the problem

When using backend pyarrow with the postgres source, pipeline.run() failed due to the missing value of tz argument. This is not convenient for users. I suggest we just go with "UTC" as a default value for tz.

╰─➤  python sql_database_pipeline.py 
2024-06-12 14:03:52,284|[INFO                 ]|38392|8625376256|dlt|pipeline.py|_restore_state_from_destination:1452|The state was restored from the destination postgres(dlt.destinations.postgres):dlt_dev
2024-06-12 14:03:52,364|[INFO                 ]|38392|8625376256|dlt|__init__.py|bind:424|Bind incremental on dev_table with initial_value: None, start_value: None, end_value: None
Traceback (most recent call last):
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 275, in _get_source_item
    pipe_item = next(gen)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database/helpers.py", line 219, in table_rows
    yield from loader.load_rows(backend_kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database/helpers.py", line 118, in load_rows
    yield from self._load_rows(query, backend_kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database/helpers.py", line 140, in _load_rows
    yield row_tuples_to_arrow(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database/schema_types.py", line 135, in row_tuples_to_arrow
    arrow_schema = columns_to_arrow(columns, tz=tz)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/common/configuration/inject.py", line 243, in _wrap
    config = resolve_config(bound_args, accept_partial_=accept_partial)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/common/configuration/inject.py", line 171, in resolve_config
    return resolve_configuration(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/common/configuration/resolve.py", line 67, in resolve_configuration
    return _resolve_configuration(config, sections, (), explicit_value, accept_partial)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/common/configuration/resolve.py", line 157, in _resolve_configuration
    _resolve_config_fields(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/common/configuration/resolve.py", line 293, in _resolve_config_fields
    raise ConfigFieldMissingException(type(config).__name__, unresolved_fields)
dlt.common.configuration.exceptions.ConfigFieldMissingException: Following fields are missing: ['tz'] in configuration with spec ColumnsToArrowConfiguration
        for field "tz" config providers and keys were tried in following order:
                In ExplicitValues key tz was not found.
Please refer to https://dlthub.com/docs/general-usage/credentials for more information

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 443, in extract
    self._extract_source(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 1141, in _extract_source
    load_id = extract.extract(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/extract/extract.py", line 399, in extract
    self._extract_single_source(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/extract/extract.py", line 324, in _extract_single_source
    for pipe_item in pipes:
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 159, in __next__
    pipe_item = self._get_source_item()
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 306, in _get_source_item
    raise ResourceExtractionError(pipe.name, gen, str(ex), "generator") from ex
dlt.extract.exceptions.ResourceExtractionError: In processing pipe dev_table: extraction of resource dev_table in generator table_rows caused an exception: Following fields are missing: ['tz'] in configuration with spec ColumnsToArrowConfiguration
        for field "tz" config providers and keys were tried in following order:
                In ExplicitValues key tz was not found.
Please refer to https://dlthub.com/docs/general-usage/credentials for more information

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database_pipeline.py", line 289, in <module>
    load_select_tables_from_database()
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/sql_database_pipeline.py", line 34, in load_select_tables_from_database
    info = pipeline.run(source_1,
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 228, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 273, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 698, in run
    self.extract(
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 228, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 182, in _wrap
    rv = f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 168, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 273, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/dle/Documents/VSCode/data_platform_dlt/apps/dlt/postgres_tracking/.venv/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 464, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1718193832.316936 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe dev_table: extraction of resource dev_table in generator table_rows caused an exception: Following fields are missing: ['tz'] in configuration with spec ColumnsToArrowConfiguration
        for field "tz" config providers and keys were tried in following order:
                In ExplicitValues key tz was not found.
Please refer to https://dlthub.com/docs/general-usage/credentials for more information

Expected behavior

No error should be thrown. A default value of tz should be applied (e.g. "UTC").

Steps to reproduce

import dlt
from sql_database import sql_database

def load_select_tables_from_database() -> None:
    """Use the sql_database source to reflect an entire database schema and load select tables from it.
    """

    # Create a pipeline
    pipeline = dlt.pipeline(
        pipeline_name="dev_dlt_survey",
        destination='postgres',
        dataset_name="dlt_dev",
        # progress=dlt.progress.tqdm(colour="yellow"),
    )

    # Configure the source to load a few select tables incrementally
    source_1 = sql_database(schema="dlt_active_campaign",
                            backend="pyarrow").with_resources(
        "dev_table")

    source_1.dev_table.apply_hints(
        incremental=dlt.sources.incremental("sale_date"),
        primary_key="id",
    )

    info = pipeline.run(source_1,
                        write_disposition="merge",
                        refresh="drop_resources"
                        )
    print(info)

How you are using the source?

I'm considering using this source in my work, but bug is preventing this.

Operating system

macOS

Runtime environment

Local

Python version

3.9.11

dlt destination

postgres

Additional information

No response

rudolfix commented 5 months ago

@khoadaniel thanks for the report! the alpha version that you are using changes how dlt handles None passed to arguments with default values: https://github.com/dlt-hub/dlt/pull/1430#issue-2329339449

we will most probably release the current alpha as a "major" version and together with all the fixes needed in verified sources.

btw. this fix is correct :) I have exactly the same patch ready and since it is backward compatible I'll merge it before the release above