dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

default Azure/AWS credentials not working with `delta` table format on `filesystem` destination #2055

Open jorritsandbrink opened 3 days ago

jorritsandbrink commented 3 days ago

dlt version

1.3.1a1

Describe the problem

Using default Azure/AWS credentials leads to an error when using the `delta` table format on the `filesystem` destination. It works fine with other formats, e.g. `parquet`.

For GCP this has recently been fixed in https://github.com/dlt-hub/dlt/issues/1965.

Expected behavior

Default Azure/AWS credentials are handled properly and can be used to authenticate.

Steps to reproduce

For Azure:

import os
import dlt

# set dlt env vars
os.environ["CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = "dltdata"
os.environ["BUCKET_URL"] = "az://dlt-ci-test-bucket"

# set default Azure credentials
os.environ["AZURE_TENANT_ID"] = "MY_TENANT_ID"
os.environ["AZURE_CLIENT_ID"] = "MY_CLIENT_ID"
os.environ["AZURE_CLIENT_SECRET"] = "MY_CLIENT_SECRET"

pipe = dlt.pipeline("my_pipe", destination="filesystem")

pipe.run([{"foo": 1}], table_name="my_table", table_format="delta")

Traceback:

2024-11-13 10:28:18,585|[ERROR]|66899|140167369590464|dlt|reference.py|run_managed:431|Transient exception in job my_table.50fe02e280.reference in file /home/j/.dlt/pipelines/my_pipe/load/normalized/1731479294.7018855/started_jobs/my_table.50fe02e280.0.reference
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/common/destination/reference.py", line 422, in run_managed
    self.run()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 143, in run
    delta_table = self._delta_table()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 187, in _delta_table
    if DeltaTable.is_deltatable(self.make_remote_url(), storage_options=self._storage_options):
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 436, in is_deltatable
    return RawDeltaTable.is_deltatable(table_uri, storage_options)
TypeError: argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,586|[WARNING]|66899|140167860366208|dlt|load.py|complete_jobs:430|Job for my_table.50fe02e280.reference retried in load 1731479294.7018855 with message argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,587|[ERROR]|66899|140167369590464|dlt|reference.py|run_managed:431|Transient exception in job my_table.50fe02e280.reference in file /home/j/.dlt/pipelines/my_pipe/load/normalized/1731479294.7018855/started_jobs/my_table.50fe02e280.1.reference
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/common/destination/reference.py", line 422, in run_managed
    self.run()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 143, in run
    delta_table = self._delta_table()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 187, in _delta_table
    if DeltaTable.is_deltatable(self.make_remote_url(), storage_options=self._storage_options):
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 436, in is_deltatable
    return RawDeltaTable.is_deltatable(table_uri, storage_options)
TypeError: argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,587|[WARNING]|66899|140167860366208|dlt|load.py|complete_jobs:430|Job for my_table.50fe02e280.reference retried in load 1731479294.7018855 with message argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,589|[ERROR]|66899|140167369590464|dlt|reference.py|run_managed:431|Transient exception in job my_table.50fe02e280.reference in file /home/j/.dlt/pipelines/my_pipe/load/normalized/1731479294.7018855/started_jobs/my_table.50fe02e280.2.reference
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/common/destination/reference.py", line 422, in run_managed
    self.run()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 143, in run
    delta_table = self._delta_table()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 187, in _delta_table
    if DeltaTable.is_deltatable(self.make_remote_url(), storage_options=self._storage_options):
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 436, in is_deltatable
    return RawDeltaTable.is_deltatable(table_uri, storage_options)
TypeError: argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,589|[WARNING]|66899|140167860366208|dlt|load.py|complete_jobs:430|Job for my_table.50fe02e280.reference retried in load 1731479294.7018855 with message argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,590|[ERROR]|66899|140167369590464|dlt|reference.py|run_managed:431|Transient exception in job my_table.50fe02e280.reference in file /home/j/.dlt/pipelines/my_pipe/load/normalized/1731479294.7018855/started_jobs/my_table.50fe02e280.3.reference
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/common/destination/reference.py", line 422, in run_managed
    self.run()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 143, in run
    delta_table = self._delta_table()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 187, in _delta_table
    if DeltaTable.is_deltatable(self.make_remote_url(), storage_options=self._storage_options):
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 436, in is_deltatable
    return RawDeltaTable.is_deltatable(table_uri, storage_options)
TypeError: argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,590|[WARNING]|66899|140167860366208|dlt|load.py|complete_jobs:430|Job for my_table.50fe02e280.reference retried in load 1731479294.7018855 with message argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,591|[ERROR]|66899|140167369590464|dlt|reference.py|run_managed:431|Transient exception in job my_table.50fe02e280.reference in file /home/j/.dlt/pipelines/my_pipe/load/normalized/1731479294.7018855/started_jobs/my_table.50fe02e280.4.reference
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/common/destination/reference.py", line 422, in run_managed
    self.run()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 143, in run
    delta_table = self._delta_table()
  File "/home/j/repos/dlt/dlt/destinations/impl/filesystem/filesystem.py", line 187, in _delta_table
    if DeltaTable.is_deltatable(self.make_remote_url(), storage_options=self._storage_options):
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/table.py", line 436, in is_deltatable
    return RawDeltaTable.is_deltatable(table_uri, storage_options)
TypeError: argument 'storage_options': 'bool' object cannot be converted to 'PyString'
2024-11-13 10:28:18,593|[WARNING]|66899|140167860366208|dlt|load.py|complete_jobs:430|Job for my_table.50fe02e280.reference retried in load 1731479294.7018855 with message argument 'storage_options': 'bool' object cannot be converted to 'PyString'
Traceback (most recent call last):
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 605, in load
    runner.run_pool(load_step.config, load_step)
  File "/home/j/repos/dlt/dlt/common/runners/pool_runner.py", line 91, in run_pool
    while _run_func():
  File "/home/j/repos/dlt/dlt/common/runners/pool_runner.py", line 84, in _run_func
    run_metrics = run_f.run(cast(TExecutor, pool))
  File "/home/j/repos/dlt/dlt/load/load.py", line 638, in run
    self.load_single_package(load_id, schema)
  File "/home/j/repos/dlt/dlt/load/load.py", line 597, in load_single_package
    raise pending_exception
dlt.load.exceptions.LoadClientJobRetry: Job for my_table.50fe02e280.reference had 5 retries which a multiple of 5. Exiting retry loop. You can still rerun the load package to retry this job. Last failure message was argument 'storage_options': 'bool' object cannot be converted to 'PyString'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/j/repos/dlt/mre.py", line 16, in <module>
    pipe.run([{"foo": 1}], table_name="my_table", table_format="delta")
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 223, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 272, in _wrap
    return f(self, *args, **kwargs)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 744, in run
    return self.load(destination, dataset_name, credentials=credentials)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 223, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 163, in _wrap
    return f(self, *args, **kwargs)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 272, in _wrap
    return f(self, *args, **kwargs)
  File "/home/j/repos/dlt/dlt/pipeline/pipeline.py", line 612, in load
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage load when processing package 1731479294.7018855 with exception:

<class 'dlt.load.exceptions.LoadClientJobRetry'>
Job for my_table.50fe02e280.reference had 5 retries which a multiple of 5. Exiting retry loop. You can still rerun the load package to retry this job. Last failure message was argument 'storage_options': 'bool' object cannot be converted to 'PyString'

Operating system

Linux

Runtime environment

Local

Python version

3.9

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

I did not test AWS default credentials, but I assume they won't work either since I never wrote logic to handle them.
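
The `TypeError` in the traceback suggests that at least one value in the storage options mapping is a Python `bool`, while `deltalake` expects string values there. Below is a minimal sketch of how such a mapping could be normalized before it is passed on. The helper `stringify_storage_options` is hypothetical (it is not part of dlt or deltalake), and the keys in the sample input are made up for illustration; the options dlt actually produces may look different.

```python
from typing import Any, Dict


def stringify_storage_options(options: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: drop unset values and convert the rest to strings."""
    normalized: Dict[str, str] = {}
    for key, value in options.items():
        if value is None:
            # skip options that were never set
            continue
        if isinstance(value, bool):
            # render booleans as lowercase "true" / "false"
            normalized[key] = str(value).lower()
        else:
            normalized[key] = str(value)
    return normalized


# made-up input for illustration only
raw_options = {"account_name": "dltdata", "some_flag": False, "unset_option": None}
clean_options = stringify_storage_options(raw_options)
print(clean_options)
```

This only addresses the type mismatch. Whether the resulting options are enough for default credentials to authenticate is a separate question that this sketch does not answer.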