dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Incorrect S3 URL when using DigitalOcean Spaces as staging and ClickHouse as destination #1897

Closed: gfrmin closed this 1 month ago

gfrmin commented 1 month ago

dlt version

1.1.0

Describe the problem

Using DigitalOcean's S3-compatible Spaces storage as staging for loading into a ClickHouse destination fails due to incorrect URL building.
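For context, a minimal sketch of the kind of pipeline that hits this, assuming DigitalOcean Spaces is configured as the filesystem staging (pipeline, dataset, and table names here are illustrative; credentials come from secrets.toml):

import dlt

# Spaces acts as the staging filesystem; ClickHouse is the final destination.
pipeline = dlt.pipeline(
    pipeline_name="s3_pipeline",
    destination="clickhouse",
    staging="filesystem",
    dataset_name="dataset",
)

# Any small resource is enough: loading it makes dlt hand ClickHouse
# an S3 URL for the staged file, which is where the failure occurs.
info = pipeline.run([{"id": 1}], table_name="items")
print(info)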

Expected behavior

Data should load correctly, but instead ClickHouse throws a DB exception, e.g.

2024-09-29 15:27:21,065|[ERROR]|341017|140111794464320|dlt|reference.py|run_managed:403|Transient exception in job _dlt_pipeline_state.55206d6956.reference in file /home/g/.dlt/pipelines/s3_pipeline/load/normalized/1727612187.8839986/started_jobs/_dlt_pipeline_state.55206d6956.5.reference
Traceback (most recent call last):
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
    response = execute(
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 382, in execute
    rv = self.process_ordinary_query(
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 580, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 212, in receive_result
    return result.get_result()
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 228, in packet_generator
    packet = self.receive_packet()
  File "/home/g/git/pdmdlt/.venv/lib/python3.10/site-packages/clickhouse_driver/client.py", line 245, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 499.
DB::Exception: Failed to get object info: No response body.. HTTP response code: 302: while reading _dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl: While executing S3Source. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000d0fa5bb
1. DB::S3Exception::S3Exception<String const&, unsigned long>(Aws::S3::S3Errors, fmt::v9::basic_format_string<char, fmt::v9::type_identity<String const&>::type, fmt::v9::type_identity<unsigned long>::type>, String const&, unsigned long&&) @ 0x00000000103ec34a
2. DB::S3::getObjectInfo(DB::S3::Client const&, String const&, String const&, String const&, bool, bool) @ 0x00000000103ec6a8
3. DB::S3ObjectStorage::getObjectMetadata(String const&) const @ 0x00000000103e2cfe
4. DB::StorageObjectStorageSource::KeysIterator::nextImpl(unsigned long) @ 0x0000000010337a05
5. DB::StorageObjectStorageSource::IIterator::next(unsigned long) @ 0x0000000010334ef9
6. DB::StorageObjectStorageSource::createReader(unsigned long, std::shared_ptr<DB::StorageObjectStorageSource::IIterator> const&, std::shared_ptr<DB::StorageObjectStorage::Configuration> const&, std::shared_ptr<DB::IObjectStorage> const&, DB::ReadFromFormatInfo const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::KeyCondition const> const&, std::shared_ptr<DB::Context const> const&, DB::SchemaCache*, std::shared_ptr<Poco::Logger> const&, unsigned long, unsigned long, bool) @ 0x00000000103329ec
7. DB::StorageObjectStorageSource::createReader() @ 0x0000000010331c87
8. DB::StorageObjectStorageSource::generate() @ 0x0000000010331e96
9. DB::ISource::tryGenerate() @ 0x0000000012fbba35
10. DB::ISource::work() @ 0x0000000012fbb4c2
11. DB::ExecutionThreadContext::executeTask() @ 0x0000000012fd5347
12. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000012fc9c30
13. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::PipelineExecutor::spawnThreads()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x0000000012fcb2ae
14. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false, true>, void*>) @ 0x000000000d1b345b
15. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000d1b74b1
16. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000d1b6263
17. ? @ 0x00007883062ecac3
18. ? @ 0x000078830637e850

Steps to reproduce

If bucket_url is set to "s3://bucket_name/", ClickHouse reports "Bucket name length is out of bounds in virtual hosted style S3 URI", because the file URL is built as "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl" (note the double slash).

Setting bucket_url to "s3://bucket_name" still doesn't work: the http endpoint is called (even though endpoint_url is set with https), and the resulting 302 HTTP redirect is not followed.
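For reference, the staging side would be configured roughly like this in secrets.toml (bucket name, region, and credentials are placeholders):

[destination.filesystem]
bucket_url = "s3://bucket_name"  # a trailing slash here produces the double-slash URL above

[destination.filesystem.credentials]
aws_access_key_id = "<spaces-key>"
aws_secret_access_key = "<spaces-secret>"
endpoint_url = "https://nyc3.digitaloceanspaces.com"  # https here is still turned into an http URL on the ClickHouse side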

Operating system

Linux

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

Setting use_https=True in clickhouse.py fixes the 302 problem. I also recommend handling the double slashes (i.e. //) that get built into URLs.
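To illustrate the double-slash point, a standalone sketch (not dlt's actual code) of how a trailing slash in the base URL leaks into the final URL, and how normalizing both sides avoids it:

# Base URL as built from a bucket_url that kept its trailing slash
base = "http://bucket_name.nyc3.digitaloceanspaces.com/"
key = "dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl"

# Naive joining reproduces the broken URL from the report ("...com//dataset/...")
broken = base + "/" + key

# Stripping slashes on both sides before joining yields a single separator
fixed = base.rstrip("/") + "/" + key.lstrip("/")

print(broken)
print(fixed)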

sh-rp commented 1 month ago

@gfrmin before I start looking at this: I don't quite understand from your description whether you got it to work with use_https? Was that the thing you needed to do so it runs?

gfrmin commented 1 month ago

Two things are required (see the config sketch below):

1) use_https = True
2) setting bucket_url without a trailing slash, i.e. s3://bucket_name and not s3://bucket_name/
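In secrets.toml terms, the second point is just (placeholder bucket name):

[destination.filesystem]
bucket_url = "s3://bucket_name"  # no trailing slash

The first point, at this stage, meant editing use_https in clickhouse.py directly, since the flag was not yet configurable.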

sh-rp commented 1 month ago

@gfrmin: I have made a PR where use_https is now configurable and set to true by default. If you like, you can give it a spin to see whether it works alright. As for the bucket URL, this should always be given without a trailing slash; if you see any mentions in the docs with a trailing slash, let me know and I will fix that too.
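If the PR surfaces the flag through normal dlt configuration, trying it out might look like this; the key name and section below are assumptions based on this thread, not a confirmed dlt setting:

[destination.clickhouse]
use_https = true  # hypothetical key/location; per the PR it defaults to true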