Closed by gfrmin 1 month ago
@gfrmin before I start looking at this: I don't quite understand from your description whether you got it to work with use_https. Was that the change you needed to make it run?
Two things are required:
1) use_https = True
2) bucket_url set without a trailing slash, i.e. s3://bucket_name and not s3://bucket_name/
@gfrmin: I have made a PR where use_https is now configurable and is set to true by default. If you like, you can give it a spin to see whether it works alright. As for the bucket URL, this should always be given without a trailing slash, and if you see any mentions in the docs where there is a trailing slash, let me know and I will fix that too.
dlt version
1.1.0
Describe the problem
Using DigitalOcean S3-compatible Spaces storage as staging for loading into Clickhouse destination fails due to incorrect URL building.
Expected behavior
Data should load correctly, but instead a DB exception is thrown by Clickhouse, e.g.
Steps to reproduce
If bucket_url is set as "s3://bucket_name/", then Clickhouse gives the error "Bucket name length is out of bounds in virtual hosted style S3 URI" because the file URL is converted to "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/_dlt_pipeline_state/1727612187.8839986.55206d6956.jsonl"
Setting bucket_url as "s3://bucket_name" still doesn't work, because the http endpoint is called (even if endpoint_url is set with https), and it returns a 302 HTTP status that is not acted upon.
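To make the two failure modes above concrete, here is a minimal sketch of the kind of conversion that seems to be happening. It is not dlt's actual code: the helper names join_naively and to_virtual_hosted_url are hypothetical, and the file name is shortened for illustration.

```python
from urllib.parse import urlparse


def join_naively(bucket_url: str, key: str) -> str:
    """Hypothetical: append the object key with a fixed '/' separator."""
    return f"{bucket_url}/{key}"


def to_virtual_hosted_url(file_url: str, endpoint_host: str, use_https: bool = False) -> str:
    """Hypothetical: turn s3://bucket/key into scheme://bucket.endpoint_host/key."""
    parsed = urlparse(file_url)
    scheme = "https" if use_https else "http"
    # parsed.path keeps every leading slash, so a doubled slash survives here
    return f"{scheme}://{parsed.netloc}.{endpoint_host}{parsed.path}"


host = "nyc3.digitaloceanspaces.com"
key = "dataset/file.jsonl"  # shortened, illustrative key

# Trailing slash in bucket_url: the path starts with '//'
with_slash = to_virtual_hosted_url(join_naively("s3://bucket_name/", key), host)

# No trailing slash, but plain http: this is the case that received the 302
without_slash_http = to_virtual_hosted_url(join_naively("s3://bucket_name", key), host)

# No trailing slash and use_https=True: the combination reported to work
without_slash_https = to_virtual_hosted_url(
    join_naively("s3://bucket_name", key), host, use_https=True
)
```

Only the last combination avoids both the double slash and the http endpoint.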
Operating system
Linux
Runtime environment
Local
Python version
3.10
dlt data source
No response
dlt destination
No response
Other deployment details
No response
Additional information
By setting use_https=True in clickhouse.py the 302 problem is fixed. I also recommend dealing with double slashes (i.e. //) that are built in URLs.
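One possible way to deal with the double slashes, sketched with the standard library only. The function name collapse_path_slashes is hypothetical and is not part of dlt. Note that S3 object keys may legitimately contain consecutive slashes, so this assumes the doubled slash only ever comes from URL building.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url: str) -> str:
    """Hypothetical helper: collapse runs of '/' in the path, leaving 'scheme://' intact."""
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))


fixed_http = collapse_path_slashes(
    "http://bucket_name.nyc3.digitaloceanspaces.com//dataset/file.jsonl"
)
fixed_s3 = collapse_path_slashes("s3://bucket_name//dataset/file.jsonl")
```

Stripping the trailing slash from bucket_url before joining would avoid the problem at the source; the helper above is a fallback for URLs that have already been built.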