Open pwr-philarmstrong opened 2 days ago
I think your problem is related to this: https://github.com/dlt-hub/dlt/issues/2030, which in turn is unfortunately waiting for a delta "bug" to be fixed.
We'll probably merge #2030 to give our users a workaround (merging many smaller files instead of a single big one).
OK, I'm not sure of the details of that issue. It's worth pointing out that the file is created correctly in the normalized completed_jobs folder, but when dlt sends it to Azure it tries to do it in a single block. If I use parquet instead of delta, the file is sent in small blocks. Also, delta files are created fine on a local filesystem; it's only when I write to Azure that it fails. I'm not sure whether other cloud services have a similar issue.
dlt version
1.3.0
Describe the problem
I have a pipeline that copies a table from SQL Server to Azure Data Lake Storage Gen2. It creates delta files and works fine when the parquet files are small; however, when they get larger the upload fails and goes into a retry loop.
With logging enabled on the Azure storage account, I can see error details like this:
The pipeline part looks like this:
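The actual snippet isn't captured above; as a stand-in, a minimal dlt 1.x pipeline along these lines (pipeline, dataset, and table names are hypothetical; the bucket_url and Azure credentials would live in .dlt/secrets.toml or environment variables) would be:

```python
import dlt
from dlt.sources.sql_database import sql_database

pipeline = dlt.pipeline(
    pipeline_name="mssql_to_adls",   # hypothetical name
    destination="filesystem",        # bucket_url = "az://<container>/..." in config
    dataset_name="completed_jobs",   # hypothetical name
)

# Load one SQL Server table and write it as a delta table.
source = sql_database().with_resources("completed_jobs")
info = pipeline.run(source, table_format="delta")
print(info)
```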
And this is a chunk of the log file from around the failure, though it only shows that it is waiting and that it fails after 5 retries.
Expected behavior
I would expect the files to be sent to Azure storage successfully, either as a single upload (as with the smaller files) or broken into blocks.
Steps to reproduce
I've managed to create some code that reproduces the issue. It works if the file is sent as parquet, but it fails if the format is delta.
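The repro code isn't shown above, but a sketch along these lines, using a generated DataFrame rather than SQL Server (all names are hypothetical, and the row count is a guess at what pushes the parquet file past the failing size), illustrates the shape of it:

```python
import dlt
import numpy as np
import pandas as pd

# Generate enough rows that the resulting parquet file is "large".
n = 5_000_000  # illustrative; tune until the upload starts failing
df = pd.DataFrame({"id": np.arange(n), "payload": np.random.rand(n)})

pipeline = dlt.pipeline(
    pipeline_name="delta_upload_repro",  # hypothetical name
    destination="filesystem",            # bucket_url = "az://..." in config
    dataset_name="repro",                # hypothetical name
)

# Plain parquet uploads fine:
# pipeline.run(df, table_name="big_table", loader_file_format="parquet")

# Writing the same data as a delta table triggers the retry loop:
pipeline.run(df, table_name="big_table", table_format="delta")
```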
Operating system
Windows
Runtime environment
Local
Python version
3.11
dlt data source
Microsoft SQL Server, but the problem also happens with a DataFrame data source
dlt destination
Filesystem & buckets
Other deployment details
No response
Additional information
Looking at the Azure logs for smaller files, they look like this:
I also tried sending the larger parquet file using a standalone Python script and the azure.storage.blob package.
This worked fine and sent the file in blocks; the logs for one block look like this:
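The standalone script isn't included above; a minimal sketch of what a chunked upload with azure.storage.blob looks like (connection string, container, and file names are hypothetical placeholders) is:

```python
import os
from azure.storage.blob import BlobServiceClient

# Explicit block sizes make the SDK stage the upload as multiple
# Put Block calls rather than a single request.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],  # hypothetical
    max_block_size=4 * 1024 * 1024,        # 4 MiB per block
    max_single_put_size=8 * 1024 * 1024,   # larger payloads are chunked
)
blob = service.get_blob_client("mycontainer", "completed_jobs/big_file.parquet")

with open("big_file.parquet", "rb") as f:
    # upload_blob stages blocks concurrently, then commits the block list
    blob.upload_blob(f, overwrite=True, max_concurrency=4)
```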
I was also able to send plain parquet files to Azure without an issue; however, delta seems to create larger parquet files.
I also tried adjusting a number of the dlt config items, e.g.:
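The specific items aren't listed above. A hedged guess at the most relevant knobs is dlt's data writer limits, which cap the size of the files produced during normalize; a sketch setting them via environment variables (the values are illustrative, and whether smaller files avoid the single-block delta upload is exactly what's in question):

```python
import os

# Cap normalized file size so each parquet file stays small.
# Keys follow dlt's SECTION__SUBSECTION__KEY env var convention and are
# equivalent to [normalize.data_writer] file_max_items / file_max_bytes
# in .dlt/config.toml. Values here are illustrative guesses.
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_ITEMS"] = "100000"
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_BYTES"] = str(100 * 1024 * 1024)  # ~100 MiB
```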