Closed killerfurbel closed 1 year ago
Also seeing this issue. v0.9.2-dev83 739da94553.

I've attached example.txt, which can be run to illustrate the issue: example.txt

Download it, rename it to example.sql (why doesn't GitHub support .sql files?), save the file to a folder, and run:

duckdb.exe test.duckdb < example.sql

The first time you run it, it will take some time to build the customer table. After that it should run faster.
The original file sets PRAGMA threads=1; and everything works as expected (as mentioned above): the folders and files are all created correctly, and the count from the partitioned files matches the source table.

Edit example.sql and set threads to a higher value. I've included output for threads = 1, 2, 4, and 6. With threads > 1, you'll see errors like

Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

and the counts of the partitioned data will not match the table count.
Attachments: example.txt, 1_threads.txt, 2_threads.txt, 4_threads.txt, 6_threads.txt
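For readers without the attachment, the failing pattern presumably looks roughly like this (a sketch only — the table name customer and partition column h are taken from the error message; the source data and row counts are illustrative):

```sql
PRAGMA threads=4;

-- build some source table (the real script generates a much larger customer table)
CREATE TABLE customer AS
SELECT i AS id, i % 10 AS h FROM range(1000) t(i);

-- write it Hive-partitioned; with threads > 1 this step intermittently
-- fails on Windows with "Could not create directory: 'customer\h=...'"
COPY customer TO 'customer'
  (FORMAT PARQUET, PARTITION_BY (h), OVERWRITE_OR_IGNORE 1);

-- sanity check: this count should match SELECT count(*) FROM customer
SELECT count(*) FROM read_parquet('customer/*/*.parquet');
```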
I just verified your example and can also see the same errors:
┌────────────────────────────┐
│ current_setting('threads') │
│ int64 │
├────────────────────────────┤
│ 6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'
Since you mentioned the OVERWRITE_OR_IGNORE parameter: I found it useful for adding data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well for adding.

However, I've never tried it for updating/replacing data. As far as I've seen, one data file is created per thread in each folder. If the parameter replaced existing files, wouldn't this lead to duplicate data when the first run used more threads than the following runs? (I.e. creating the dataset with 8 threads creates 8 data files; re-running with 4 threads would replace the first 4 data files but leave the last 4 untouched.)
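The hazard described above can be sketched as follows (an illustration, not verified behavior — that overwriting matches files up per thread is the assumption being questioned, and new_rows is a hypothetical staging table):

```sql
-- Run 1: 8 threads -> each partition folder may get up to 8 data files
PRAGMA threads=8;
COPY customer TO 'customer'
  (FORMAT PARQUET, PARTITION_BY (h), OVERWRITE_OR_IGNORE 1);

-- Run 2: 4 threads over the same data -> if only the first ~4 files per
-- folder are overwritten, the leftover files from run 1 keep their old
-- rows, and the partitioned count no longer matches the table
PRAGMA threads=4;
COPY customer TO 'customer'
  (FORMAT PARQUET, PARTITION_BY (h), OVERWRITE_OR_IGNORE 1);

-- Appending a genuinely new partition is the safe use of the flag
COPY (SELECT * FROM new_rows WHERE h = 99)
  TO 'customer' (FORMAT PARQUET, PARTITION_BY (h), OVERWRITE_OR_IGNORE 1);
```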
> I just verified your example and can also see the same errors:
> Error: near line 45: IO Error: Could not create directory: 'customer\h=3'
Kewl :-) It's not just the create-directory error, though... I think there are more errors behind the scenes. The reported folder typically ends up getting created; others that are not reported do not, and some files are not created, which leads to the reduced count when counting from the partitioned data.
> Since you mentioned the OVERWRITE_OR_IGNORE parameter: I found it useful to add data to an existing partitioned dataset. […]
Just getting started with/learning partitioning with DuckDB... I've used it before with Apache Drill, where it made a big difference with big data sets. But I think DuckDB is a good replacement for Drill for local processing, as long as you have the memory and disk space... Anyway, yeah, I've been trying to figure out how the OVERWRITE_OR_IGNORE black box works by poking at it...
> However, I've never tried it for updating/replacing data. […]
Yeah, I'm seeing the per-thread files, but a fair number of them are empty... maybe not a surprise, given my test table isn't huge. In some cases I don't get files for all threads (e.g. with threads=6 I don't always have 6 files per folder). And yeah, I think it just lets the process overwrite any existing file but doesn't get rid of extra files. So switching from a higher to a lower threads value is probably not optimal...
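One way to poke at that from SQL: read_parquet's filename option adds the source path as a column, so you can count rows per physical file and spot empty or missing per-thread files (path pattern matches the customer example above):

```sql
-- rows per data file; empty per-thread files show up with row_count = 0
SELECT filename, count(*) AS row_count
FROM read_parquet('customer/*/*.parquet', filename = true)
GROUP BY filename
ORDER BY filename;
```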
Thanks for trying my script out. At some point I'll probably try to debug this myself...if the DuckDB folks don't beat me to it first...
What happens?
When writing Hive-partitioned Parquet files, you often get an IO Error stating that one of the partition folders cannot be created:
When executing the command multiple times, the failing folder is usually different (e.g. first time it is FL_DATE=2006-01-09, next time it is FL_DATE=2006-01-04; kind of random...). However, it does not happen when the source file is read via the httpfs module (could be a timing issue?).
No issue:
Fails ~90% of the time if you download the .parquet file first and then execute:
I also noticed that the error does not occur if I set the threads to 1:
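Since the original command blocks did not survive here, a hedged reconstruction of the workaround (the flights source path and FL_DATE column come from the error messages in this report; the exact query is illustrative):

```sql
-- setting threads to 1 before the partitioned COPY avoids the error
PRAGMA threads=1;
COPY (SELECT * FROM 'C:\temp\flights.parquet')
  TO 'C:\temp\flights' (FORMAT PARQUET, PARTITION_BY (FL_DATE));
```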
To Reproduce

Save the .parquet file to the C:\temp\ folder and run the command; you will get

Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09'

most of the time (be sure to delete C:\temp\flights\ before the next try).

OS:
Windows x64
DuckDB Version:
v0.9.2-dev23 6eeb6826bb
DuckDB Client:
CLI
Full Name:
Fabian Krüger
Affiliation:
IBIS Prof. Thome AG
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?