duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

Closed · killerfurbel closed this issue 1 year ago

killerfurbel commented 1 year ago

What happens?

When writing hive partitioned Parquet files, you often get an IO Error stating that one of the partition folders could not be created:

v0.9.2-dev23 6eeb6826bb
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D copy (select * from 'C:\temp\Flights-1m.parquet')
> to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09'
D

When executing the command multiple times, the reported folder usually differs (e.g. the first time it is FL_DATE=2006-01-09, the next time FL_DATE=2006-01-04; it appears to be random).

However, it does not happen when the source file is read via the httpfs module (could be a timing issue?).

No issue:

copy (select * from 'https://www.tablab.app/sample-datasets/flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
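(Note for anyone reproducing the httpfs variant: depending on the build, the extension may need to be installed and loaded first:)

INSTALL httpfs;  -- one-time download of the extension
LOAD httpfs;     -- makes http(s):// paths readable in this session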

Fails ~90% of the time if you download the .parquet file first and then execute:

copy (select * from 'C:\temp\Flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

I also noticed that the error does not occur if I set threads to 1:

SET threads TO 1;
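Until the race is fixed, a possible session-level workaround is to lower the thread count just for the COPY and restore it afterwards (a sketch; the restore value of 8 is only an example of a machine's usual setting):

SET threads TO 1;
copy (select * from 'C:\temp\Flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
SET threads TO 8;  -- restore your usual value afterwards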

To Reproduce

OS:

Windows x64

DuckDB Version:

v0.9.2-dev23 6eeb6826bb

DuckDB Client:

CLI

Full Name:

Fabian Krüger

Affiliation:

IBIS Prof. Thome AG

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

bucweat commented 1 year ago

Also seeing this issue. v0.9.2-dev83 739da94553.

I've attached example.txt that can be run to illustrate the issue.

  1. Download example.txt and rename it to example.sql (why doesn't GitHub support .sql files?).
  2. Copy the example.sql file to a folder.
  3. Open a windows cmd prompt in that folder.
  4. Run the following:
    duckdb.exe test.duckdb < example.sql

    The first time you run it will take some time to build the customer table. After that it should run faster.

The original file sets PRAGMA threads=1; and everything works as expected (as mentioned above): the folders and files are all created correctly, and the count from the partitioned files matches the source table.

Edit example.sql and set threads to a higher value. I've included output for threads = 1, 2, 4, and 6. With threads > 1, you'll see errors like Error: near line 45: IO Error: Could not create directory: 'customer\h=3', and the counts from the partitioned data will not match the table count.

example.txt 1_threads.txt 2_threads.txt 4_threads.txt 6_threads.txt
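For readers who can't grab the attachments, a rough sketch of the shape of example.sql, reconstructed from the description above (the customer table contents and sizes are assumptions; only the 'customer' folder and partition column h come from the output):

PRAGMA threads=6;                                  -- 1 works; >1 triggers the error
CREATE TABLE IF NOT EXISTS customer AS             -- hypothetical data; the real script builds a bigger table
  SELECT i AS id, i % 8 AS h FROM range(1000000) t(i);
COPY customer TO 'customer' (FORMAT 'PARQUET', PARTITION_BY (h), OVERWRITE_OR_IGNORE 1);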

killerfurbel commented 1 year ago

I just verified your example and can also see the same errors:

┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Since you mentioned the OVERWRITE_OR_IGNORE parameter: I found it useful for adding data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well.
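For illustration, appending a non-overlapping slice to the flights dataset from the original report might look like this (the source file and date filter are hypothetical):

copy (select * from 'C:\temp\Flights-feb.parquet' where FL_DATE >= DATE '2006-02-01')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
-- new FL_DATE=... folders get created; existing partitions are left alone, per the behaviour described above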

However, I've never tried it for updating/replacing data. As far as I can tell, one data file is created per thread in each folder. If the parameter replaced existing files, this could lead to duplicate data whenever the first run used more threads than a following run (i.e. creating the dataset with 8 threads creates 8 data files; re-adding with 4 threads would replace the first 4 data files but leave the last 4 untouched).

bucweat commented 1 year ago

> I just verified your example and can also see the same errors:
>
> ┌────────────────────────────┐
> │ current_setting('threads') │
> │           int64            │
> ├────────────────────────────┤
> │                          6 │
> └────────────────────────────┘
> ...
> first time copy to empty partition customer folder (expect error)
> Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Kewl :-) It's not just the create-directory error, though; I think there are more errors behind the scenes. The folder named in the error typically ends up getting created, while others fail silently and do not, and some files are never written, which leads to the reduced count when counting from the partitioned data.
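A quick way to surface that silent mismatch (a sketch, assuming the glob below matches the layout the script produces):

select count(*) from customer;                -- row count of the source table
select count(*) from 'customer/*/*.parquet';  -- row count of the partitioned copy; lower after a failed run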

> Since you mentioned the OVERWRITE_OR_IGNORE parameter: I found it useful for adding data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well.

Just getting started with/learning partitioning with DuckDB. I used it before with Apache Drill, where it made a big difference with big data sets, and I think DuckDB is a good replacement for Drill for local processing as long as you have the memory and disk space. Anyway, yeah, I've been trying to figure out how the OVERWRITE_OR_IGNORE black box works by poking at it...

> However, I've never tried it for updating/replacing data. As far as I can tell, one data file is created per thread in each folder. If the parameter replaced existing files, this could lead to duplicate data whenever the first run used more threads than a following run (i.e. creating the dataset with 8 threads creates 8 data files; re-adding with 4 threads would replace the first 4 data files but leave the last 4 untouched).

Yeah, I'm seeing the per-thread files, but a fair number of them are empty; maybe not a surprise given my test table isn't huge. In some cases I don't get files for all threads (e.g. with threads=6 I don't always have 6 files per folder). And yeah, I think it just lets the process overwrite any existing file without getting rid of extra files, so switching from a higher to a lower threads value is probably not optimal...
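To make the stale-file concern concrete, a hypothesized layout (the data_N.parquet naming is an assumption, not confirmed behaviour):

customer/h=3/data_0.parquet ... data_7.parquet   -- written by an 8-thread run
-- a later 4-thread re-run would overwrite data_0 ... data_3 but leave data_4 ... data_7 behind as stale rows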

Thanks for trying my script out. At some point I'll probably try to debug this myself...if the DuckDB folks don't beat me to it first...