duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License
21.05k stars 1.7k forks source link

COPY FROM not allowing union_by_name=True and not checking the schema from table #12870

Closed arpit94 closed 1 week ago

arpit94 commented 2 weeks ago

What happens?

I am trying to run a COPY FROM command to pull data from Parquet files in S3 and load it into a PostgreSQL database. The S3 directory contains multiple files.

I am encountering the following error:

failed to cast column <column_name> from type INTEGER to VARCHAR[]: Unimplemented type for cast (INTEGER -> VARCHAR[])

Additionally

This can happen when reading multiple Parquet files. The schema information is taken from the first Parquet file by default. Possible solutions:
* Enable the union_by_name=True option to combine the schema of all Parquet files (duckdb.org/docs/data/multiple_files/combining_schemas)
* Use a COPY statement to automatically derive types from an existing table.

To Reproduce

  1. Create a set of Parquet files using a CREATE TABLE command in Athena.
  2. Attach a PostgreSQL instance to DuckDB.
  3. Run a COPY FROM command similar to the following:
    COPY <db>.<schema>.<table_name>
    FROM '{s3_location}' (FORMAT PARQUET);

Issue : There is no way to add union_by_name=True to the COPY FROM command, and this is already a COPY command.

OS:

linux

DuckDB Version:

1.0.0

DuckDB Client:

python

Full Name:

Arpit Aggarwal

Affiliation:

Candor Health

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

szarnyasg commented 2 weeks ago

Hi @arpit94, thanks for opening this issue. There are many moving parts here:

It seems to me that the Athena and Postgres are not required to reproduce the issue. Instead, the issue is that the following instruction works:

CREATE OR REPLACE TABLE tbl AS FROM read_parquet('https://duckdb.org/data/prices.parquet', union_by_name = true);

But this one doesn't:

COPY tbl FROM 'prices.parquet' (FORMAT PARQUET, UNION_BY_NAME true);
Not implemented Error: Unsupported option for COPY FROM parquet: UNION_BY_NAME

You can work around your issue by first using a CREATE [OR REPLACE] TABLE command to create an interim table, then manually copy the content of that table to your target Postgres table.