duckdb / duckdb_spatial

MIT License
461 stars 34 forks source link

EXCEL File corrupt after exporting it to S3 #344

Open turnitoffanonagain opened 3 months ago

turnitoffanonagain commented 3 months ago

What happens?

It seems like Excel files cannot be exported correctly to S3 using the spatial extension. It's important to mention, that it actually works for most systems however when opening the excel within official microsoft excel i get a warning, that some data from the field had to be removed. It actually works perfectly if i export the same data to my local disk or open the s3 version of the file in google sheets.

To Reproduce

Just export an excel to S3 download it and open it with the OFFICIAL microsoft excel.

SET s3_access_key_id="";
SET s3_session_token="";
SET s3_secret_access_key="";
INSTALL SPATIAL;
INSTALL HTTPFS;
LOAD SPATIAL;
LOAD HTTPFS;
CREATE TABLE items (
          id INTEGER,
          name VARCHAR,
          price DOUBLE
      );
INSERT INTO items VALUES
      (1, 'Laptop', 999.99),
      (2, 'Mouse', 25.50),
      (3, 'Keyboard', 45.00);
BEGIN TRANSACTION;
COPY (SELECT * FROM items) TO 's3://testbucket/Test.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');
COMMIT;

OS:

x86

DuckDB Version:

1.0

DuckDB Client:

Python

Full Name:

Lukas Raepple

Affiliation:

Private Person

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

turnitoffanonagain commented 3 months ago

Some additional information: When unzipping the excel file and comparing the same export for local disk and s3 upload all files are the exact same. The only difference is, that the downloaded file from s3 is a little bit bigger in size before unzipping. Also setting any type of content type on the s3 object before downloading didn't have any effect. I would highly appreciate if someone could at least confirm this bug.

turnitoffanonagain commented 3 months ago

Ok, now i'm almost certain it's a problem in S3 upload functionality of spatial extension. First i was confused, because S3 Upload for CSV and parquet works perfectly with duckdb and at least parquet is binary as well. However as far as i understand the code when exporting excel it's not using the httpfs plugin to export to s3, but the spatial extension's capabilities. I created a PoC: When exporting the file to local disk using a duckdb and then uploading it via boto3 to s3 the excel export is fine. But when uploading it directly with duckdb it's corrupted.