duckdb / duckdb_spatial

MIT License
474 stars 35 forks source link

Read xlsx file using 1st row as header even using "open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])" #323

Open michaelpopes opened 4 months ago

michaelpopes commented 4 months ago

DUCKDB_READ_XLSX_BUG.xlsx

As you can see, there are two sheets in this Excel file. I wish to read all the cells as data, so I use this option "open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])".

import duckdb

duckdb.install_extension("spatial")
duckdb.load_extension("spatial")

duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='2', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='3', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)

The results are different for the two sheets. The result of sheet "2" diables the header, which is correct. But the result of sheet "3" still uses 1st row as head, which is wrong. I don't see any difference between sheet "2" and sheet "3" to cause the different results.

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import duckdb
>>> duckdb.sql(f"SELECT VERSION()").show(max_width=198)
┌───────────┐
│ version() │
│  varchar  │
├───────────┤
│ v0.10.3   │
└───────────┘

>>>
>>> duckdb.install_extension("spatial")
>>> duckdb.load_extension("spatial")
>>>
>>> duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='2', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Field1  │ Field2  │ Field3  │ Field4  │ Field5  │ Field6  │ Field7  │ Field8  │ Field9  │ Field10 │ Field11 │ Field12 │ Field13 │ Field14 │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ AC      │ BC      │ CC      │ DC      │ EC      │ FC      │ GC      │ HC      │ IC      │ JC      │ KC      │ LC      │ MC      │ NC      │
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

>>> duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='3', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│   AC    │   BC    │   CC    │   DC    │   EC    │   FC    │   GC    │   HC    │   IC    │   JC    │   KC    │   LC    │   MC    │   NC    │   OC    │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

As you can see from the results, I am using python 3.12.3 and Duckdb v0.10.3.

The correct results should be that both sheet "2" and sheet "3" have the same head, like "Field1", "Field2", etc.

Thank you for your consideration!

iku000888 commented 4 days ago

I ran into a very similar issue attempting to read a file with multiple sheets, where the first sheet respects HEADERS=DISABLE but not for the second sheet.

D .version
SQLite v1.1.1 af39bd0dcf
clang-15.0.0
iku000888 commented 4 days ago

I seem to be able to work around it by exporting the sheet of interest to a single sheeted xlsx file and then reading the exported xlsx. Not great but I can possibly live with it.

D COPY (select * from st_read('file.xlsx', layer = 'sheet')) TO 'tempoutput.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');