ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
120 stars 6 forks source link

Fastexcel fails to load an excel file with polars #176

Closed durgeksh closed 9 months ago

durgeksh commented 9 months ago
dfs = pl.read_excel('input.xlsx', engine='calamine', sheet_id=0)
for key in dfs.keys():
    print(dfs[key].head())

Error:

  Traceback (most recent call last):
  File "/Users/Desktop/workspace/pocs/demo.py", line 27, in <module>
    dfs = pl.read_excel('compass_input.xlsx', engine='calamine', sheet_id=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py", line 136, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 259, in read_excel
    return _read_spreadsheet(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 487, in _read_spreadsheet
    parsed_sheets = {
                    ^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 488, in <dictcomp>
    name: reader_fn(
          ^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/polars/io/spreadsheet/functions.py", line 834, in _read_spreadsheet_calamine
    df = ws.to_polars()
         ^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 64, in to_polars
    df = pl.from_arrow(data=self.to_arrow())
                            ^^^^^^^^^^^^^^^
  File "/Users/Desktop/workspace/pocs/.venv/lib/python3.11/site-packages/fastexcel/__init__.py", line 47, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not create RecordBatch from sheet Hybrid_Productivity

Caused by:
    0: Could not build schema for sheet Hybrid_Productivity
    1: Error in calamine cell: NA

Installed versions:

--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-12.7.3-arm64-arm-64bit
Python:               3.11.2 (v3.11.2:878ead1ac1, Feb  7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None

Process finished with exit code 0
lukapeschke commented 9 months ago

Hello, could you please provide an excel file allowing to reproduce the bug ? Without this, it will be hard to investigate the issue

durgeksh commented 9 months ago

Hi @lukapeschke, I found that if a cell contains the value "#N/A" then it is failing to load the data. I am attaching a sample below. Thank you. sample_data.xlsx

df = pl.read_excel('sample_data.xlsx', engine='calamine')
durgeksh commented 9 months ago

Thank you so much for so quick fix @PrettyWood @lukapeschke.