ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

Improve schema inference - wrongly assigns null dtype and misses reading valid data at end of large file #208

Closed SanRepo closed 2 months ago

SanRepo commented 3 months ago

Team,

I am trying to read a large xlsx file (around 500k rows with 8 columns). I was using pandas with the calamine engine before, and all was good. Then I got to know about polars with fastexcel (calamine), tried to adopt it, and found this bug.

Observation:

While reading the above large xlsx file, where the 1st column has its first valid value only at row 143,407 out of ~500k records, I observe two issues:

  1. The 1st column is inferred with a dtype of "null", as in the snapshot below:

[screenshot: inferred schema showing the 1st column typed as null]

  2. The 1st column is left empty in the csv file, even though valid values are present from row 143,407 onwards in the xlsx file. I validated this by writing the read dataframe to csv and inspecting the data.

Code details:

  1. Polars code

```python
import polars as pl

df = pl.read_excel(
    source="<<path of test.xlsx (attached)>>",
    sheet_id=1,
    engine="calamine",
    raise_if_empty=False,
)
print(df.sample(10))
```

  2. Pandas code

```python
import pandas as pd

df = pd.read_excel(path, sheet_name=0, engine="calamine", dtype=str, na_filter=False)
```

Steps to reproduce the issue:

Take the attached test.xlsx file, whose 1st column is all empty/null, and run the polars code above. It prints the 1st column's inferred data type as null.

Please refer to the pandas code above, which works for me without any data loss.

Technical feature requests to resolve the issue:

  1. An ExcelReader option to read xlsx fields "as-is" / as strings (without schema inference).
  2. An ExcelReader option to control the schema-inference row limit. (Hopefully this helps other use cases with fewer records, but not mine.)

System/lib details: Windows OS, Python 3.11.5, polars==0.20.16, fastexcel==0.9.1, pandas==2.2.1, python-calamine==0.2.0
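The two observations above are consistent with sample-based dtype inference: if only the first N rows of a column are inspected and all of them are empty, the whole column is typed as null and its later values are dropped. A minimal pure-Python sketch of that failure mode (not fastexcel's actual implementation; the function and column are illustrative, scaled down from the 143,407-row case):

```python
def infer_dtype(column, sample_rows=1_000):
    """Infer a crude dtype from at most the first `sample_rows` values.

    Mimics sample-based inference: values past the sample window are
    never inspected, so a column that is empty inside the window is
    inferred as 'null' even if later rows hold data.
    """
    sample = [v for v in column[:sample_rows] if v is not None]
    if not sample:
        return "null"
    if all(isinstance(v, float) for v in sample):
        return "float"
    return "string"

# Column that is empty until far past the sample window:
column = [None] * 5_000 + ["late value"] * 100

print(infer_dtype(column))                    # -> null (only first 1,000 rows seen)
print(infer_dtype(column, sample_rows=None))  # -> string ([:None] inspects all rows)
```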

test.xlsx

lukapeschke commented 3 months ago

Hello, this should be partially fixed by #195, which allows specifying dtypes. It will be part of 0.10.0, which we intend to release soon. In the meantime, could you check whether the master version of fastexcel solves your issue?

Regarding your file, I'm not sure what dtype you expect for Source.Name? When opening the file with LibreOffice, the column seems to be empty:

[screenshot: the file opened in LibreOffice, showing an empty first column]

SanRepo commented 3 months ago

Hi @lukapeschke,

Thank you for looking into this issue.

Regarding the file:

I gave the test.xlsx file with an empty 1st column to show the observation of the dtype being inferred as null.

[screenshot: the 1st column's dtype inferred as null]

Regarding Technical Feature requests:

Thank you for the pointer to the dtypes parameter. I tried it, and here is what I observed:

polars applies schema_overrides only after the xlsx data has already been read by fastexcel, as in the code below:

```python
def _read_spreadsheet_calamine(
    parser: Any,
    sheet_name: str | None,
    read_options: dict[str, Any],
    schema_overrides: SchemaDict | None,
    *,
    raise_if_empty: bool,
) -> pl.DataFrame:
    ws = parser.load_sheet_by_name(sheet_name, **read_options)
    df = ws.to_polars()

    if schema_overrides:
        df = df.cast(dtypes=schema_overrides)
```
And I do not see a read_options override to pass dtypes to fastexcel's ExcelReader class, as in the code below:

```python
class _ExcelReader:
    """A class representing an open Excel file and allowing to read its sheets"""

    def load_sheet_by_name(
        self,
        name: str,
        *,
        header_row: int | None = 0,
        column_names: list[str] | None = None,
        skip_rows: int = 0,
        n_rows: int | None = None,
        schema_sample_rows: int | None = 1_000,
    ) -> _ExcelSheet: ...
```

Hence, if we could have an option to read all xlsx fields "as-is" or as strings (without schema inference), that would be great!
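The "read everything as strings" behaviour can be approximated once a dtypes parameter exists (it landed in 0.10.0 via #195). A hedged sketch of a helper that builds such a mapping from a column count; the name `string_dtypes_for` is hypothetical and not part of fastexcel, but the `{index: dtype}` shape matches the DTypeMap discussed later in this thread:

```python
def string_dtypes_for(n_columns: int) -> dict[int, str]:
    """Build a dtype mapping that forces every column (by index) to 'string'.

    Hypothetical helper: not part of fastexcel, but the returned
    {index: "string"} dict is the shape fastexcel >= 0.10.0 accepts
    as a dtypes specification.
    """
    return {i: "string" for i in range(n_columns)}

# e.g. pass as read_options={"dtypes": string_dtypes_for(8)} to pl.read_excel
print(string_dtypes_for(3))  # -> {0: 'string', 1: 'string', 2: 'string'}
```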

lukapeschke commented 3 months ago

> I gave the test.xlsx file with empty 1st column, to show the observation of dtype inferred as null.

But in that case, the inferred null dtype is correct, right?

> And i do not see read_options override to pass dtypes to fastexcel ExcelReader class as in below code :

In your case, it should be possible to use dtypes through polars's read_options.

With the file you provided, this would work as follows (requires fastexcel 0.10.0):

```python
import polars as pl

df = pl.read_excel(
    "issue_208.xlsx",
    engine="calamine",
    read_options={"dtypes": {"Source.Name": "string", "Date": "string"}},
)
```

This results in the following dataframe (types of columns not specified in dtypes still get inferred):

┌─────────────┬──────────┬──────────┬───────────┬───────────────┬────────────┬─────────────────────┬───────┐
│ Source.Name ┆ Provider ┆ LastName ┆ FirstName ┆ Email         ┆ ModuleName ┆ Date                ┆ Score │
│ ---         ┆ ---      ┆ ---      ┆ ---       ┆ ---           ┆ ---        ┆ ---                 ┆ ---   │
│ str         ┆ str      ┆ str      ┆ str       ┆ str           ┆ str        ┆ str                 ┆ f64   │
╞═════════════╪══════════╪══════════╪═══════════╪═══════════════╪════════════╪═════════════════════╪═══════╡
│ null        ┆ Allstate ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-08-31 16:32:00 ┆ 0.8   │
│ null        ┆ ANW      ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2023-01-03 00:00:00 ┆ 0.7   │
│ null        ┆ Allstate ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-21 15:20:34 ┆ 0.9   │
│ null        ┆ Allstate ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-02-07 12:18:36 ┆ 0.9   │
│ null        ┆ Allstate ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-07-08 11:58:49 ┆ 0.7   │
│ …           ┆ …        ┆ …        ┆ …         ┆ …             ┆ …          ┆ …                   ┆ …     │
│ null        ┆ Barclays ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-07 11:41:41 ┆ 0.9   │
│ null        ┆ Barclays ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-10 09:30:04 ┆ 0.8   │
│ null        ┆ Barclays ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-10 09:38:39 ┆ 0.75  │
│ null        ┆ Barclays ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-10 11:21:31 ┆ 0.9   │
│ null        ┆ Barclays ┆ TestLN   ┆ TestFN    ┆ test@test.com ┆ TestModule ┆ 2022-01-10 11:42:59 ┆ 0.9   │
└─────────────┴──────────┴──────────┴───────────┴───────────────┴────────────┴─────────────────────┴───────┘
SanRepo commented 3 months ago

With fastexcel 0.10.0, I am now able to specify the "string" data type for all columns in the xlsx file and read the data "as-is" without data loss. 👍

Details on steps followed to test this case :

1. Use the attached test xlsx file and get the column count by reading the first 2 lines:

```python
df = pl.read_excel(source=path, sheet_id=1, engine="calamine", raise_if_empty=False, read_options={"n_rows": 2})
```

2. Prepare the DTypeMap with the "string" data type for all columns, using integer indexing:

```python
v_col_dtypes = {}
i = 0
while i < len(df.columns):
    v_col_dtypes[i] = "string"
    i = i + 1
print(v_col_dtypes)
```

3. Use the dtypes spec framed in step 2 above to read the data without any loss:

```python
df = pl.read_excel(source=path, sheet_id=1, engine="calamine", raise_if_empty=False, read_options={"dtypes": v_col_dtypes})
```

Below is the example code reading the xlsx file with no dtypes specified. In this case, the 1st column is inferred as null and the valid data at the end of the file is ignored/lost.

Details on steps followed to test this case :

1. Use the attached test xlsx file and read it as a dataframe:

```python
df = pl.read_excel(source=path, sheet_id=1, engine="calamine", raise_if_empty=False)
```

2. Write the dataframe as csv and validate the data (the 1st column is all empty in the output csv file, but test.xlsx has valid data in the 1st column in the last couple of rows):

```python
df.write_csv(<>)
```

To summarize, I can now read the data without loss by specifying dtypes, but I will let you decide how to handle the case where the default inference results in data loss. Also, let me know whether you want to keep this issue open until the default case is fixed, or track it separately.

test.xlsx

lukapeschke commented 2 months ago

Hello @SanRepo, thanks for the update! If you do not wish to specify dtypes, you can use the schema_sample_rows parameter, which indicates how many rows dtype detection should be done on. I'm closing this for now since the original issue appears to be solved, but feel free to open another issue in case you still have a problem that can't be addressed with one of these parameters.
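For readers landing here later, a hedged sketch of the schema_sample_rows route through polars (the file name is illustrative, and whether a given value covers your data depends on where the first non-empty cell sits, so check the fastexcel docs for your version before relying on it):

```python
# Widening the dtype-inference window via polars's read_options, which are
# forwarded to fastexcel. With ~500k rows, sampling all of them trades
# read speed for correct inference of columns that only fill in late.
read_options = {"schema_sample_rows": 500_000}

# Assumed usage (requires polars + fastexcel >= 0.10.0 and a real file):
# import polars as pl
# df = pl.read_excel("data.xlsx", engine="calamine", read_options=read_options)
print(read_options)
```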