Performance impact

As expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open for a refactoring to make this optional.

The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike).

Some ideas that could help to mitigate this:

Allowing the dtypes to be specified (as suggested in #158), which would allow us to skip the dtype guessing step entirely
Allowing to specify a maximal number of rows to sample for dtype guessing (for example, on my 280k rows spreadsheet, reading the first 1000 would be sufficient)

Benchmarks

Using the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows):

import argparse

import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()

if __name__ == "__main__":
    main()

Before

before

After

after

After, without the tweak in `create_string_array`

without_match

lukapeschke commented 8 months ago

cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :)

ldacey commented 8 months ago

Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file.

import polars as pl
import tempfile

case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]

df = pl.DataFrame({'test': test, 'case_number': case_numbers})

df[622, "case_number"] = "CASE-NO-0A60"

with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
    filename = temp_file.name
    df.write_csv(filename)
    pl.read_csv(filename, n_rows=1, infer_schema_length=623)

PrettyWood commented 8 months ago

I'll review when I'm back from holidays @lukapeschke Super glad fastexcel has been added to polars ecosystem as we first planned ;) Great job @lukapeschke @alexander-beedie

lukapeschke commented 8 months ago

@PrettyWood sure no worries :slightly_smiling_face: I've released 0.8.0 in the meantime so that people can try out the windows wheels

lukapeschke commented 7 months ago

@PrettyWood last changes add:

Merge with main (calamine 0.24)
Added support for bools for multi-dtype columns
Adapted to recent calamine changes (stuff related to ExcelDateTime)
Added schema_sample_rows param

ToucanToco / fastexcel

feat(excelsheet): add support for multi-dtype columns #164

Performance impact

Benchmarks

Before

After

After, without the tweak in `create_string_array`

ToucanToco / fastexcel

feat(excelsheet): add support for multi-dtype columns #164

Performance impact

Benchmarks

Before

After

After, without the tweak in create_string_array

After, without the tweak in `create_string_array`