ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
102 stars 6 forks source link

feat(excelsheet): add support for multi-dtype columns #164

Closed lukapeschke closed 7 months ago

lukapeschke commented 8 months ago

closes #160

lukapeschke commented 8 months ago

Performance impact

As expected, this has a small performance impact time-wise. However, the cost seems acceptable to me. If it seems unacceptable to others, I'm open for a refactoring to make this optional.

The good news is that there is no impact memory-wise. However, iterating over all data in the sheet to build the schema appears to cost some time (note the flat memory usage that appears in the charts just after the first usage spike).

Some ideas that could help to mitigate this:

Benchmarks

Using the following script, here are the results of benchmarks on my machine (sheet of 41 columns x ~280k rows):

import argparse

import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()

if __name__ == "__main__":
    main()

Before

before

After

after

After, without the tweak in create_string_array

without_match

lukapeschke commented 8 months ago

cc @ldacey @deanm0000 @alexander-beedie in case you want to have a look :)

ldacey commented 8 months ago

Nice - yeah, being able to cast the type ahead of time or change the number of rows used for inference could be useful. I have raised an issue for polars before because a column which looked like an integer actually had some text values beyond the default row inference. In this case it was the 622nd row, but you can imagine a large file that has been sorted where all values with letters or pure numbers are grouped at the bottom of the file.

import polars as pl
import tempfile

case_numbers = [f"{i:06d}" for i in range(1, 1001)]
test = ["test" for _ in range(1, 1001)]

df = pl.DataFrame({'test': test, 'case_number': case_numbers})

df[622, "case_number"] = "CASE-NO-0A60"

with tempfile.NamedTemporaryFile(delete=True, suffix=".csv") as temp_file:
    filename = temp_file.name
    df.write_csv(filename)
    pl.read_csv(filename, n_rows=1, infer_schema_length=623)
PrettyWood commented 8 months ago

I'll review when I'm back from holidays @lukapeschke Super glad fastexcel has been added to polars ecosystem as we first planned ;) Great job @lukapeschke @alexander-beedie

lukapeschke commented 8 months ago

@PrettyWood sure no worries :slightly_smiling_face: I've released 0.8.0 in the meantime so that people can try out the windows wheels

lukapeschke commented 7 months ago

@PrettyWood last changes add: