ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
102 stars 6 forks source link

feat: introduce eager loading functions #147

Closed lukapeschke closed 3 months ago

lukapeschke commented 9 months ago

What

This introduces eager loading functions that make use of the calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.

Caveats

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MBs), while being about 10 times faster

Before

before

After

after

Pandas

pandas

lukapeschke commented 7 months ago

Some work is still required in calamine: https://github.com/tafia/calamine/pull/409

lukapeschke commented 7 months ago

Okay well just noticed that the API changed so we actually need to use workshet_range_ref in case Sheets are the Xlsx variant

PrettyWood commented 7 months ago

Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍

lukapeschke commented 7 months ago

new data

main

import argparse
from time import sleep
import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()

if __name__ == "__main__":
    main()

main

this branch

import argparse
from time import sleep
import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()

if __name__ == "__main__":
    main()

branch

PrettyWood commented 7 months ago

New benchmark looks great 😃

lukapeschke commented 7 months ago

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:

Benchmarks with the latest version:

iterations owned by ref
1 lazy eager
20 lazy_20 eager_20
lukapeschke commented 4 months ago

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435

lukapeschke commented 3 months ago

latest measurements with this branch

iterations master this branch (lazy) this branch (eager)
1 master_1 lazy_1 eager_1
20 master_20 lazy_20 eager_20