lukapeschke commented 9 months ago

What

This introduces eager loading functions that make use of the calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.

Caveats

The API is kinda rough for now, it will probably need some cleaning (I mostly wanted to check if the memory gain was interesting here).
The functions need to be eager because DataTypeRef has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters)
In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: https://github.com/tafia/calamine/pull/390

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MBs), while being about 10 times faster

Before

before

After

after

Pandas

pandas

lukapeschke commented 7 months ago

Some work is still required in calamine: https://github.com/tafia/calamine/pull/409

lukapeschke commented 7 months ago

Okay well just noticed that the API changed so we actually need to use workshet_range_ref in case Sheets are the Xlsx variant

PrettyWood commented 7 months ago

Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍

lukapeschke commented 7 months ago

new data

`main`

import argparse
from time import sleep
import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()

if __name__ == "__main__":
    main()

main

this branch

import argparse
from time import sleep
import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()

if __name__ == "__main__":
    main()

branch

PrettyWood commented 7 months ago

New benchmark looks great 😃

lukapeschke commented 7 months ago

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:

Benchmarks with the latest version:

iterations	owned	by ref
1
20

lukapeschke commented 4 months ago

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435

lukapeschke commented 3 months ago