ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

feat: read excel from bytes content #192

Closed PrettyWood closed 9 months ago

PrettyWood commented 9 months ago

closes #162

PrettyWood commented 9 months ago
```python
import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("--bytes", action="store_true")
    return parser.parse_args()


def main():
    args = get_args()
    if args.bytes:
        # New code path: pass the file's raw bytes content
        with open(args.file, "rb") as f:
            excel_file = fastexcel.read_excel(f.read())
    else:
        # Existing code path: pass the file path
        excel_file = fastexcel.read_excel(args.file)

    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()


if __name__ == "__main__":
    main()
```

File 1

|         | file       | bytes       |
|---------|------------|-------------|
| `&[u8]` | slice-file | slice-bytes |
| `Vec`   | vec-file   | vec-bytes   |

File 2

|         | file                | bytes                |
|---------|---------------------|----------------------|
| `&[u8]` | formulas-slice-file | formulas-slice-bytes |
| `Vec`   | formulas-vec-file   | formulas-vec-bytes   |

*(table cells are benchmark screenshots)*

It seems to clone the data @lukapeschke. I can revert the last commit, but I don't have a direct solution in mind to avoid the leak. I may be missing something.

lukapeschke commented 9 months ago

Indeed, it seems like `&[u8]` would be more memory-efficient. Unfortunately, I ran some tests with the following script (5 and 10 iterations), and it seems like the memory really does leak :confused: So I think we will have to keep the owned types:

```python
import argparse
import gc
import time

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    use_columns = args.column or None

    for _ in range(10):
        with open(args.file, "rb") as fd:
            content = fd.read()
            excel_file = fastexcel.read_excel(content)

        sheet = excel_file.load_sheet_by_name(excel_file.sheet_names[0], use_columns=use_columns)
        arrow_data = sheet.to_arrow()

        # Drop every Python-side reference and force a collection, so any
        # memory still held can only come from the extension side.
        del excel_file
        del content
        del sheet
        del arrow_data
        gc.collect()
        time.sleep(0.25)


if __name__ == "__main__":
    main()
```
|                | 5 iterations | 10 iterations |
|----------------|--------------|---------------|
| HEAD (owned)   | vec          | vec_10        |
| HEAD^ (slice)  | slice        | slice_10      |

*(table cells are memory-profiling screenshots)*
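To put numbers on a run like the one above, the loop can also sample the process's peak RSS after each iteration using the stdlib `resource` module (Unix-only). This is a minimal sketch of that methodology; `workload` is a stand-in for the fastexcel calls, not part of the library:

```python
import gc
import resource


def peak_rss_kib() -> int:
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def sample_memory(workload, iterations: int = 5) -> list[int]:
    """Run `workload` repeatedly, recording peak RSS after each run."""
    samples = []
    for _ in range(iterations):
        workload()
        gc.collect()
        samples.append(peak_rss_kib())
    return samples


# Stand-in workload: allocate and drop ~8 MiB per iteration.
samples = sample_memory(lambda: bytes(8 * 1024 * 1024))
# ru_maxrss is a high-water mark: a series that keeps climbing across
# iterations suggests memory is being retained between runs.
print(samples)
```

Since `ru_maxrss` only ever grows, a flat series after the first iteration is the healthy outcome; steady growth per iteration is the leak signature.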
PrettyWood commented 9 months ago

Yes, this is why I didn't revert, but I'm not happy with it. `read_excel` is way slower with bytes (with `Vec<u8>`) than with a file path, while the loading step is almost the same. Of course it's fine if the content is small, but for big content the difference is really significant.
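A timing comparison like this one can be reproduced with a small `perf_counter` harness. This is a sketch only; the lambdas are stand-in workloads where the real comparison would use `fastexcel.read_excel(path)` versus `fastexcel.read_excel(content)`:

```python
import time


def bench(fn, repeats: int = 3) -> float:
    """Return the best wall-clock time of `fn` over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workloads; swap in the path-based and bytes-based calls.
path_time = bench(lambda: sum(range(100_000)))
bytes_time = bench(lambda: sum(range(200_000)))
print(f"path: {path_time:.6f}s  bytes: {bytes_time:.6f}s")
```

Taking the best of several runs rather than the mean reduces noise from the OS scheduler and filesystem cache, which matters when the two code paths differ mainly in one extra copy.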

PrettyWood commented 9 months ago

Ok I have something :) With the last commit:

BEFORE: (benchmark screenshot)

AFTER: (benchmark screenshot)

lukapeschke commented 9 months ago

Haha I was just benchmarking something similar :smile: (benchmark screenshot)

Looks good!!

PrettyWood commented 9 months ago

Yep, even better! The trick was just to avoid downcasting to a `Vec` via pyo3 and to let `to_vec` do the magic instead.
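The copy-vs-borrow distinction behind this fix can be illustrated in pure Python (an analogy only, not the actual pyo3 mechanics): `bytearray(content)` always materializes a fresh buffer, while `memoryview(content)` reads the existing one without duplicating it.

```python
content = b"x" * (4 * 1024 * 1024)  # ~4 MiB payload

# Copying: bytearray always allocates a fresh buffer of the same size.
copied = bytearray(content)

# Borrowing: memoryview exposes the same underlying buffer, zero-copy.
view = memoryview(content)

print(view.obj is content)          # True: no duplication happened
print(len(copied) == view.nbytes)   # True: same logical content either way
```

The performance question in the PR is essentially when that extra buffer materialization happens and how many times, which is negligible for small files but dominates for large ones.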

alexander-beedie commented 8 months ago

> closes #162

Nice!

alexander-beedie commented 6 months ago

@PrettyWood Finally integrated this on our side: https://github.com/pola-rs/polars/pull/16344 Thanks! 👍