ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

feat: read excel from bytes content #192

Closed PrettyWood closed 9 months ago

PrettyWood commented 9 months ago

closes #162

PrettyWood commented 9 months ago
```python
import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("--bytes", action="store_true")
    return parser.parse_args()


def main():
    args = get_args()
    if args.bytes:
        # New code path: pass the file's raw bytes content
        with open(args.file, "rb") as f:
            excel_file = fastexcel.read_excel(f.read())
    else:
        # Existing code path: pass the file path
        excel_file = fastexcel.read_excel(args.file)

    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()


if __name__ == "__main__":
    main()
```

File 1

|         | file       | bytes       |
|---------|------------|-------------|
| `&[u8]` | slice-file | slice-bytes |
| `Vec`   | vec-file   | vec-bytes   |

File 2

|         | file                | bytes                |
|---------|---------------------|----------------------|
| `&[u8]` | formulas-slice-file | formulas-slice-bytes |
| `Vec`   | formulas-vec-file   | formulas-vec-bytes   |

*(table cells are benchmark screenshots)*

It seems to clone the data @lukapeschke. I can revert the last commit, but I don't have a direct solution in mind to avoid the leak. I may be missing something.

lukapeschke commented 9 months ago

Indeed, it seems like `&[u8]` would be more memory-efficient. Unfortunately, I ran some tests with the following script (5 and 10 iterations), and it seems like the memory really does leak :confused: So I think we will have to keep the owned types:

```python
import argparse
import gc
import time

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    use_columns = args.column or None

    for _ in range(10):
        with open(args.file, "rb") as fd:
            content = fd.read()
            excel_file = fastexcel.read_excel(content)

        sheet = excel_file.load_sheet_by_name(excel_file.sheet_names[0], use_columns=use_columns)
        arrow_data = sheet.to_arrow()

        # Drop every Python-side reference and force a collection, so any
        # memory still held can only come from the extension side.
        del excel_file
        del content
        del sheet
        del arrow_data
        gc.collect()
        time.sleep(0.25)


if __name__ == "__main__":
    main()
```
|                | 5 iterations | 10 iterations |
|----------------|--------------|---------------|
| HEAD (owned)   | vec          | vec_10        |
| HEAD^ (slice)  | slice        | slice_10      |

*(table cells are memory-profiling screenshots)*
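To put numbers on a run like the one above, the loop can also sample the process's peak RSS after each iteration using the stdlib `resource` module (Unix-only). This is a minimal sketch of that methodology; `workload` is a stand-in for the fastexcel calls, not part of the library:

```python
import gc
import resource


def peak_rss_kib() -> int:
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def sample_memory(workload, iterations: int = 5) -> list[int]:
    """Run `workload` repeatedly, recording peak RSS after each run."""
    samples = []
    for _ in range(iterations):
        workload()
        gc.collect()
        samples.append(peak_rss_kib())
    return samples


# Stand-in workload: allocate and drop ~8 MiB per iteration.
samples = sample_memory(lambda: bytes(8 * 1024 * 1024))
# ru_maxrss is a high-water mark: a series that keeps climbing across
# iterations suggests memory is being retained between runs.
print(samples)
```

Since `ru_maxrss` only ever grows, a flat series after the first iteration is the healthy outcome; steady growth per iteration is the leak signature.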
PrettyWood commented 9 months ago

Yes, this is why I didn't revert, but I'm not happy with it. `read_excel` is way slower with bytes (with `Vec<u8>`) than with a file path, while the loading step is almost the same. Of course it's fine if the content is small, but for big content the difference is really significant.
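A timing comparison like this one can be reproduced with a small `perf_counter` harness. This is a sketch only; the lambdas are stand-in workloads where the real comparison would use `fastexcel.read_excel(path)` versus `fastexcel.read_excel(content)`:

```python
import time


def bench(fn, repeats: int = 3) -> float:
    """Return the best wall-clock time of `fn` over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workloads; swap in the path-based and bytes-based calls.
path_time = bench(lambda: sum(range(100_000)))
bytes_time = bench(lambda: sum(range(200_000)))
print(f"path: {path_time:.6f}s  bytes: {bytes_time:.6f}s")
```

Taking the best of several runs rather than the mean reduces noise from the OS scheduler and filesystem cache, which matters when the two code paths differ mainly in one extra copy.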

PrettyWood commented 9 months ago

Ok I have something :) With the last commit:

BEFORE: (benchmark screenshot)

AFTER: (benchmark screenshot)

lukapeschke commented 9 months ago

Haha I was just benchmarking something similar :smile: (benchmark screenshot)

Looks good!!

PrettyWood commented 9 months ago

Yep, even better! The trick was just to avoid downcasting to a `Vec` via pyo3 and to let `to_vec` do the magic instead.
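The copy-vs-borrow distinction behind this fix can be illustrated in pure Python (an analogy only, not the actual pyo3 mechanics): `bytearray(content)` always materializes a fresh buffer, while `memoryview(content)` reads the existing one without duplicating it.

```python
content = b"x" * (4 * 1024 * 1024)  # ~4 MiB payload

# Copying: bytearray always allocates a fresh buffer of the same size.
copied = bytearray(content)

# Borrowing: memoryview exposes the same underlying buffer, zero-copy.
view = memoryview(content)

print(view.obj is content)          # True: no duplication happened
print(len(copied) == view.nbytes)   # True: same logical content either way
```

The performance question in the PR is essentially when that extra buffer materialization happens and how many times, which is negligible for small files but dominates for large ones.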

alexander-beedie commented 8 months ago

> closes #162

Nice!

alexander-beedie commented 6 months ago

@PrettyWood Finally integrated this on our side: https://github.com/pola-rs/polars/pull/16344 Thanks! 👍