Closed PrettyWood closed 9 months ago
import argparse

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("--bytes", action='store_true')
    return parser.parse_args()


def main():
    args = get_args()
    if args.bytes:
        # Read the whole file into memory and pass the raw bytes
        with open(args.file, "rb") as f:
            excel_file = fastexcel.read_excel(f.read())
    else:
        # Pass the file path directly
        excel_file = fastexcel.read_excel(args.file)
    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name).to_arrow()


if __name__ == "__main__":
    main()
| | file | bytes |
|---|---|---|
| `&[u8]` | (image not preserved) | (image not preserved) |
| `Vec<u8>` | (image not preserved) | (image not preserved) |
It seems to clone the data, @lukapeschke. I can revert the last commit, but I don't have a direct solution in mind to avoid the `leak`. I may be missing something.
Indeed, it seems like `&[u8]` would be more memory-efficient. Unfortunately, I ran some tests with the following script (5 and 10 iterations), and it seems like the `leak` really is leaking :confused: So I think we will have to keep the owned types:
import argparse
import gc
import time

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    use_columns = args.column or None

    for _ in range(10):
        with open(args.file, "rb") as fd:
            content = fd.read()
        excel_file = fastexcel.read_excel(content)
        sheet = excel_file.load_sheet_by_name(excel_file.sheet_names[0], use_columns=use_columns)
        arrow_data = sheet.to_arrow()
        # Drop every reference so any memory still held afterwards is a leak
        del excel_file
        del content
        del sheet
        del arrow_data
        gc.collect()
        time.sleep(0.25)


if __name__ == "__main__":
    main()
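| | 5 iterations | 10 iterations |
|---|---|---|
| HEAD (owned) | (image not preserved) | (image not preserved) |
| HEAD^ (slice) | (image not preserved) | (image not preserved) |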
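For anyone who wants numbers rather than plots, here is a minimal sketch of how memory growth across iterations could be recorded with the standard library alone. It is not the tooling used in this thread: `peak_rss_mib`, `track_peak_rss`, and `allocate_and_release` are hypothetical helpers, and the stand-in workload would need to be replaced by the `read_excel` / `load_sheet_by_name` / `to_arrow` sequence from the script above. It assumes a Unix-like system, since the `resource` module is not available on Windows.

```python
import gc
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kibibytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(workload, iterations: int) -> list[float]:
    """Run workload() repeatedly and record the peak RSS after each run."""
    samples = []
    for _ in range(iterations):
        workload()
        gc.collect()  # drop Python-side garbage before sampling
        samples.append(peak_rss_mib())
    return samples


def allocate_and_release() -> None:
    """Hypothetical stand-in workload: allocates 8 MiB, freed on return."""
    buffer = bytearray(8 * 1024 * 1024)
    buffer[0] = 1


if __name__ == "__main__":
    for i, mib in enumerate(track_peak_rss(allocate_and_release, 5), start=1):
        print(f"iteration {i}: peak RSS {mib:.1f} MiB")
```

A peak that keeps climbing by roughly the file size on every iteration would point to memory that is never released, whereas a peak that flattens after the first iteration suggests the buffers are being reused or freed.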
Yes, this is why I didn't revert, but I'm not happy with it. `read_excel` is way slower with bytes (with `Vec<u8>`) than with a file path, while the sheet loading time is almost the same.
Of course, if the content is small it's OK, but for large content the difference is really significant.
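A minimal sketch of how the two `read_excel` call styles could be timed side by side. The helpers `best_time` and `compare` are hypothetical names introduced here, not part of fastexcel, and the demo uses stand-in callables so that it runs without any Excel file.

```python
import time


def best_time(func, repeat: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(candidates: dict, repeat: int = 5) -> dict:
    """Time each zero-argument callable and return a label -> seconds mapping."""
    return {label: best_time(func, repeat) for label, func in candidates.items()}


if __name__ == "__main__":
    # Stand-in callables; swap in the real calls as described below.
    results = compare(
        {
            "stand-in A": lambda: sum(range(100_000)),
            "stand-in B": lambda: sorted(range(100_000), reverse=True),
        }
    )
    for label, seconds in results.items():
        print(f"{label}: {seconds * 1000:.2f} ms")
```

With fastexcel installed, the stand-ins could be replaced by `lambda: fastexcel.read_excel(path)` and `lambda: fastexcel.read_excel(content)`, where `content` holds the bytes read from the same file, matching the two branches of the first script in this thread.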
OK, I have something :) With the last commit
Haha I was just benchmarking something similar :smile:
Looks good!!
Yep, even better! The trick was just to avoid downcasting to a `Vec` via pyo3 and let `to_vec` do the magic.
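The Rust change itself is not shown in this thread. As a rough Python analogy only, and assuming the slow path was an element-by-element conversion of the buffer (an assumption, not something confirmed here), the sketch below contrasts copying bytes one at a time with copying the whole buffer in a single call. `copy_elementwise` and `copy_bulk` are hypothetical names used purely for illustration.

```python
import time


def copy_elementwise(data: bytes) -> bytearray:
    """Copy a buffer one byte at a time (analogy for a per-element conversion)."""
    out = bytearray()
    for byte in data:
        out.append(byte)
    return out


def copy_bulk(data: bytes) -> bytearray:
    """Copy a buffer in a single call (analogy for a one-shot slice copy)."""
    return bytearray(data)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data

    for copy in (copy_elementwise, copy_bulk):
        start = time.perf_counter()
        result = copy(payload)
        elapsed = time.perf_counter() - start
        # Both strategies must yield identical contents
        assert result == payload
        print(f"{copy.__name__}: {elapsed * 1000:.2f} ms")
```

Both functions produce an owned copy of the same data; only the way the copy is made differs, which is the kind of distinction the comment above describes.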
closes #162
Nice!
@PrettyWood Finally integrated this on our side: https://github.com/pola-rs/polars/pull/16344 Thanks! 👍