Closed lukapeschke closed 3 months ago
Some work is still required in calamine: https://github.com/tafia/calamine/pull/409
Okay well, just noticed that the API changed, so we actually need to use `worksheet_range_ref` in case `Sheets` is the `Xlsx` variant
Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍
main
```python
import argparse
from time import sleep

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()
```
```python
import argparse
from time import sleep

import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()
```
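The thread does not say which tool produced the memory graphs. As a rough stand-in, the sketch below reads the process's peak resident set size from the standard library's `resource` module (Unix only) after running a workload a given number of times. `peak_rss_mib` and `run_iterations` are hypothetical helper names, not part of fastexcel, and the workload in the usage example is a placeholder allocation rather than a real sheet load.

```python
import resource
import sys
from typing import Callable


def peak_rss_mib() -> float:
    """Return the peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_iterations(load: Callable[[], object], iterations: int) -> float:
    """Call `load` `iterations` times, then return the process peak RSS in MiB."""
    for _ in range(iterations):
        load()
    return peak_rss_mib()


if __name__ == "__main__":
    # Placeholder workload (assumption): swap in a function that loads every sheet.
    peak = run_iterations(lambda: bytearray(10_000_000), 20)
    print(f"peak RSS: {peak:.1f} MiB")
```

Because the peak is tracked per process, each variant (main, lazy, eager) would need to run in its own process for the numbers to be comparable.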
New benchmark looks great 😃
Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:
Benchmarks with the latest version:
iterations | owned | by ref |
---|---|---|
1 | (graph not preserved) | (graph not preserved) |
20 | (graph not preserved) | (graph not preserved) |
calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435
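iterations | master | this branch (lazy) | this branch (eager) |
---|---|---|---|
1 | (graph not preserved) | (graph not preserved) | (graph not preserved) |
20 | (graph not preserved) | (graph not preserved) | (graph not preserved) |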
What
This introduces eager loading functions that make use of calamine's new `DataTypeRef`. This prevents some allocations, resulting in a lower memory footprint.
Caveats
The API is kinda rough for now and will probably need some cleaning (I mostly wanted to check whether the memory gain was worth it here).
The functions need to be eager because `DataTypeRef` has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the Python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters).
In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: https://github.com/tafia/calamine/pull/390
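The lifetime constraint has a loose analogy in pure Python. The sketch below is only an illustration built on the standard library's `memoryview`; it is not fastexcel or calamine code, and `load_eager` / `load_lazy` are hypothetical names. A borrowed view is only valid inside its scope, so the values have to be materialized before the scope ends, which is roughly why the by-ref functions load eagerly.

```python
def load_eager(raw: bytearray) -> list[int]:
    """Materialize owned values while the borrowed view is still valid."""
    with memoryview(raw) as view:  # borrows the buffer, no copy yet
        return list(view)  # copy out the values before the view is released


def load_lazy(raw: bytearray) -> memoryview:
    """Hand the borrowed view to the caller (broken on purpose)."""
    with memoryview(raw) as view:
        return view  # the view is released as soon as the `with` block exits


if __name__ == "__main__":
    data = bytearray(b"\x01\x02\x03")
    print(load_eager(data))

    stale = load_lazy(data)
    try:
        stale[0]
    except ValueError as exc:
        # Any access to a released memoryview raises ValueError.
        print(f"lazy access failed: {exc}")
```

Rust rejects the equivalent of `load_lazy` at compile time, whereas Python only fails at runtime, which is one way to see why a type carrying a lifetime cannot be exposed directly as a Python class.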
Gains
While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MB), while being about 10 times faster.
Memory graphs (images not preserved): Before, After, Pandas
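For reference, a percentage like the "almost 25%" above is a relative reduction of peak memory. A minimal sketch of that arithmetic follows; `relative_reduction` is a hypothetical helper, and the numbers in the usage example are made up for illustration, not measurements from this PR.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100


if __name__ == "__main__":
    # Made-up peak values in MB, for illustration only.
    print(f"{relative_reduction(400.0, 300.0):.1f}% less memory")
```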