Closed lukapeschke closed 4 months ago
@PrettyWood I'm looking at the implem for column ranges right now, and I'm thinking about wether column names should be loaded and validated on sheet instatiation. However, this would have an impactt:
ExcelSheet
instantiation would become slightly slower and less efficient memory-wise, as data loading & validation would occur on instantiation. column_names
could become a sheet attributereader
owned ?)Any thoughts about this ?
Yeah this is what I wanted originally when I said "Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width" I'd rather have a slight drop in performance-wise but better API and validation. In my mind once you load the excelsheet you could have the range of column indices and names. And if you select some they are directly validated at instantiation with better error messages (even with fuzzysearch to say "did you mean this column?" which is clearly not needed but why not haha) Could we try to go with this approach and see the perf impact?
For your 4th point I have no idea to be honest. Clearly lifetimes are a mess with pyo3 (cf the discussion we had on bytes)
@PrettyWood
import argparse
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
use_columns = args.column or None
for sheet_name in excel_file.sheet_names:
excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
if __name__ == "__main__":
main()
❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou
Traceback (most recent call last):
File "/home/lpeschke/work/toucan/fastexcel/test.py", line 24, in <module>
main()
File "/home/lpeschke/work/toucan/fastexcel/test.py", line 20, in main
excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 65, in to_arrow
return self._sheet.to_arrow()
^^^^^^^^^^^^^^^^^^^^^^
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
0: selected columns are invalid
1: could not create RecordBatch from sheet "INVOICES"
2: could not convert RecordBatch to pyarrow for sheet "INVOICES"
python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou 5.24s user 0.69s system 110% cpu 5.374 total
❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx
python3.12 test.py ~/Downloads/TestExcel.xlsx 5.90s user 0.75s system 109% cpu 6.089 total
Traceback (most recent call last):
File "/home/lpeschke/work/toucan/fastexcel/test.py", line 23, in <module>
main()
File "/home/lpeschke/work/toucan/fastexcel/test.py", line 19, in main
excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 132, in load_sheet_by_name
self._reader.load_sheet_by_name(
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
0: selected columns are invalid, available columns are: [<column list, elided>]
python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou 5.25s user 0.74s system 110% cpu 5.398 total
❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx
python3.12 test.py ~/Downloads/TestExcel.xlsx 5.90s user 0.78s system 108% cpu 6.123 total
So the perf impact seems neglectible to me.
If the code LGTY, I'll expose the columns though an available_columns
property. Souds good to you ?
Perfect 🎉
Great, I'll do it later this afternoon then. Anything else I need to change ?
Nope I'll test the final PR tonight and will probably merge. Great work
closes #172 . Depends on #186