lukapeschke commented 4 months ago

closes #172 . Depends on #186

lukapeschke commented 4 months ago

@PrettyWood I'm looking at the implem for column ranges right now, and I'm thinking about wether column names should be loaded and validated on sheet instatiation. However, this would have an impactt:

ExcelSheet instantiation would become slightly slower and less efficient memory-wise, as data loading & validation would occur on instantiation.
Errors would occur earlier: the column selection would be validated on sheet instantiation rather than on conversion
column_names could become a sheet attribute
It might be possible that this would allow to make functions obtaining data by ref lazy (not sure about this one, depends if I can elide lifetimes, as they don't play well with PyO3: https://pyo3.rs/main/class.html?highlight=lifetime#no-lifetime-parameters . Maybe by making the reader owned ?)

Any thoughts about this ?

PrettyWood commented 4 months ago

Yeah this is what I wanted originally when I said "Would be nice to add to ExcelSheet the range of indices and list of columns. Currently we only have the width" I'd rather have a slight drop in performance-wise but better API and validation. In my mind once you load the excelsheet you could have the range of column indices and names. And if you select some they are directly validated at instantiation with better error messages (even with fuzzysearch to say "did you mean this column?" which is clearly not needed but why not haha) Could we try to go with this approach and see the perf impact?

For your 4th point I have no idea to be honest. Clearly lifetimes are a mess with pyo3 (cf the discussion we had on bytes)

lukapeschke commented 4 months ago

@PrettyWood

script

import argparse

import fastexcel

def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()

def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()

if __name__ == "__main__":
    main()

before

invalid columns

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou
Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 24, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 20, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 65, in to_arrow
    return self._sheet.to_arrow()
           ^^^^^^^^^^^^^^^^^^^^^^
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid
    1: could not create RecordBatch from sheet "INVOICES"
    2: could not convert RecordBatch to pyarrow for sheet "INVOICES"
python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.24s user 0.69s system 110% cpu 5.374 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.75s system 109% cpu 6.089 total

after

invalid

Traceback (most recent call last):
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 23, in <module>
    main()
  File "/home/lpeschke/work/toucan/fastexcel/test.py", line 19, in main
    excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpeschke/work/toucan/fastexcel/.direnv/python-3.12/lib64/python3.12/site-packages/fastexcel/__init__.py", line 132, in load_sheet_by_name
    self._reader.load_sheet_by_name(
_fastexcel.ColumnNotFoundError: column with name "coucou" not found
Context:
    0: selected columns are invalid, available columns are: [<column list, elided>]

python3.12 test.py ~/Downloads/TestExcel.xlsx --column coucou  5.25s user 0.74s system 110% cpu 5.398 total

valid

❯ time python3.12 test.py ~/Downloads/TestExcel.xlsx                
python3.12 test.py ~/Downloads/TestExcel.xlsx  5.90s user 0.78s system 108% cpu 6.123 total

So the perf impact seems neglectible to me.

If the code LGTY, I'll expose the columns though an available_columns property. Souds good to you ?

PrettyWood commented 4 months ago

Perfect 🎉

lukapeschke commented 4 months ago

Great, I'll do it later this afternoon then. Anything else I need to change ?

PrettyWood commented 4 months ago

Nope I'll test the final PR tonight and will probably merge. Great work

ToucanToco / fastexcel

feat: allow to select a subset of columns #189

script

before

invalid columns

valid

after

invalid

valid