ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
82 stars 4 forks source link

`column_names` are taken partially when `use_columns` does not include all columns #214

Closed wikiped closed 3 months ago

wikiped commented 3 months ago

Given this worksheet data (without any empty area):

A B C
21 22 23
31 32 33
41 42 43

The following code:

import fastexcel as fe

file = r'<path to file>'
params = {'idx_or_name': 0, 'header_row': None, 'skip_rows': 1, 'use_columns': [1, 2], 'column_names': ['Col B', 'Col C']}

print(fe.read_excel(file).load_sheet(**params).to_polars())

Outputs:

shape: (3, 2)
┌───────┬──────────────┐
│ Col C ┆ __UNNAMED__2 │
│ ---   ┆ ---          │
│ f64   ┆ f64          │
╞═══════╪══════════════╡
│ 22.0  ┆ 23.0         │
│ 32.0  ┆ 33.0         │
│ 42.0  ┆ 43.0         │
└───────┴──────────────┘

Expected output:

shape: (3, 2)
┌───────┬───────┐
│ Col B ┆ Col C │
│ ---   ┆ ---   │
│ f64   ┆ f64   │
╞═══════╪═══════╡
│ 22.0  ┆ 23.0  │
│ 32.0  ┆ 33.0  │
│ 42.0  ┆ 43.0  │
└───────┴───────┘
lukapeschke commented 3 months ago

Hi @wikiped this is actually the expected behaviour: column_names is used to specify the name of the columns in the resulting DataFrame. Since there are three columns and only two names were provided, the third name is generated based on the column index.: https://fastexcel.toucantoco.dev/fastexcel.html#ExcelReader.load_sheet

Achieving the result you want is possible with the use_columns parameter:

import fastexcel as fe

file = 'repro_214.xlsx'
params = {'idx_or_name': 0, 'use_columns': ['B', 'C']}

print(fe.read_excel(file).load_sheet(**params).to_polars())

Will print

shape: (3, 2)
┌──────┬──────┐
│ B    ┆ C    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 22.0 ┆ 23.0 │
│ 32.0 ┆ 33.0 │
│ 42.0 ┆ 43.0 │
└──────┴──────┘

However there is indeed a bug when mixing use_columns and column_names, so thanks for the report!

import fastexcel as fe

file = 'repro_214.xlsx'
params = {'idx_or_name': 0, 'header_row': None, 'skip_rows': 1, 'column_names': ['Col B', 'Col C'], 'use_columns': [1, 2]}

print(fe.read_excel(file).load_sheet(**params).to_polars())

Returns

shape: (3, 2)
┌───────┬──────────────┐
│ Col C ┆ __UNNAMED__2 │
│ ---   ┆ ---          │
│ f64   ┆ f64          │
╞═══════╪══════════════╡
│ 22.0  ┆ 23.0         │
│ 32.0  ┆ 33.0         │
│ 42.0  ┆ 43.0         │
└───────┴──────────────┘

And the expected result would be

shape: (3, 2)
┌───────┬───────┐
│ Col B ┆ Col C │
│ ---   ┆ ---   │
│ f64   ┆ f64   │
╞═══════╪═══════╡
│ 22.0  ┆ 23.0  │
│ 32.0  ┆ 33.0  │
│ 42.0  ┆ 43.0  │
└───────┴───────┘

repro_214.xlsx

wikiped commented 3 months ago

Thank you for the feedback, I realized now that I forgot to paste the significant part of the parameters used: 'use_columns': [1, 2], which is, as you noted, part of the reason for the raising the issue.

lukapeschke commented 3 months ago

Hi @wikiped v0.10.2 was just released and fixes this

wikiped commented 3 months ago

Thank you for prompt fix!