parameter for dtype override option AND/OR better inference

deanm0000 commented 8 months ago

Here's a file that when I open it with:

filename='2017-2018-annual-auction-round-2-results.xls'
sheetname='Annual 2017-18 Results RD 2'
df = fastexcel.read_excel(filename).load_sheet_by_name(sheetname).to_polars()

then the Source PNODEID column comes through as float64 with a bunch of nulls. I'd prefer it to be an int but it seems it should at least return a String rather than Float with nulls.

My workaround is to use calamine-python like this, not sure if there's a better way

cal=CalamineWorkbook.from_path(filename)
df2=pl.from_records(cal.get_sheet_by_name(sheetname).to_python(), orient='row')
df2.columns=df2.rows()[0]
df2.with_columns(pl.col('Source PNODEID').cast(pl.Int64)) 
## I have to cast all the non-string columns but this works, no value raises

To sum up:

The type inference has an issue here.
There should be a way to override type inference

ldacey commented 8 months ago

example-skip-rows.xlsx

Added a file which represents a related issue. The columns representing hours, minutes, seconds spent in a state are returning as nulls for some reason.

excel_reader = fastexcel.read_excel(local_path)
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()

If I read the same file with pandas the values are returned (pd_df = pd.read_excel(local_path, engine="openpyxl", skiprows=10)

And one more example:

test_excel_engine.xlsx Another issue is that mixed columns which have values that look like numbers are being replaced with nulls.

You can see anyone that has an ID with text values has been replaced by nulls and the column is interpreted as a float

lukapeschke commented 8 months ago

Thank you for the clean bug report, and especially for the files, they make debugging much easier :pray:

I do indeed reproduce the three issues you're describing. It seems to me that they a re caused by two distinct issues:

For durations, we're we using calamine's as_time API to convert durations to milliseconds, as the as_duration API was not available yet. However, this does not work anymore, as as_time will return None for Duration values. The fix is very easy, and will be out soon
Type coercion is currently done on the first cell of the column only. It should be done on the entire column. The fix for that will be longer, as we need to figure out a way to do it without impacting performance too much.

I'll create two child issues for this.

@deanm0000

There should be a way to override type inference

Could you please create a separate issue for that ? It definitely seems like a reasonable feature to ask for, but it will need some discussion so we can agree on an API.

@ldacey Would it be OK with you if I added your files as test cases to the repo ? Or should I modify the data in them ?

ldacey commented 8 months ago

You can add those files - the data is all fake but the structure and columns match some real sources we get from some ancient reporting tools.

PrettyWood commented 8 months ago

There are 3 issues:

1 issue with milliseconds that has been fixed in https://github.com/ToucanToco/fastexcel/issues/159
1 issue with dtype guessing tracked by https://github.com/ToucanToco/fastexcel/issues/160
1 feature request: to be able to enforce dtypes tracked by https://github.com/ToucanToco/fastexcel/issues/173

ldacey commented 7 months ago

Hi - 0.9.0 works great for most of the files I needed to process but there is something odd going on if there are blank rows.

example-skip-rows.xlsx

The first sheet works fine, but the second sheet fails (that reflects how actual data comes in from the source system):

File line 83, in pl_df = excel_reader.load_sheet(idx_or_name=1, header_row=9).to_polars() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File line 64, in to_polars df = pl.from_arrow(data=self.to_arrow()) ^^^^^^^^^^^^^^^ File line 47, in to_arrow return self._sheet.to_arrow() ^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Could not create RecordBatch from sheet Sheet2

Caused by: 0: Could not build schema for sheet Sheet2 1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}

I could not figure out the precise cause. There is one row per employee ID per date even if they did not log into the system but fastexcel does not seem to like it if the first rows are null for the value columns.

Same issue if I try using to_pandas:

File line 55, in to_pandas return self.to_arrow().to_pandas() ^^^^^^^^^^^^^^^ File line 47, in to_arrow return self._sheet.to_arrow()

ldacey commented 6 months ago

Looks like all of the issues I ran into have been addressed in 0.10.

The duration, timestamp, and date columns are parsed correctly (had some issues with openpyxl with the duration columns, and issues with the blank rows with fastexcel). The employee ID column also retains the IDs which look like integers instead of replacing them with nulls.

Very cool, thanks for the work on this.

lukapeschke commented 6 months ago

Glad to know it works for you!

ToucanToco / fastexcel

parameter for dtype override option AND/OR better inference #158