Closed: deanm0000 closed this 8 months ago.
Added a file which represents a related issue. The columns representing hours, minutes, seconds spent in a state are returning as nulls for some reason.
```python
excel_reader = fastexcel.read_excel(local_path)
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()
```
If I read the same file with pandas, the values are returned (`pd_df = pd.read_excel(local_path, engine="openpyxl", skiprows=10)`).
And one more example:
test_excel_engine.xlsx

Another issue is that in mixed columns where most values look like numbers, the text values are being replaced with nulls. You can see that anyone whose ID contains text has had it replaced by a null, and the column is interpreted as a float.
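The behavior the report asks for can be sketched with a toy inference rule (illustrative pure Python, not fastexcel's actual implementation): a column that mixes numeric and text cells should widen to string rather than coerce the text cells to null.

```python
def infer_column_type(cells):
    """Toy column-type inference: if every non-empty cell is numeric,
    the column is 'float'; any text cell widens the whole column to
    'string' (illustrative only, not fastexcel's code)."""
    types = {type(c) for c in cells if c is not None}
    if types <= {int, float}:
        return "float"
    return "string"

def load_column(cells):
    """Load cells under the inferred type, so numeric-looking IDs
    survive as text instead of becoming nulls."""
    if infer_column_type(cells) == "string":
        return [None if c is None else str(c) for c in cells]
    return [None if c is None else float(c) for c in cells]

ids = [1001, 1002, "EMP-17", None, 1004]
print(load_column(ids))  # the text ID is kept, not replaced by null
```

Under this rule the whole employee ID column comes back as strings as soon as one cell is textual, which is what the report expects.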
Thank you for the clean bug report, and especially for the files, they make debugging much easier :pray:
I can indeed reproduce the three issues you're describing. It seems to me that they are caused by two distinct problems:
The `as_time` API was used to convert durations to milliseconds, as the `as_duration` API was not available yet. However, this does not work anymore, as `as_time` will return `None` for `Duration` values. The fix is very easy and will be out soon. I'll create two child issues for this.
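For context, the conversion described above boils down to turning a duration into a millisecond count. A minimal sketch (assuming nothing about calamine's real API; names are illustrative):

```python
from datetime import timedelta

def duration_to_ms(d):
    """Convert a duration to whole milliseconds (sketch of the conversion
    described above; not calamine's or fastexcel's actual code)."""
    if d is None:
        return None
    return round(d.total_seconds() * 1000)

# An Excel cell holding 01:02:03 spent in a state:
print(duration_to_ms(timedelta(hours=1, minutes=2, seconds=3)))  # 3723000
```

When the underlying API instead yields `None` for `Duration` cells, every such conversion produces a null, which matches the all-null hours/minutes/seconds columns in the original report.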
@deanm0000
There should be a way to override type inference
Could you please create a separate issue for that? It definitely seems like a reasonable feature to ask for, but it will need some discussion so we can agree on an API.
@ldacey Would it be OK with you if I added your files as test cases to the repo? Or should I modify the data in them?
You can add those files - the data is all fake but the structure and columns match some real sources we get from some ancient reporting tools.
There are 3 issues:
Hi - 0.9.0 works great for most of the files I needed to process but there is something odd going on if there are blank rows.
The first sheet works fine, but the second sheet fails (that reflects how actual data comes in from the source system):
```
File line 83, in
    pl_df = excel_reader.load_sheet(idx_or_name=1, header_row=9).to_polars()
File line 64, in to_polars
    df = pl.from_arrow(data=self.to_arrow())
File line 47, in to_arrow
    return self._sheet.to_arrow()
RuntimeError: Could not create RecordBatch from sheet Sheet2
Caused by:
    0: Could not build schema for sheet Sheet2
    1: could not figure out column type for following type combination: {Timestamp(Millisecond, None), Utf8}
```
I could not figure out the precise cause. There is one row per employee ID per date even if they did not log into the system but fastexcel does not seem to like it if the first rows are null for the value columns.
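The failure mode is consistent with schema inference that has no widening rule for the timestamp/text pair: leading null rows leave the column's type ambiguous, and once both timestamps and strings appear, unification fails. A toy version of such unification (illustrative only, not fastexcel's code):

```python
from datetime import datetime

def unify_column_type(cell_types):
    """Toy schema unification: widen known type pairs, and fail on
    unknown combinations the way the traceback above does
    (illustrative only, not fastexcel's code)."""
    widenings = {frozenset({int, float}): float}
    if len(cell_types) == 1:
        return next(iter(cell_types))
    widened = widenings.get(frozenset(cell_types))
    if widened is None:
        raise RuntimeError(
            f"could not figure out column type for combination: {cell_types}"
        )
    return widened

# A column mixing timestamps and text triggers the failure:
try:
    unify_column_type({datetime, str})
except RuntimeError as e:
    print("schema inference failed:", e)
```

Falling back to string for any unknown combination (as in the earlier toy example) would avoid the hard error at the cost of losing the timestamp type.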
Same issue if I try using to_pandas:
```
File line 55, in to_pandas
    return self.to_arrow().to_pandas()
File line 47, in to_arrow
    return self._sheet.to_arrow()
```
Looks like all of the issues I ran into have been addressed in 0.10.
The duration, timestamp, and date columns are parsed correctly (had some issues with openpyxl with the duration columns, and issues with the blank rows with fastexcel). The employee ID column also retains the IDs which look like integers instead of replacing them with nulls.
Very cool, thanks for the work on this.
Glad to know it works for you!
Here's a file where, when I open it, the `Source PNODEID` column comes through as float64 with a bunch of nulls. I'd prefer it to be an int, but it seems it should at least come back as a String rather than a Float with nulls. My workaround is to use calamine-python like this; not sure if there's a better way.
To sum up: