Yeah, I have to agree that making this the default behaviour is problematic; it should be opt-in, and at the bare minimum a warning should be emitted.
@lukapeschke I don't think we should set the sample rows to None by default, but we should probably raise a warning if a column has the "null" type. Or maybe try to load more data for those columns by default.
As far as I know, any conventional cell value in Excel can be represented as a string. A safe option could be to default to string when a column's type can't be inferred. That gives users a chance to explore the data with values presented more or less "as-is", rather than forcing them to open the file by hand in Excel.
Here the issue isn't that the type couldn't be inferred, though. The issue is that when null is wrongly inferred, it results in data loss. I think a warning when null is inferred, and/or making null inference opt-in instead of the default behaviour, would be the best option.
We plan to add a fallback to string when the dtype could not be inferred by default, which indeed seems more reasonable.
We'll do that in v0.13.0, since v0.12.0 is already planned for this month.
In the meantime you can set schema_sample_rows=None to ensure every cell is taken into account.
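For anyone else landing here, a hedged example of that workaround using fastexcel directly (the file name is made up for illustration):

```python
import fastexcel

# "data.xlsx" is an assumed file name.
reader = fastexcel.read_excel("data.xlsx")
# schema_sample_rows=None makes dtype inference scan every row instead of
# only the first N, at the cost of a slower read on large files.
sheet = reader.load_sheet(0, schema_sample_rows=None)
df = sheet.to_polars()
```

If you're going through polars' `pl.read_excel` with the calamine engine, passing `read_options={"schema_sample_rows": None}` should forward the same option, assuming a reasonably recent polars version.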
Hello, I referenced this issue in https://github.com/pola-rs/polars/issues/18612#issuecomment-2407485034, as the default schema inference fails when the number of empty records is high enough.
You can use my example .xlsx file attached in the issue mentioned above. It has over 100k records for the id column, but for the second column only ~3k contain values, and they first appear very far into the file, effectively rendering the column "null" under the current fastexcel behavior. This shouldn't work like it does.
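For reference, a sketch that generates a file with the same shape as the one described above (the values and exact row positions are made up; only the proportions match the description):

```python
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
ws.append(["id", "sparse"])
for i in range(100_000):
    # Only ~3k of the 100k rows carry a value in the second column,
    # and none of them fall within the rows sampled by default inference.
    ws.append([i, "x" if i >= 97_000 else None])
wb.save("sparse.xlsx")
```

Reading such a file with default settings should reproduce the null-inferred column.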
@PrettyWood If I understand your commit correctly, you're now checking whether the column is actually empty before assigning the null datatype? And you do that by checking that the last row is also null? (Sorry if I shouldn't @ you for this, I'm not that used to GitHub etiquette.)
If there is no data (meaning the range you chose to read is empty, e.g. an empty file or a header row too far down), the dataframe will be empty with null as the dtype for its columns, since nothing can be read at all.
If there is data (99% of the time) and the sample size was not enough to determine the dtype, we fall back to string and raise a warning to explain this fallback.
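To make those rules concrete, here's a minimal Python sketch of that logic (fastexcel's actual implementation is in Rust; `column_has_data` and the simplified inference path are made up for illustration):

```python
import warnings

def infer_dtype(sampled_values: list, column_has_data: bool) -> str:
    """Toy version of the rules described above, not fastexcel's real code."""
    non_null = [v for v in sampled_values if v is not None]
    if non_null:
        # Normal path: infer from the sampled values (grossly simplified here).
        return type(non_null[0]).__name__
    if not column_has_data:
        # Empty range / empty file: nothing can be read at all, dtype stays null.
        return "null"
    # Data exists beyond the sample: fall back to string instead of dropping it,
    # and tell the user why.
    warnings.warn("dtype could not be inferred from the sample; falling back to string")
    return "string"
```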
I just ran into the issue from #208 with someone I'm helping learn Python. In fact, that's the only reason we realised anything was wrong: because it was a small, human-readable practice dataset, they knew some of the columns shouldn't have been completely empty. There were no warnings, no error messages, just data going missing completely silently. Nothing in Polars' breaking-changes documentation suggested this was a possibility, and nothing in fastexcel's documentation warns of possible silent data loss either.
Imho, inferring a "null" column should be treated as a failure state. A user should have to explicitly opt in via an argument to allow the possible loss of data, with a warning. Deleting an entire column of data completely silently, because an Excel spreadsheet just happened to be sorted in a particular order, is a showstopping issue. I cannot state in strong enough terms how much I disagree with the choice to have this as the default behavior.