ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License

Silent deletion of data should be construed as a bug, not the default behavior #277

Closed · D3SL closed this issue 1 month ago

D3SL commented 3 months ago

I just ran into the issue from #208 with someone I'm helping learn Python. In fact, the only reason we realised anything was wrong at all is that it was a small, human-readable practice dataset, and they knew some of the columns shouldn't have been completely empty. There were no warnings and no error messages, just data going missing completely silently. Nothing in Polars' breaking-changes documentation suggested this was a possibility, and nothing in fastexcel's documentation warns of possible silent data loss either.

Imho, inferring a "null" column should be treated as a failure state. A user should have to explicitly opt in via an argument to allow the possible loss of data, with a warning. Deleting an entire column of data completely silently, just because an Excel spreadsheet happened to have been sorted in a particular order, is a showstopping issue. I cannot state in strong enough terms how much I disagree with the choice to make this the default behavior.
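To make the failure mode concrete, here is a minimal, purely illustrative Python sketch of sample-based dtype inference. This is a hypothetical model, not fastexcel's actual implementation (which lives in Rust): when a sparse column happens to be sorted so that all of its values fall after the sampled rows, the sample is entirely empty, the column is inferred as "null", and every later value is silently dropped.

```python
# Hypothetical sketch of sample-based schema inference; fastexcel's real
# logic differs in detail, but the failure mode is the same.

def infer_dtype(sample):
    """Infer a column dtype from sampled cells; an all-None sample -> 'null'."""
    non_null = [v for v in sample if v is not None]
    if not non_null:
        return "null"  # nothing seen in the sample -> column assumed empty
    if all(isinstance(v, (int, float)) for v in non_null):
        return "float"
    return "string"

# A column sorted so the first 1000 rows are empty; real values appear later.
column = [None] * 1000 + ["actual data"] * 50

SAMPLE_ROWS = 100  # only the first rows are examined for schema inference
dtype = infer_dtype(column[:SAMPLE_ROWS])
print(dtype)  # "null"

# Casting every cell to the inferred "null" dtype discards the real values.
loaded = [None if dtype == "null" else v for v in column]
print(any(v is not None for v in loaded))  # False: 50 real values were lost
```

Sampling the whole column instead of the first rows would have returned "string" here, which is why scanning all rows (or warning on a "null" inference) avoids the data loss.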

GRaacM commented 1 month ago

Yeah, I have to agree that having this as the default behaviour is problematic. It should be opt-in, and at the bare minimum a warning should be emitted.

PrettyWood commented 1 month ago

@lukapeschke I don't think we should set the sample rows to None by default, but we should probably raise a warning if a column has the "null" type. Or maybe try to load more data for those columns by default.

D3SL commented 1 month ago

As far as I know, any conventional cell value in Excel can be represented as a string. A safe option could be to default to string when a column type can't be inferred. That gives users a chance to explore the data with values presented more or less "as-is", rather than forcing them to open the file by hand in Excel.

GRaacM commented 1 month ago

Here the issue isn't that the type couldn't be inferred, though. The issue is that when null is wrongly inferred, it ends up in data loss. I think a warning when null is inferred, and/or making null inference opt-in instead of the default behaviour, would be the best option.

PrettyWood commented 1 month ago

We plan to add a fallback to string when the dtype could not be inferred, which indeed seems more reasonable as a default. We'll do that in v0.13.0, since v0.12.0 is already planned for this month. In the meantime, you can set schema_sample_rows=None to ensure every cell is taken into account.

adrivn commented 1 month ago

Hello, I referenced this issue in https://github.com/pola-rs/polars/issues/18612#issuecomment-2407485034, as the default schema inference fails when the number of empty records is high enough.

You can use my example .xlsx file attached in the issue mentioned above. It has over 100k records for the id column, but in the second column only ~3k rows contain values, and the first of them appears very far into the file, effectively rendering the column "null" under the current fastexcel behavior. It shouldn't work like this.

GRaacM commented 1 month ago

@PrettyWood If I understand your commit correctly, you're now checking whether the column is actually empty before assigning the null datatype? And you do that by checking that the last row is also null? (Sorry if I shouldn't @ you for this, I'm not that used to GitHub etiquette.)

PrettyWood commented 1 month ago

If there is no data (meaning the range you chose to read is empty, e.g. an empty file or a header row too far down), the dataframe will be empty, with null as the dtype for its columns, since nothing can be read at all. If there is data (99% of the time) and the sample size was not enough to determine the dtype, we fall back to string and raise a warning explaining the fallback.
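The rule described above can be sketched as follows. This is a hypothetical Python model of my reading of the comment (the function name and flags are made up for illustration; the real implementation is in Rust): a truly empty range keeps the null dtype, while a non-empty column whose sample was inconclusive falls back to string with a warning.

```python
import warnings


def resolve_dtype(sampled_dtype, column_has_data):
    """Hypothetical sketch of the planned v0.13.0 rule described above."""
    if sampled_dtype != "null":
        return sampled_dtype  # the sample was conclusive; keep the inferred dtype
    if not column_has_data:
        return "null"  # truly empty range: nothing can be read at all
    # Data exists but the sample missed it: fall back to string and warn.
    warnings.warn(
        "could not determine dtype from the sampled rows; falling back to string"
    )
    return "string"


print(resolve_dtype("float", True))    # "float": inference worked as usual
print(resolve_dtype("null", False))    # "null": genuinely empty range

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(resolve_dtype("null", True))  # "string", and one warning was raised
    print(len(caught))                  # 1
```

With this rule, the sorted-sparse-column scenario from earlier in the thread would load as strings with a visible warning instead of vanishing silently.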