ToucanToco / fastexcel

A Python wrapper around calamine
http://fastexcel.toucantoco.dev/
MIT License
120 stars 6 forks source link

Get locations of cells with invalid types #308

Open noctuid opened 1 month ago

noctuid commented 1 month ago

On 0.12.0, I'm not seeing an error if there is a row with e.g."foo" for a column that is set to "date" in "dtypes". That cell just is read as null. A string in a float column is also just converted to null. Is this expected? It would be great there was some way to get the locations of all failed cells, so that we could give the user back a file with bad cells highlighted.

lukapeschke commented 1 month ago

It is the expected behavior, yes. The rationale is that if the dtype is specified, the user knows what kind of data is contained in the column.

I reckon it'd be nice to at least have a log in case the cell is of an unexpected type, but that would require work on calamine's side. We could probably add some methods to the DataType trait which would return Result<Option<_>>, with an error in case the cell is not empty, but isn't of the expected dtype either.

However we'd need to think a bit about this, since we don't want to affect performance (maybe through a new warn_on_unexpected_dtype parameter ?).

@PrettyWood what are your thoughts on this ?

noctuid commented 1 month ago

So does calamine just currently silently convert invalid data to nulls? Ideally, if bad data is converted to null, we'd need some way where we could get/add an extra bool column for each column to the dataframe to signify whether the data was a valid type or had to be converted to null.

lukapeschke commented 4 weeks ago

Yes, calamine kinda silently converts unexpected data to nulls: it provides is_T() -> bool methods for every excel datatype, along with get_T() -> Option<T> . The get_T methods return None if they are called on an unexpected variant (get_datetime on an int for example). You're supposed to call is_T before using get_T.

But for columns where the dtype is provided, fastexcel skips the inference step, since it requires iterating over a lot of values. This means, we directly use get_T. A solution to this would be to add try_get_T methods to calamine, which would allow to get an error in case the data is of an unexpected type.

But we'd also need to design an API around this in fastexcel: do we want to load the data with an attribute indicating that the cell contains invalid data ? Do we want to provide a parameter indicating what to do with errors ? This could be 'ignore' | 'store' | 'raise' .

I'm afraid there's no quick and easy solution for this one