Open noctuid opened 1 month ago
It is the expected behavior, yes. The rationale is that if the dtype is specified, the user knows what kind of data is contained in the column.
I reckon it'd be nice to at least have a log in case the cell is of an unexpected type, but that would require work on calamine's side. We could probably add some methods to the DataType
trait which would return Result<Option<_>>
, with an error in case the cell is not empty, but isn't of the expected dtype either.
However we'd need to think a bit about this, since we don't want to affect performance (maybe through a new warn_on_unexpected_dtype
parameter ?).
@PrettyWood what are your thoughts on this ?
So does calamine just currently silently convert invalid data to nulls? Ideally, if bad data is converted to null, we'd need some way where we could get/add an extra bool column for each column to the dataframe to signify whether the data was a valid type or had to be converted to null.
Yes, calamine kinda silently converts unexpected data to nulls: it provides is_T() -> bool
methods for every excel datatype, along with get_T() -> Option<T>
. The get_T
methods return None
if they are called on an unexpected variant (get_datetime
on an int for example). You're supposed to call is_T
before using get_T
.
But for columns where the dtype is provided, fastexcel skips the inference step, since it requires iterating over a lot of values. This means, we directly use get_T
. A solution to this would be to add try_get_T
methods to calamine, which would allow to get an error in case the data is of an unexpected type.
But we'd also need to design an API around this in fastexcel: do we want to load the data with an attribute indicating that the cell contains invalid data ? Do we want to provide a parameter indicating what to do with errors ? This could be 'ignore' | 'store' | 'raise'
.
I'm afraid there's no quick and easy solution for this one
On 0.12.0, I'm not seeing an error if there is a row with e.g."foo" for a column that is set to "date" in "dtypes". That cell just is read as null. A string in a float column is also just converted to null. Is this expected? It would be great there was some way to get the locations of all failed cells, so that we could give the user back a file with bad cells highlighted.