Open dcbagger opened 1 year ago
@dcbagger The original report mentions an older version of the library, do you happen to have a reproduction case you can share? At least on the Rust side of things log levels can be modified with something like `env_logger` or `pretty_env_logger`, but I'm not sure what the Python library might be setting.
I have found a way to reproduce this with the latest Python release on a Spark-written dataset. I think the appropriate behavior for nulls is going to be to coerce them to zero for int types, etc.
@rtyler I don't think that would be correct though, replacing nulls with zero can be quite problematic since suddenly your distribution of values is messed up
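To make the concern concrete, here is a minimal sketch with made-up values (not data from the report) comparing statistics computed over the non-null values with statistics computed after coercing nulls to zero. `None` stands in for a null.

```python
from statistics import mean

# Hypothetical column values; None stands in for a null.
values = [5, 7, None, 9, None]

# Statistics over the values that are actually present.
non_null = [v for v in values if v is not None]
true_min, true_mean = min(non_null), mean(non_null)

# Statistics after coercing every null to zero.
coerced = [0 if v is None else v for v in values]
coerced_min, coerced_mean = min(coerced), mean(coerced)

print("non-null:", true_min, true_mean)
print("coerced: ", coerced_min, coerced_mean)
```

With coercion, the minimum becomes a value that never appears in the data and the mean is pulled toward zero.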
That's fair. What do you think the right behavior for handling nulls in file stats should be?
Not sure what the best behaviour in this case would be, but generally speaking we likely have some homework to do when it comes to how we process file stats. Mainly because `null` is not treated consistently by various engines when it comes to ordering: it's either the highest or the lowest value. So we have the problem that in a nullable column, we are also saying the lowest/highest value is `null`, which is different from "there are no stats".
We likely have to read the protocol a bit more to see if it takes a stance on this :). One thing I did find really quickly is the discussion of the `nullCount` field in the protocol: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#per-file-statistics
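One possible shape for this, purely as a sketch: compute min/max over the non-null values only and track the nulls in a separate count, so "every value is null" stays distinguishable from "no stats were collected". The helper `column_stats` and its key names are hypothetical. This is not the deltalake API, and whether the protocol mandates this behavior is exactly the open question above.

```python
from typing import Optional, Sequence


def column_stats(values: Sequence[Optional[int]]) -> dict:
    """Hypothetical per-column stats: min/max ignore nulls, nulls are counted separately."""
    non_null = [v for v in values if v is not None]
    return {
        "num_records": len(values),
        "null_count": len(values) - len(non_null),
        # With no non-null values there is no meaningful min/max, so report None
        # instead of inventing a value such as zero.
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


print(column_stats([3, None, 1]))
print(column_stats([None, None]))
```

An all-null column then reports `null_count == num_records` with no min/max, while a file with no stats would simply have no stats entry at all.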
When trying to load delta we get the warning. There seems to be no way to suppress it.
deltalake version: 0.10.0