delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

warning message #1457

Open dcbagger opened 1 year ago

dcbagger commented 1 year ago

When loading a Delta table we get the warning shown below. There seems to be no way to suppress it.

from deltalake import DeltaTable

dt = DeltaTable(PATH)

df = dt.to_pandas(
    partitions=[(self.filter_column, '=', self.filter_value)],
    columns=self.projection_columns,
)
...
[2023-06-13T08:16:39Z WARN  deltalake::action::parquet_read] Unexpected type when parsing min/max values for TgTermFormatId. Found null
...

deltalake version: 0.10.0

rtyler commented 11 months ago

@dcbagger The original report mentions an older version of the library. Do you happen to have a reproduction case you can share? At least on the Rust side of things, log levels can be modified with something like env_logger or pretty_env_logger, but I'm not sure what the Python library might be setting.
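For anyone wanting to experiment on the Python side, here is a minimal sketch built on `RUST_LOG`, the environment variable that env_logger reads. It assumes the Python bindings initialize a Rust logger that honors `RUST_LOG`, which is not confirmed anywhere in this thread. The helper name `quiet_rust_logs` is hypothetical, and the variable would have to be set before `deltalake` is imported to have any chance of taking effect.

```python
import os


def quiet_rust_logs(level: str = "error") -> str:
    """Set RUST_LOG so env_logger-style loggers only emit `level` and above.

    Assumption (unverified): the deltalake Python bindings read RUST_LOG
    when the extension module is first imported.
    """
    os.environ["RUST_LOG"] = level
    return os.environ["RUST_LOG"]


# Run this before `import deltalake`; a logger that is already
# initialized will not pick up a later change to the variable.
quiet_rust_logs("error")
# from deltalake import DeltaTable  # import only after the variable is set
```

If the warning still shows up with this in place, the assumption above probably does not hold for the installed version.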

rtyler commented 10 months ago

I have found a way to reproduce this with the latest Python release on a Spark-written data set. I think the appropriate behavior for nulls is going to be to coerce them to zero for int types, etc.

ion-elgreco commented 10 months ago

@rtyler I don't think that would be correct though. Replacing nulls with zero can be quite problematic, since suddenly your distribution of values is messed up.
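A rough sketch of the concern, using made-up values and hypothetical helpers (`min_max`, `may_contain`) rather than anything from delta-rs: coercing nulls to zero widens the min/max range, and coercing a missing bound to zero can make a reader skip a file that actually holds the value.

```python
from typing import Optional, Sequence, Tuple

Bounds = Tuple[Optional[int], Optional[int]]


def min_max(values: Sequence[Optional[int]], null_as_zero: bool) -> Bounds:
    """Return (min, max), either ignoring nulls or coercing them to 0."""
    if null_as_zero:
        cleaned = [0 if v is None else v for v in values]
    else:
        cleaned = [v for v in values if v is not None]
    if not cleaned:
        return None, None  # no non-null values: no meaningful bounds
    return min(cleaned), max(cleaned)


def may_contain(stats: Bounds, target: int) -> bool:
    """Data-skipping check: could a file with these bounds hold `target`?"""
    lo, hi = stats
    if lo is None or hi is None:
        return True  # unknown bounds: the file cannot be skipped safely
    return lo <= target <= hi


values = [120, None, 180]  # made-up column values with one null

honest = min_max(values, null_as_zero=False)  # (120, 180)
coerced = min_max(values, null_as_zero=True)  # (0, 180)

# The coerced range claims the file might hold 50, so it is scanned for nothing.
print(may_contain(honest, 50), may_contain(coerced, 50))

# Worse: unknown bounds coerced to (0, 0) would wrongly skip a file holding 150.
print(may_contain((None, None), 150), may_contain((0, 0), 150))
```

The first case only costs performance; the second is a correctness problem, because a file that may contain matching rows gets pruned.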

rtyler commented 10 months ago

That's fair. What do you think the right behavior for handling nulls in file stats should be?

roeap commented 10 months ago

Not sure what the best behaviour in this case would be, but generally speaking we likely have some homework to do when it comes to how we process file stats. Mainly b/c null is not treated consistently by various engines when it comes to ordering: it's either the highest or the lowest value. So we have the problem that in a nullable column, we are also saying the lowest/highest value is null, which is different from "there are no stats".

We likely have to read the protocol a bit more to see if it takes a stance on this :). One thing I did find really quick is the discussion on the nullCount field in the protocol. https://github.com/delta-io/delta/blob/master/PROTOCOL.md#per-file-statistics
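To make the distinction concrete, here is a small sketch that classifies one column of a per-file stats JSON string. The field names (`numRecords`, `minValues`, `maxValues`, `nullCount`) come from the protocol section linked above; the function `describe_column_stats`, its four labels, and the example inputs are hypothetical and only illustrate one possible reading, not what delta-rs does.

```python
import json

_MISSING = object()  # sentinel: key absent, as opposed to present with a JSON null


def describe_column_stats(stats_json: str, column: str) -> str:
    """Classify a column's stats as 'all null', 'no stats', 'null bounds' or 'bounded'."""
    stats = json.loads(stats_json)
    num_records = stats.get("numRecords")
    null_count = (stats.get("nullCount") or {}).get(column)
    min_value = (stats.get("minValues") or {}).get(column, _MISSING)
    max_value = (stats.get("maxValues") or {}).get(column, _MISSING)

    # nullCount equal to numRecords means every value in the file is null.
    if null_count is not None and num_records is not None and null_count == num_records:
        return "all null"
    # Neither bound was written at all: there are simply no stats.
    if min_value is _MISSING and max_value is _MISSING:
        return "no stats"
    # A bound is present but is an explicit null: the ambiguous case.
    if min_value is None or max_value is None:
        return "null bounds"
    return "bounded"


# Made-up example: explicit null bounds on a column that is not entirely null.
example = json.dumps({
    "numRecords": 3,
    "minValues": {"TgTermFormatId": None},
    "maxValues": {"TgTermFormatId": 9},
    "nullCount": {"TgTermFormatId": 1},
})
print(describe_column_stats(example, "TgTermFormatId"))
```

Checking `nullCount` first is a design choice in this sketch: it lets an all-null column be recognized even when the writer also emitted null bounds for it.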