delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Delta Kernel cannot get file stats #3771

Open dongxiao1198 opened 1 month ago

dongxiao1198 commented 1 month ago

Feature request

Overview

Since delta-standalone has been deprecated, we are migrating our project from delta-standalone to delta-kernel. However, we found that delta-kernel cannot return file stats when scanning file lists.

In delta-standalone, we can get file stats in this class: . We can also get the change logs using "Iterator getChanges" in io.delta.standalone.DeltaLog, which likewise cannot be listed in delta-kernel.

wgtmac commented 2 weeks ago

@nastra Could you please take a look at this?

nastra commented 2 weeks ago

FYI @scottsand-db

scottsand-db commented 2 weeks ago

Hi @wgtmac -- can you please tell me a bit more about your use case for file stats and for getChanges?

We allow you to include a filter during the ScanBuilder -- what more would you need the file stats for?

Could you also please look at this internal (not public) API for getChanges in Kernel and see if that fits your use case? We can consider making it public.

https://github.com/delta-io/delta/blob/6ae4b62845ed579bb5a19f4646831c4ee2931c02/kernel/kernel-api/src/main/java/io/delta/kernel/internal/TableImpl.java#L182

wgtmac commented 2 weeks ago

Thanks for the reply from @scottsand-db and help from @nastra!

We use Delta Kernel as a metadata client in our proprietary lakehouse to read Delta Lake tables. To efficiently make splits at any snapshot and cache the file lists, we need the following metadata from the API, all of which is available in delta-standalone:

  1. Column stats: Carry the column stats (at least the min/max values, if available) of each Parquet file, so that we can prune the list of files to scan on a best-effort basis.
  2. Get latest snapshot version: A cheap way to return the current version without actually replaying the Delta logs.
  3. Get change logs between arbitrary snapshots: Sometimes we need to cache the file list of a specific version and then incrementally sync it to the latest version. It would be great if the Delta client supported an incremental scan that returns the file list changes within a specified version range.
  4. Stateful table object: This is similar to request 3 above. The current table object is pinned to a snapshot and cannot call update() to incrementally sync to the latest version, which the standalone library supports.
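
To make requests 1 and 3 more concrete, here is a minimal Java sketch of what we would do with that metadata on our side. It does not call any Delta Kernel or delta-standalone API: `FileEntry`, `Change`, `prune`, and `sync` are hypothetical names made up for illustration, and the stats are simplified to the min/max of a single integer column. It also assumes the changes arrive ordered by version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FileListCacheSketch {
    // Hypothetical per-file metadata: path plus min/max of one integer column.
    // min and max are null when the file carries no stats.
    record FileEntry(String path, Long min, Long max) {}

    // Hypothetical change record: a file added or removed at some table version.
    record Change(long version, boolean isAdd, FileEntry file) {}

    // Request 1: keep only files whose [min, max] range can overlap [lo, hi].
    // Files without stats are always kept because they cannot be ruled out.
    static List<FileEntry> prune(Iterable<FileEntry> files, long lo, long hi) {
        List<FileEntry> kept = new ArrayList<>();
        for (FileEntry f : files) {
            boolean noStats = f.min() == null || f.max() == null;
            if (noStats || (f.max() >= lo && f.min() <= hi)) {
                kept.add(f);
            }
        }
        return kept;
    }

    // Request 3: apply the changes with version in (cachedVersion, targetVersion]
    // to a cached file list keyed by path, and return the new cached version.
    // Assumes `changes` is ordered by version.
    static long sync(Map<String, FileEntry> cache, long cachedVersion,
                     long targetVersion, List<Change> changes) {
        for (Change c : changes) {
            if (c.version() <= cachedVersion || c.version() > targetVersion) {
                continue; // outside the requested version range
            }
            if (c.isAdd()) {
                cache.put(c.file().path(), c.file());
            } else {
                cache.remove(c.file().path());
            }
        }
        return targetVersion;
    }

    public static void main(String[] args) {
        // File list cached at version 0.
        Map<String, FileEntry> cache = new LinkedHashMap<>();
        cache.put("a.parquet", new FileEntry("a.parquet", 1L, 10L));
        cache.put("b.parquet", new FileEntry("b.parquet", 20L, 30L));
        cache.put("c.parquet", new FileEntry("c.parquet", null, null));

        // Changes committed in versions 1 and 2.
        List<Change> changes = List.of(
            new Change(1, true, new FileEntry("d.parquet", 40L, 50L)),
            new Change(2, false, new FileEntry("a.parquet", 1L, 10L)));

        long version = sync(cache, 0, 2, changes);
        System.out.println("cached version: " + version);

        // Prune the synced list for a predicate equivalent to 25 <= col <= 45.
        for (FileEntry f : prune(cache.values(), 25, 45)) {
            System.out.println(f.path());
        }
    }
}
```

With per-file stats and a version-ranged change log exposed by the client, the cache can be brought forward without re-listing the whole table, and split planning can skip files before any Parquet footer is read.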

Hopefully my explanation makes sense.