Open vkorukanti opened 1 year ago
Attached the design doc.
Thank you @vkorukanti
this is awesome! I already developed a connector for Go: https://github.com/csimplestring/delta-go. I will definitely refactor it to follow this design!
This looks great! In the initial release, the plan is to implement the Kernel APIs in which languages?
We are starting with Java, and we are definitely interested in Rust. Beyond that, I definitely want to discuss more languages with the community. For example, @csimplestring I wonder: if there is a Delta Kernel implemented in Rust, can the Go implementation just call into it? I am not familiar with the Rust ecosystem, and maintainers of delta-io/delta-rs like @rtyler @wjones127 have centuries more experience with such matters. But I hope we can just build a Rust Kernel and have other close-to-native languages use it.
Hi @felipepessoto @tdas I am keeping a closer eye on this Java version kernel api development. Yes, definitely I plan to support this in Go.
This repo, https://github.com/csimplestring/delta-go, is a Go implementation of the Scala version of the Standalone connector, in which all features are supported except for the S3 multi-cluster log store.
For the Delta Kernel, I created a new repo and am closely following the development of this Java version, adding the API interfaces first. I think it will not be difficult to reuse the code.
Thanks @tdas. And about the Spark Scala version, are changes expected? Like refactoring it to work on top of a Scala Kernel API?
@felipepessoto We are very far right now from having a concrete plan for the Spark Scala version, which is already heavily optimized for the Spark platform. We have to design and build Delta Kernel write support first.
I'm interested and happy to help where I can. I'm putting together a data access layer over top of our delta lake, and it is mostly Rust right now. Consistency would be key to helping get this out the door (jumping between Scala, Python, and Rust in the world of Delta/Spark is a bit of an interesting trip).
Feature request
This is an uber issue for designing and implementing APIs (Delta Kernel) to unify and simplify how connectors read Delta Lake tables. Currently the focus is on reading; support for writing will be added later.
Motivation
The Delta connector ecosystem is currently fragmented, with too many independent protocol implementations: Delta Spark, Delta Standalone, Trino, delta-rs, Delta Sharing, etc. This leads to the following problems:
High variability in performance and bugs in connectors - Each implementation interprets the spec in its own way, causing suboptimal performance and data-correctness bugs.
Sluggish adoption of new protocol features - Whenever there is a protocol update, every implementation needs to be updated separately. Furthermore, even when multiple connectors share the log replay implementation, each connector currently requires a deep understanding of the protocol details to implement the actual data operations (i.e., reads, writes, upserts) correctly. For example, Delta Standalone hides the details of the log file formats, but ultimately exposes raw actions via programmatic APIs. Connectors using Standalone must understand all the nitty-gritty details of the column stats in the `AddFile`s to use them correctly for data skipping. Such friction prevents new connector creation and slows down adoption of new protocol features in existing connectors.

To reduce fragmentation and speed up the rate of innovation, we have to simplify and unify the connector ecosystem.
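To make that friction concrete, here is a minimal sketch in Java of the kind of min/max pruning logic every connector must currently reimplement on top of the `AddFile` column stats. The class and method names (`StatsSkipping`, `canSkip`) are illustrative, not a real Standalone API; the stat maps stand in for the parsed `minValues`/`maxValues` carried in an `AddFile` action:

```java
import java.util.Map;

public class StatsSkipping {
    // Decide whether a file can be skipped for an equality predicate
    // `column = value`, given the per-file min/max stats from its AddFile.
    static boolean canSkip(Map<String, Long> minValues,
                           Map<String, Long> maxValues,
                           String column, long value) {
        Long min = minValues.get(column);
        Long max = maxValues.get(column);
        if (min == null || max == null) {
            return false; // no stats for this column: the file must be read
        }
        // Skip only when the value falls outside the file's [min, max] range.
        return value < min || value > max;
    }

    public static void main(String[] args) {
        Map<String, Long> min = Map.of("id", 100L);
        Map<String, Long> max = Map.of("id", 200L);
        System.out.println(canSkip(min, max, "id", 50L));  // skippable
        System.out.println(canSkip(min, max, "id", 150L)); // must read
    }
}
```

Every connector today has to get details like the null-handling and range semantics above exactly right on its own; the Kernel aims to own this logic once.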
Simplify the programmatic APIs for building connectors - We want to build a "kernel" library (or a small set of them in different languages) that hides all the protocol details of all operations behind simple library APIs. Connectors will just use those APIs to get scan file data that they can forward to the engine without any interpretation of the underlying raw actions. The engine will then use the Kernel APIs to read the data from those scan files; for example, for reads, `ScanFile.read(scan file record)` returns rows without the connector having to understand which file action the data is coming from.

Unify the ecosystem - With these simplified APIs, we will be able to encourage new connectors to be built on them, and we can gradually convince the community to transition existing connectors to them too.
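To illustrate the intended division of labor, here is a toy sketch in Java. All names here (`Scan`, `Row`, `getScanFiles`, `readData`, `toyScan`) are hypothetical stand-ins rather than the actual Kernel interfaces; the point is only that the connector's read loop needs no knowledge of the log format or file actions:

```java
import java.util.Iterator;
import java.util.List;

public class KernelReadSketch {
    // A data row handed back by the kernel; a real API would be typed/columnar.
    interface Row { Object get(int ordinal); }

    // Kernel-style scan: the connector iterates opaque scan-file records and
    // hands each one back to the kernel to read rows. It never interprets
    // AddFile actions itself.
    interface Scan {
        Iterator<Row> getScanFiles();
        Iterator<Row> readData(Row scanFileRecord);
    }

    // Toy stand-in for the kernel: two "files" of longs, purely illustrative.
    static Scan toyScan() {
        List<List<Long>> files = List.of(List.of(1L, 2L), List.of(3L));
        return new Scan() {
            public Iterator<Row> getScanFiles() {
                // Each scan-file record is opaque to the connector.
                return files.stream().map(f -> (Row) (ordinal -> f)).iterator();
            }
            @SuppressWarnings("unchecked")
            public Iterator<Row> readData(Row scanFileRecord) {
                List<Long> file = (List<Long>) scanFileRecord.get(0);
                return file.stream().map(v -> (Row) (ordinal -> v)).iterator();
            }
        };
    }

    // The entire connector read loop: no protocol knowledge required.
    static long sumAll(Scan scan) {
        long total = 0;
        Iterator<Row> scanFiles = scan.getScanFiles();
        while (scanFiles.hasNext()) {
            Iterator<Row> rows = scan.readData(scanFiles.next());
            while (rows.hasNext()) {
                total += (Long) rows.next().get(0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumAll(toyScan())); // sums 1 + 2 + 3
    }
}
```

The design choice being sketched: scan-file records flow connector → engine → kernel as opaque data, so only the kernel ever needs to change when the protocol evolves.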
Further details
See the design doc for details, and the presentation for a high-level overview.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
Project Plan
Delta 3.0
- `ColumnVector` interface

Delta 3.1
- `id` column mapping mode

Laundry List
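As a rough illustration of the planned `ColumnVector` interface, here is a minimal sketch in Java. The method names are guesses for illustration only, not the committed Kernel API; a real interface would carry typed accessors for every supported data type:

```java
public class ColumnVectorSketch {
    // Hypothetical columnar accessor: one column's values for a batch of rows.
    interface ColumnVector {
        int getSize();                // number of rows in this vector
        boolean isNullAt(int rowId);  // null mask lookup
        int getInt(int rowId);        // one typed accessor shown; real APIs have more
    }

    // Toy backing implementation: an int array plus a parallel null mask.
    static ColumnVector of(int[] values, boolean[] isNull) {
        return new ColumnVector() {
            public int getSize() { return values.length; }
            public boolean isNullAt(int rowId) { return isNull[rowId]; }
            public int getInt(int rowId) { return values[rowId]; }
        };
    }

    public static void main(String[] args) {
        ColumnVector v = of(new int[]{10, 0, 30},
                            new boolean[]{false, true, false});
        int sum = 0;
        for (int i = 0; i < v.getSize(); i++) {
            if (!v.isNullAt(i)) sum += v.getInt(i); // skip the null slot
        }
        System.out.println(sum); // 10 + 30
    }
}
```

An interface like this lets engines plug in their own columnar memory (e.g. Arrow-backed) behind the same accessors.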