delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[BUG] Delta chronically behind Databricks #1775

Open th0ma5w opened 1 year ago

th0ma5w commented 1 year ago

Bug

Describe the problem

When trying to interact with Databricks-produced Delta objects, there is no open source-compatible version, and so this project is a de facto advertisement and vendor lock-in for the Databricks platform.

Steps to reproduce

Observed results

Expected results

Further details

This is all thoroughly and publicly documented in this project's docs and issues.

Environment information

On vs. off Databricks platform

Willingness to contribute

I would be willing to test.

th0ma5w commented 1 year ago

IANAL, and as best I can read the license, Databricks is free to use this software to provide their product, but it seems against at least the spirit of the license to call it by the same name and to claim it is open source(-like).

roniburd22 commented 1 year ago

I'd be interested to know what your findings were; it's hard to tell from the comments.

dennyglee commented 1 year ago

Hi @th0ma5w - I'd like to understand your context better as this issue does not provide the necessary background. If you would like, feel free to ping me at denny[dot]lee[at]databricks.com, and I'm glad to have a conversation at your convenience.

And if you're up for it, we can summarize our conversation in this thread to provide context for complete and full transparency.

th0ma5w commented 1 year ago

If someone were to provide me a backup of files produced using Delta Lake-related product offerings, is there any code at https://github.com/delta-io that would be 100% round-trip feature complete for both reading and writing these files? When I look at roadmaps here, there is talk of "supporting higher protocol versions," perhaps by the end of the year. When I see other vendors describing how to do this, they don't mention any of the resources here: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-databricks-delta-lake?tabs=data-factory#delta-lake-as-source I understand I could get a lot done with the code in this project and the others under this organization, but it seems that this software does not do this one thing, and no libraries here are fully compatible with Databricks, although Databricks may, I believe, be able to work with files produced by this project? If this is wrong, I would most certainly apologize, but this certainly seems like my reading of the documentation of these projects and the implied separation, in this link, between the product offering and the open source projects: https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html ... Or is there more explicit documentation on how they differ?

th0ma5w commented 1 year ago

So I kept googling, and I read a lot about the open sourcing of everything about a year ago. Around the same time there was this blog post: https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/ Now, I know they do not speak for this project or any business for sure, but they do outline the differences between OSS and Databricks write compatibility. I guess I see commitments for resolving this, but has it, in fact, been resolved?

dennyglee commented 1 year ago

Hi @th0ma5w - let me try to answer some of your questions:

Per your comment:

I understand I could get a lot done with the code in this project and the others under this organization, but it seems that this software does not do this one thing, and no libraries here are fully compatible with Databricks, although Databricks may, I believe, be able to work with files produced by this project?

While Table Features certainly simplify our ability to flag and document this process, a key thing missing is the test infrastructure to validate that all of these different APIs actually perform as they state (via documentation, code, or otherwise). Because of this project's pace and the many different users, contributors, and organizations that are using Delta, we have not done a good job keeping up with this.

To address this, we are fortunate that we started the Delta Acceptance Testing (DAT) project earlier this year. We have had various meetings and contributions, initially with the Python, Rust, Spark, and Trino contributors (note this is open to everyone), to build up test infrastructure so that all APIs could both document and have associated test cases (i.e., pass/fail) for each of the different features. It is still early for this project (hence why it's part of delta-incubator). Still, it allows us to test and validate that the framework will address the needs of the community as a whole (e.g., different sets of validation queries between Trino SQL, DataFusion, Polars, Spark SQL, etc.).
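To make that concrete, here is a minimal Python sketch of the kind of per-feature check such a framework enables, assuming each reference case ships a Delta table alongside engine-independent expected results stored as Parquet. The directory layout and file names below are hypothetical, not DAT's actual structure:

```python
# Hypothetical acceptance-style check: read a reference Delta table with the
# connector under test and compare it to engine-independent expected results.
# The "delta/" and "expected.parquet" names are illustrative only.
import pandas as pd
from deltalake import DeltaTable  # delta-io/delta-rs Python bindings

def check_reference_case(case_dir: str) -> bool:
    actual = DeltaTable(f"{case_dir}/delta").to_pandas()
    expected = pd.read_parquet(f"{case_dir}/expected.parquet")
    # Sort rows and columns so ordering differences don't cause false failures.
    cols = sorted(actual.columns)
    actual = actual[cols].sort_values(cols).reset_index(drop=True)
    expected = expected[cols].sort_values(cols).reset_index(drop=True)
    return actual.equals(expected)
```

The same expected results can then be run against any connector (Spark SQL, Trino, Polars, etc.), which is what makes the framework engine-agnostic.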

Per your comment:

If this is wrong, I would most certainly apologize, but this certainly seems like my reading of the documentation of these projects and the implied separation, in this link, between the product offering and the open source projects: https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html ... Or is there more explicit documentation on how they differ?

No need to apologize; that's the whole point of these forums for discussion! The blog post you mentioned, Diving Into Delta Lake: Unpacking the Transaction Log (which you can also watch here), discusses how the Delta transaction protocol itself works. While it uses Apache Spark™ as its example (the blog was written in 2019, shortly after our initial open-sourcing of Delta Lake), the protocol itself is API agnostic. For example, here's a great webcast with @houqp, who created Delta Rust: D3L2: The Genesis of Delta Rust with QP Hou, discussing how he built Delta Rust from the protocol documentation over a few weekends.

As for the implied difference between the OSS and Databricks offerings, we remedied this as part of Delta Lake 2.0, which was announced at Data + AI Summit 2022. Some information to help provide context:

As for your callout for better documentation on how they differ, you are right, and I have called myself out in the Delta Users Slack in this thread.

HTH!

MrPowers commented 1 year ago

@th0ma5w - thanks for opening this issue.

delta-io/delta is a reference implementation of the Delta Lake transaction log protocol. There are multiple implementations of the Delta Lake transaction log protocol in various stages of development, including delta-io/delta-rs and dask-contrib/dask-deltatable.

When trying to interact with Databricks-produced Delta objects, there is no open source-compatible version

There are multiple open source implementations that are interoperable with Delta tables created by Databricks. This is one of the main benefits of the Lakehouse architecture.

If someone were to provide me a backup of files produced using Delta Lake-related product offerings, is there any code at https://github.com/delta-io that would be 100% round-trip feature complete for both reading and writing these files?

Yes, and readers/writers that implement the Delta Lake transaction log protocol are interoperable. Your question refers to the delta-io GitHub organization (not this particular repo), so I'll give a more generic answer. A Delta table can only be read by a library that supports that table's protocol version. For example, a Delta table written with deletion vectors enabled can only be read by a library that supports deletion vectors. At the time of writing, delta-io/delta can read Delta tables with deletion vectors enabled and delta-io/delta-rs cannot.
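As a rough sketch of what that check looks like before attempting a read, using the deltalake Python package (delta-io/delta-rs); the path is illustrative and the exact fields may vary by package version:

```python
from deltalake import DeltaTable

dt = DeltaTable("/path/to/table")  # illustrative path
protocol = dt.protocol()
# A reader can only open this table if it supports at least these protocol
# versions (and, on newer protocol versions, any listed table features such
# as deletion vectors).
print(protocol.min_reader_version, protocol.min_writer_version)
```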

I created a diagram the other day that shows how you can create a Delta table with PySpark (delta-io/delta), append to it with pandas (delta-io/delta-rs), and read it with Polars, which might help highlight the interoperable nature of Delta tables:

[Diagram: create a Delta table with PySpark, append with pandas, read with Polars]
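A minimal sketch of that round trip, assuming a Spark session configured for Delta Lake (delta-spark) and the deltalake and polars packages installed; the path and schema are illustrative:

```python
import pandas as pd
import polars as pl
from deltalake import write_deltalake
from pyspark.sql import SparkSession

path = "/tmp/interop_table"  # illustrative path

# 1. Create the table with PySpark (delta-io/delta).
spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"]) \
    .write.format("delta").save(path)

# 2. Append to it with pandas via deltalake (delta-io/delta-rs).
write_deltalake(path, pd.DataFrame({"id": [3], "letter": ["c"]}), mode="append")

# 3. Read the combined table with Polars.
print(pl.read_delta(path))
```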

I understand I could get a lot done with the code in this project and the others under this organization, but it seems that this software does not do this one thing, and no libraries here are fully compatible with Databricks, although Databricks may, I believe, be able to work with files produced by this project?

There is full interoperability between writers that comply with the Delta Lake transaction log protocol and readers that support it. This is the beauty of the Lakehouse architecture. A user can spin up a machine that ingests data from Kafka into a Delta table using delta-io/kafka-delta-ingest, and that Delta table can be read by any technology that follows the protocol.

So I kept googling, and I read a lot about the open sourcing of everything about a year ago. Around the same time there was this blog post: https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/ Now, I know they do not speak for this project or any business for sure, but they do outline the differences between OSS and Databricks write compatibility. I guess I see commitments for resolving this, but has it, in fact, been resolved?

I think you're referring to the "Tool Write Capability" diagram at the top of the blog post. I looked at it for 10 minutes and don't understand what it is trying to communicate. I think the diagram is implying that there is not an open source Flink writer for Delta Lake? Here is the open source Delta Lake Flink connector with write support. I don't want to go into that blog post in detail, but it seems to make several incorrect or misleading claims and omits important parts of the conversation, like Z-Ordering.

Thanks again for opening this issue. Delta Lake is a Lakehouse Storage System, as described in this paper, and any implementation that abides by the spec will, in turn, allow for an interoperable Lakehouse Architecture. Let me know if you have any additional questions.

th0ma5w commented 1 year ago

I feel like this is describing perhaps several layers of potential compatibility with various specific features... Is there a feature matrix available? Is it known or documented somewhere how collections produced by the Databricks platform differ, or are we saying that there is nothing but the open source versions in use at Databricks today?

dennyglee commented 1 year ago

You're right - per this thread - it's on my priority list to provide.

MrPowers commented 1 year ago

@th0ma5w - Yea, each individual project should clearly indicate the protocol version/table features that are supported.

I was just going to open an issue to request this in delta-io/delta-rs and realized that @roeap beat me to it in this PR: https://github.com/delta-io/delta-rs/pull/1440/files

dennyglee commented 1 year ago

Hi @th0ma5w - from your perspective, when we document this, would it address your concerns? No worries; I'm not trying to close this issue, I just want to determine the correct action items. Much appreciated!

th0ma5w commented 1 year ago

I think this would go very far if I were tasked to do this, but it might be hard to unpack depending on how it is documented. I guess I'm not ecstatic that the answer to "can you take a Databricks-hosted, Delta-dependent process and use open source tools to operate it" is "probably not if it is very old, maybe more probable if it is newer, but let's see if you can solve our feature matrix puzzle." ... A list of what works only on the Databricks platform would be very helpful, and also, kindly, is there a way I can contact Databricks legal or something so that the Databricks organization stops calling this stuff open source?

felipepessoto commented 3 months ago

@dennyglee I've been actively opening issues/PRs over the past year, and I have some suggestions that might help improve the perception of Delta OSS. If you are interested, please send me a message on Slack.

th0ma5w commented 3 months ago

I did get the ping on this today. I see there has been no movement: Delta is still advertised as an open source offering as part of the Databricks service offerings, and that continues to not seem true to me given the details of this issue.

I do not have any information that has not been made public with which to refute it, so this all seems factual... I guess I see it as:

  1. Delta is an offering of Databricks as a commercial service.
  2. This project has been pushed as a version that has been made open source but differs from that offering in undocumented ways.
  3. While the claim on the site about Delta being open source seems true insofar as this project exists, it does not make clear that, as a customer, you would potentially not be using the open source version, and anything created with the Databricks service would only be 100% guaranteed to work within the proprietary commercial service.
  4. There is a possibility it could work, but if you do not align with this yet-to-be-published feature matrix, then your work with the commercial product would potentially not be compatible with the published open source code, depending on which features you're using and how long you have used the commercial product.

Is there any disagreement with these items? I saw a thumbs up from maintainers on an earlier post, but I wanted to re-summarize it all for clarity.

MrPowers commented 3 months ago

@th0ma5w - thanks for the ping.

Delta is an offering of Databricks as a commercial service.

Delta is a Lakehouse storage system. Implementations should follow the Delta Lake transaction log protocol. There are many implementations like the Microsoft Fabric Lakehouse implementation.

This project has been pushed as a version that has been made open source

This repo contains delta-spark, delta-kernel, and Delta Flink. Other repos contain the Rust implementation, the C# implementation, and the Dask connector. You should be able to fetch the supported protocol versions/table features in each connector; if not, feel free to file an issue on the respective repo.
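For example, with delta-spark you can inspect what a given table requires and compare that against the versions/features a connector documents. A sketch, assuming an already-configured Delta-enabled Spark session; the table path is illustrative:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta extensions.
spark = SparkSession.builder.getOrCreate()

# DESCRIBE DETAIL reports, among other things, the protocol versions the
# table requires; compare these against what a given connector supports.
detail = spark.sql("DESCRIBE DETAIL delta.`/path/to/table`").collect()[0]
print(detail["minReaderVersion"], detail["minWriterVersion"])
```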

it does not make clear that, as a customer, you would potentially not be using the open source version

The word "customer" is ambiguous: there is Amazon EMR Delta Lake, Microsoft Fabric Delta Lake, and Delta Lake on BigQuery. So yes, if someone makes a Delta table with delta-rs and enables a table feature that's not yet supported by another implementation, then a user may face issues.
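To illustrate the writer side of that, here is a sketch with the deltalake package: setting a table property at creation time can raise the protocol version the table requires, which locks out connectors that don't support it yet. The path and the choice of property are illustrative:

```python
import pandas as pd
from deltalake import write_deltalake

# Creating a table with delta.appendOnly set requires a higher writer
# protocol version, so writers that don't support it can no longer
# modify this table.
write_deltalake(
    "/tmp/append_only_table",  # illustrative path
    pd.DataFrame({"id": [1, 2]}),
    configuration={"delta.appendOnly": "true"},
)
```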

but if you do not align with this yet-to-be-published feature matrix, then your work with the commercial product would potentially not be compatible with the published open source code

Yea, this is possible and we're trying to solve the issue with Delta Kernel.

delta-rs recently started depending on delta-kernel-rs.

Apache Druid is being built with Delta Kernel Java, see this PR.

We just chatted with the open source community about updating delta-dotnet to depend on delta-kernel-rs.

Is there any disagreement with these items?

I agree that the compatibility story needs to get better, and the devs have been working a lot on Delta Kernel Rust and Delta Kernel Java to solve this in a sustainable manner. All Lakehouse storage systems suffer from this issue; it's a really hard engineering problem. We're trying and investing a ton of resources, but it's really hard. Kernel implementations need to be two-way, zero-dependency, and engine agnostic, and Delta Kernel Rust needs a clean C FFI.

There has been lots of forward progress since your initial message. Look at the 15,000+ lines of Rust code in delta-kernel-rs and all the Delta Kernel Java code in this repo. Fully building out Delta Kernel and getting it integrated into the entire connector ecosystem is a lot of work.

Delta is for all engines and all runtime environments. It seems like you're focusing on one engine and one runtime, but we're trying to solve a much broader problem.

th0ma5w commented 2 months ago

@MrPowers

Delta is for all engines and all runtime environments. It seems like you're focusing on one engine and one runtime, but we're trying to solve a much broader problem.

  1. Does the product grant me the ability to offer a commercial service using this software plus proprietary and incompatible changes and call it Open Source Delta?
  2. Are you saying the Rust product is 100% compatible with anything developed on Databricks thus far?

There has been lots of forward progress since your initial message.

Is there a document that states that Databricks is now using a 100% open source-compatible implementation? At the time of this ticket, and as best as I can google now, they say that their Delta offering is open source. My experience at the time of opening this ticket was that no public repo of code I could find, written in any language, was compatible with the files I was trying to read. Not being a direct customer of Databricks, I could only open a ticket here.

Fully building out Delta Kernel and getting it integrated into the entire connector ecosystem is a lot of work.

If Delta is Open Source, why couldn't the Databricks implementation simply be published?

I guess my current status is that I cannot recommend Delta for anyone that has ever been a customer of Databricks, because there is simply no way to prove compatibility, or, for that matter, any clear incompatibility?

Yea, this is possible and we're trying to solve the issue with Delta Kernel.

So I guess the best that can be said is that making the product Open Source is being worked on, but I'm not sure why you replied; there is no new information, I guess?