delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive.
https://delta.io
Apache License 2.0

Support Delta Lake as Apache Beam I/O #243

Open · chethanuk opened this issue 5 years ago

chethanuk commented 5 years ago

Are there plans to integrate delta.io with Apache Beam? For example, ParquetIO is supported: https://beam.apache.org/documentation/io/built-in/

Since Delta is an open-source storage layer, having an Apache Beam I/O would let us integrate it directly into Beam pipelines.
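
For context, this is roughly what reading a Delta table's underlying Parquet data files with Beam's existing ParquetIO looks like today (the table path and schema below are placeholders). Note that a plain file pattern ignores the Delta transaction log, so logically removed or uncommitted files would also be read; that gap is what a dedicated connector would close.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadDeltaParquetFiles {
  public static void main(String[] args) {
    // Hypothetical table location and schema, for illustration only.
    String tablePath = "s3://my-bucket/events";
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("id")
        .requiredLong("ts")
        .endRecord();

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Reads every Parquet file under the table path. Unlike a real Delta
    // connector, this does NOT consult _delta_log, so it may pick up files
    // that have been logically removed from the table.
    PCollection<GenericRecord> records =
        p.apply(ParquetIO.read(schema).from(tablePath + "/*.parquet"));

    p.run().waitUntilFinish();
  }
}
```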

mukulmurthy commented 5 years ago

This isn't currently on our roadmap, but we'd be happy to provide support if someone from the community wanted to build this.

If Beam has any general-purpose connectors that use Spark's DataFrameWriter/DataFrameReader APIs, one option is to just use those. You can read from Delta by following the Delta protocol (https://github.com/delta-io/delta/blob/master/PROTOCOL.md), though we're also working on a general-purpose Delta Lake reader that will help with that. Writing to Delta is a bit trickier, as right now there's not really a supported way to do that outside of Spark.
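
For anyone exploring the protocol route before a general-purpose reader exists, here is a rough sketch (not production code) of replaying the JSON commits in `_delta_log` to compute the set of active data files, which is the core of reading a Delta table per PROTOCOL.md. It ignores checkpoint files and other actions for brevity.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeltaLogReplay {
  // Replays JSON commit files in _delta_log (ignoring checkpoints for brevity)
  // and returns the set of data files that are currently part of the table.
  public static Set<String> activeFiles(Path tableRoot) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    Set<String> active = new LinkedHashSet<>();
    try (Stream<Path> commits = Files.list(tableRoot.resolve("_delta_log"))) {
      List<Path> ordered = commits
          .filter(p -> p.getFileName().toString().endsWith(".json"))
          .sorted() // commit file names are zero-padded, so lexical order == version order
          .collect(Collectors.toList());
      for (Path commit : ordered) {
        for (String line : Files.readAllLines(commit)) {
          JsonNode action = mapper.readTree(line);
          if (action.has("add")) {
            active.add(action.get("add").get("path").asText());
          } else if (action.has("remove")) {
            active.remove(action.get("remove").get("path").asText());
          }
        }
      }
    }
    return active; // paths are relative to the table root
  }
}
```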

marmbrus commented 5 years ago

#245 tracks the work to build a library for querying Delta metadata.

kesavkolla commented 3 years ago

It would be really good to develop a read/write interface outside of Spark; Beam especially would be ideal. In general, which components of Delta are tied to Spark? I do see another project for connectors, but it only supports reads, not writes. What is the option for people who are building large-scale data pipelines? I'm looking to build a massive data pipeline that moves billions of rows, and I believe that is where Apache Beam really makes sense.

kennknowles commented 3 years ago

(Beam person here): One way to think about Beam, which we don't emphasize enough, is as a repository for connectors. You need the rest of Beam's pipeline model in order to build them. All of our most scalable and feature-rich connectors end up as somewhat complex subgraphs.

Which is all to say: if you want a portable connector that can scale and offer rich functionality, you should use Beam or you'll end up re-inventing Beam :-)

A DataFrameWriter/DataFrameReader connector for Beam sounds viable and probably useful to get something done quickly. Later, a directly-implemented Beam connector would probably scale and compose better.
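
For a sense of the quick path: the Spark-API route essentially amounts to wrapping calls like the ones below (standard Delta Lake on Spark usage; the paths are placeholders and delta-core is assumed to be on the classpath), with the Beam-native connector left as the longer-term option.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDeltaRoundTrip {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("delta-roundtrip")
        .getOrCreate();

    // Read an existing Delta table through DataFrameReader.
    Dataset<Row> df = spark.read().format("delta").load("/data/events");

    // Write (append) through DataFrameWriter; Delta maintains the transaction log.
    df.write().format("delta").mode("append").save("/data/events_copy");

    spark.stop();
  }
}
```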

dennyglee commented 3 years ago

Thanks to @mbenenso, you can currently read Delta Lake via Beam: https://github.com/mbenenso/beam-deltalake. We will keep this issue open until we finalize a performance enhancement to the Beam reader related to this PR: https://github.com/delta-io/connectors/pull/156. Thanks!

jeganthirumeni commented 2 years ago

@dennyglee @mbenenso Does the Delta Lake standalone connector (https://github.com/mbenenso/beam-deltalake) support Beam only as a source, or as a sink as well? I don't see an API for a sink in DeltaFileIO.java.

mbenenso commented 2 years ago

When this code was developed, Delta Lake Standalone supported only reading, so DeltaFileIO only provides source functionality. The latest version of Delta Standalone also supports writing, so this code could be extended to provide a sink.
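
For anyone picking this up: a sink would roughly mean writing Parquet data files yourself and then committing them through Delta Standalone's transaction API. A minimal sketch of just the commit step, assuming the table already exists and is unpartitioned (the path, file name, and size below are placeholders), might look like this:

```java
import io.delta.standalone.DeltaLog;
import io.delta.standalone.Operation;
import io.delta.standalone.OptimisticTransaction;
import io.delta.standalone.actions.AddFile;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;

public class CommitToDelta {
  public static void main(String[] args) {
    // Assumes a Parquet data file has already been written inside the table dir.
    String tablePath = "s3a://my-bucket/events";            // placeholder
    String dataFile = "part-00000-example.snappy.parquet";  // placeholder, relative to tablePath
    long sizeBytes = 1024L;                                  // placeholder

    DeltaLog log = DeltaLog.forTable(new Configuration(), tablePath);
    OptimisticTransaction txn = log.startTransaction();

    AddFile add = new AddFile(
        dataFile,
        Collections.emptyMap(),        // partition values (unpartitioned table)
        sizeBytes,
        System.currentTimeMillis(),    // modification time
        true,                          // dataChange
        null,                          // stats
        null);                         // tags

    // Record the new file in the Delta transaction log.
    txn.commit(
        Collections.singletonList(add),
        new Operation(Operation.Name.WRITE),
        "beam-deltalake-sink-sketch");
  }
}
```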
