delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs
https://delta.io
Apache License 2.0
6.98k stars 1.6k forks

Roadmap 2021 H2 (discussion) #748

Closed dennyglee closed 2 years ago

dennyglee commented 2 years ago

This is the proposed Delta Lake 2021 H2 roadmap discussion thread. Below are the initial proposed items for the roadmap to be completed by December 2021. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!

Issue Description Target CY2021
#731 Improve Delta protocol to support changes such as column drop and rename Q3
#732 Support Spark’s column drop and rename commands Q3
#101 Streaming enhancements to the standalone reader to support Pulsar, Flink Q3
#85 Delta Standalone Writer: This feature will allow other connectors such as Flink, Kafka, and Pulsar to write to Delta. Q4
#733 Support Apache Spark 3.2 Q4
#110 Delta Source for Apache Flink: Build a Flink/Delta source (i.e., Flink reads from Delta Lake) potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. CY2022 Q1
#111 Delta Sink for Apache Flink: Build a Flink/Delta sink (i.e., Flink writes to Delta Lake) potentially leveraging the Delta Standalone Writer. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. Q4
#82 Delta Source for Trino: Build a Trino/Delta reader, potentially leveraging the Delta Standalone Reader. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel and we will have bi-weekly meetings on Thursdays. Q3
#338 Delta Rust API: Formally verify S3 multi-writer design using stateright Q4
#339 Delta Rust API: Low level API for creating new delta tables Q3
#545 Nessie / Delta Integration: Build tighter integration between Nessie and Delta to allow for Nessie’s Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #nessie channel and we have bi-weekly meetings on Tuesdays. Q4
lakeFS / Delta Integration: Build tighter integration between lakeFS and Delta to allow for lakeFS’s Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #lakefs channel and we will have bi-weekly meetings soon. Q4
#112 Delta Source for Apache Pulsar: Build a Pulsar/Delta reader, potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack connector-pulsar channel. Q3
#94 Power BI Connector: Fix issue with data sources that do not support streaming of binary files Q3
#103 Power BI Connector: Add inline-documentation to PQ function Q3
#104 PowerBI: Add support for TIMESTAMP AS OF Q4
#36, #116 Update the existing Hive 2 connector ala Delta Standalone Reader to support Hive 3. Q3
#746 Restructure delta.io website: Update delta.io website to allow for community blogs, include top community contributors, updated how-to-contribute guide and place the code-base into GitHub. Q3
#747 Delta Guide: Update the Delta documentation to include a Delta guide. Q4

If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.

melin commented 2 years ago

For the open-source version, would you consider supporting OPTIMIZE ZORDER BY?
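For context on what OPTIMIZE ZORDER BY buys you, here is a minimal pure-Python sketch of the bit-interleaving idea behind Z-ordering (not Delta's actual implementation): sorting rows by the interleaved bits of two columns keeps rows that are close in both dimensions physically close, so per-file min/max statistics can skip files for multi-column predicates.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the bits of two column values into a single Z-value.

    Sorting rows by this Z-value clusters rows that are close in *both*
    dimensions, which is what makes min/max file statistics effective
    for predicates on either column.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions hold x's bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions hold y's bits
    return z

# Sort a small 4x4 grid of (x, y) rows along the Z-order curve.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

After the sort, any contiguous run of rows spans a small rectangle in (x, y) space, so files written from such runs have tight min/max ranges on both columns.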

dennyglee commented 2 years ago

@melin This is a great idea and definitely something we're considering!

melin commented 2 years ago

Support the SHOW PARTITIONS tablename SQL command.

chengat1314 commented 2 years ago

@dennyglee Can we consider adding Delta writer support in Trino (Presto)? The main use case is CTAS (CREATE TABLE AS SELECT); CTAS contributes more than 30% of our Trino (Presto) workload.

nicknezis commented 2 years ago

I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

dennyglee commented 2 years ago

I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

Absolutely! Please ping me via the Delta Users Slack channel and let's find a time to chat on this, eh?! Glad to help see if we can leverage existing work for Apache Heron.

YannByron commented 2 years ago

Merge-On-Read Mode?

YannByron commented 2 years ago

Will an index mechanism be considered? For user-specified columns, build an index to accelerate query/update/delete operations.
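To make the request concrete, here is a toy sketch of one possible index shape: an inverted value-to-files map. The column name and in-memory file layout are hypothetical, and Delta has no such API today.

```python
from collections import defaultdict

def build_index(files, column):
    """Map each distinct value of `column` to the set of files containing it,
    so a point lookup, update, or delete only touches the files listed
    for that value instead of scanning the whole table."""
    index = defaultdict(set)
    for path, rows in files.items():
        for row in rows:
            index[row[column]].add(path)
    return index

# Hypothetical per-file row contents.
files = {
    "part-0.parquet": [{"id": 1}, {"id": 2}],
    "part-1.parquet": [{"id": 2}, {"id": 3}],
}
index = build_index(files, "id")
```

A lookup like `id = 2` now reads only the files in `index[2]`; the trade-off is keeping the index in sync on every write.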

YannByron commented 2 years ago

Any possibility of using Maven to manage the project instead of sbt? ^.^

ericbellet commented 2 years ago

Hi guys, I have a question related to the roadmap for CDF. When will it be published as open source? Thanks in advance.

gauravbrills commented 2 years ago

Are there any plans to open-source FSCK? It's a pain otherwise to repair large tables if you accidentally delete something in S3.

dennyglee commented 2 years ago

Hi @gauravbrills we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. Saying this, while there is certainly value to this, perhaps we can chat on the Delta Users slack about the deletion scenario you are running into. Thanks!

gauravbrills commented 2 years ago

Hi @gauravbrills we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. Saying this, while there is certainly value to this, perhaps we can chat on the Delta Users slack about the deletion scenario you are running into. Thanks!

Sure, thanks, will check there. For now I just deleted that partition and reloaded it.

ashokblend commented 2 years ago

@dennyglee can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap?
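A rough sketch of the idea, with hypothetical in-memory files: collect per-file min/max statistics once at write time, then prune files whose range cannot match a query predicate. This mirrors what the Delta transaction log stores per file, but it is a conceptual model, not an actual Delta API.

```python
def collect_stats(files):
    """Compute per-file (min, max) statistics for each column."""
    return {
        path: {c: (min(r[c] for r in rows), max(r[c] for r in rows))
               for c in rows[0]}
        for path, rows in files.items()
    }

def files_for_range(stats, column, lo, hi):
    """Keep only files whose [min, max] range can overlap [lo, hi];
    every other file is skipped without being read."""
    return [path for path, s in stats.items()
            if not (s[column][1] < lo or s[column][0] > hi)]

# Hypothetical per-file row contents.
files = {
    "part-0.parquet": [{"ts": 1}, {"ts": 5}],
    "part-1.parquet": [{"ts": 10}, {"ts": 20}],
}
stats = collect_stats(files)
```

A query for `ts BETWEEN 0 AND 4` would then scan only `part-0.parquet`.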

dennyglee commented 2 years ago

@dennyglee can we consider stats collection of delta lake files, for dataskipping, as part of this roadmap.

Great callout @ashokblend - we will consider it, though we cannot commit to it yet as we need to prioritize / capacity plan. Consider upvoting @ashokblend's comment so we can better ascertain the asks, eh?!

cosmincatalin commented 2 years ago

With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

dennyglee commented 2 years ago

With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

Hi @cosmincatalin - yes, per #733 we are actively working on this and will update these threads as soon as we determine the timeline for Delta 1.1. HTH!

felipecoxa commented 2 years ago

Do you have any plans/dates for supporting OPTIMIZE ZORDER BY in the open-source version?

dennyglee commented 2 years ago

Hey @felipecoxa - yes, this is something we're definitely considering. Due to the amount of work this would entail, we're still determining the timeline on when we could work on this.

pdonath commented 2 years ago

Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

dennyglee commented 2 years ago

Do you have plans to implement mulit-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works for Databricks version, but as I see, open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. Saying this, could you please provide the context on your scenario so we can get feedback from the community for prioritization purposes? If you prefer, you can also slack me directly on the Delta Users slack. HTH!

pdonath commented 2 years ago

Do you have plans to implement mulit-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works for Databricks version, but as I see, open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. Saying this, could you please provide the context on your scenario so we can get feedback from the community for prioritization purposes. If you prefer, you can also slack me directly on the Delta Users slack. HTH!

Thank you @dennyglee for the answer. I use Spark Structured Streaming with Delta Lake for aggregating events over time. In my case, a single Spark micro-batch should take roughly between 15 seconds and 2 minutes. After a few months of operation, Delta Lake checkpoints have become a bottleneck. The size of a single checkpoint is ~120 MB (after file compaction). When I look into the micro-batch details I can see:

I decreased the Parquet row group size, which makes reading the latest checkpoint faster (it can be better parallelized). Now it takes ~10 seconds (20% of the whole micro-batch is not perfect, but better). However, I'm not able to do anything to improve writing a new checkpoint. Writing multi-part checkpoints would probably help, especially if I were able to somehow control the number of parts.
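For reference, the Delta protocol names multi-part checkpoint files `<version>.checkpoint.<part>.<parts>.parquet`, with the version zero-padded to 20 digits and the 1-based part numbers to 10. Below is a minimal sketch of that naming plus one possible way to split log actions across a caller-chosen number of parts; the round-robin split is an illustrative assumption, not what the spec mandates.

```python
def checkpoint_part_names(version, num_parts):
    """File names for a multi-part checkpoint per the Delta protocol:
    <version %020d>.checkpoint.<part %010d>.<parts %010d>.parquet."""
    return ["%020d.checkpoint.%010d.%010d.parquet" % (version, p, num_parts)
            for p in range(1, num_parts + 1)]

def split_actions(actions, num_parts):
    """Round-robin the log actions across parts so each checkpoint file
    is roughly the same size and can be written/read in parallel."""
    parts = [[] for _ in range(num_parts)]
    for i, action in enumerate(actions):
        parts[i % num_parts].append(action)
    return parts
```

Letting the caller pick `num_parts` is exactly the knob the comment above asks for: more parts means smaller files and more write parallelism, at the cost of more objects to list and open.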

praateekmahajan commented 2 years ago

Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.
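For readers unfamiliar with the log format: each commit is a zero-padded `<version>.json` file under `_delta_log/` containing one JSON action per line. Here is a hedged sketch of serializing custom add/remove actions; the field values such as `size` are placeholders, and a real writer must also write the file atomically and respect the protocol's concurrency rules.

```python
import json
import time

def commit_actions(version, add_paths, remove_paths):
    """Build the file name and newline-delimited JSON body of a Delta
    commit containing add and remove actions (one action per line)."""
    now = int(time.time() * 1000)  # action timestamps are epoch millis
    lines = []
    for p in add_paths:
        lines.append(json.dumps({"add": {
            "path": p, "partitionValues": {}, "size": 0,
            "modificationTime": now, "dataChange": True}}))
    for p in remove_paths:
        lines.append(json.dumps({"remove": {
            "path": p, "deletionTimestamp": now, "dataChange": True}}))
    filename = "%020d.json" % version  # versions are zero-padded to 20 digits
    return filename, "\n".join(lines)
```

Exposing this safely through a public API (rather than hand-writing log files) is the gist of the request above.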

dennyglee commented 2 years ago

Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.

Thanks @praateekmahajan - could you please create an issue in this GitHub repo so that we can discuss this more fully? Thanks!

sa255304 commented 2 years ago

@dennyglee : Could you guys consider adding Spark's dynamic partition overwrite functionality to Delta Lake as well?

It is a very important feature while running backfills for batch jobs; replaceWhere always requires a condition.
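The requested semantics, modeled in a few lines of plain Python (the "part" key and list-of-dicts table are stand-ins for a partitioned Delta table): only the partitions present in the incoming data are replaced, so a backfill needs no replaceWhere condition at all.

```python
def dynamic_partition_overwrite(table, new_data, key="part"):
    """Replace only the partitions that appear in `new_data`;
    all other partitions are left untouched."""
    touched = {row[key] for row in new_data}
    kept = [row for row in table if row[key] not in touched]
    return kept + list(new_data)

table = [{"part": "2021-01", "v": 1}, {"part": "2021-02", "v": 2}]
backfill = [{"part": "2021-02", "v": 9}]
result = dynamic_partition_overwrite(table, backfill)
```

Here only the 2021-02 partition is rewritten; 2021-01 survives unchanged, which is what distinguishes this mode from a full overwrite or a conditional replaceWhere.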

dennyglee commented 2 years ago

Thanks @sa255304 - we will be publishing the proposed 2022 H1 roadmap by the end of the month and will definitely take your request into account. Saying this, I'm curious - would Delta Lake 1.1's arbitrary replaceWhere help for your scenario?

sa255304 commented 2 years ago

@dennyglee : Thanks for considering the request. No, arbitrary replaceWhere doesn't help either, as it still requires me to write a condition.

dennyglee commented 2 years ago

Closing this issue as we can begin discussions in #920 - thanks!