delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.18k stars 1.62k forks source link

SymlinkTextInputFormat Manifest Generation for Presto/Athena read support #76

Closed marmbrus closed 4 years ago

marmbrus commented 4 years ago

Delta's use of MVCC causes external readers (presto, hive, etc) to see inconsistent or duplicate data. The SymlinkTextInputFormat allows these systems to read a set of manifest files, each containing a list of file paths, in order to determine which data files to read (rather that listing the files present on the file system).

This issue tracks adding support for generating these manifest files from the Delta transaction log, automatically after each commit.

This feature would give support for reading Delta tables from Presto. Hive would require some addition work on the Hive side, as Hive does not use the file extension to determine the final InputFormat to use to decode the data (and as such interprets the files incorrectly as text).

kkr78 commented 4 years ago

@marmbrus is this feature available in 0.3.0 release? If not, will it be available anytime soon?

kkr78 commented 4 years ago

Let me know if I can be of any help developing any features for this project. is this something need to be ported from data bricks?

millecker commented 4 years ago

@marmbrus Any updates on this issue?

kkr78 commented 4 years ago

@marmbrus if we generate the manifest based on Delta transaction log of the last version, does that works with Athena? My understanding is the last version should have all the files that required to create the manifest. I m banking on this approach until its available in Delta Lake.

marmbrus commented 4 years ago

Yes, if you generate a manifest using the transaction log that should work with Athena.

kkr78 commented 4 years ago

@marmbrus thanks for confirming

tdas commented 4 years ago

We are working on pushing our existing manifest generation code and tests to OSS..

kkr78 commented 4 years ago

Right now, I m developing the code to generate that manifest. I will probably use it temporarily until its available in Delta Lake. please let me know if you have any ETA.

Some of our Delta Lake tables got partition columns. do you know if the manifest generation works w/ partition columns as well?

tdas commented 4 years ago

@kkr78 yeah, our manifest generation will work with partition columns. for partitioned tables, the manifest files are itself partitioned by the same columns, so query (at least in Presto/Athena) will read manifests for only the partitions that it will query.

chethanuk commented 4 years ago

work with Athena.

So, Today we can't query in Presto (PrestoDB or PrestoSQL), right?

When #232 is resolved, presto can query delta lake?

FabioBatSilva commented 4 years ago

For anyone looking for a workaround this is what i'm using for now : https://gist.github.com/FabioBatSilva/d6b168a01cf4ba991d9e77881c5ea1e5

cozos commented 4 years ago

@FabioBatSilva So we can use SymlinkManifestWriter right now to make Presto compatible manifest files? ;)

FabioBatSilva commented 4 years ago

@FabioBatSilva So we can use SymlinkManifestWriter right now to make Presto compatible manifest files? ;)

@cozos Yes, That is what i'm using to get data into AWS Athena. But keep in mind this is something I've put together in a few hours to test out Delta Lake + Athena.

cozos commented 4 years ago

@FabioBatSilva Do you mind telling me about the manifest workflow? Do you generate a new manifest file everytime the Delta Transaction Log is modified?

kkr78 commented 4 years ago

@cozos See the sample python code below that I developed to generate manifest based on Delta Transaction Log. Basically the code traverses through all the JSON files in _delta_log directory, identifies the list of parquet files and writes them to symlink text. The manifest is regenerated when the Transaction Log is modified. This code only works on s3.

Disclaimer: This code is not production-ready, not fully tested, this is just to give an idea of how it can be implemented.

https://github.com/kkr78/delta-util

The Athena table points to _symlink directory where symlink.txt files are located. For partitions, generate a symlink.txt file for each partition.

tdas commented 4 years ago

Hey all, we are open-sourcing our Databricks Delta's manifest generation code in this PR https://github.com/delta-io/delta/pull/250 This PR contains the core functionality to generate the manifest, but not the public APIs yet. Future PRs are going to add Scala, Python and SQL APIs to generate the manifest.

Thank you, everyone, for showing so much interest in Delta Lake and its compatibility with other engines.

seddonm1 commented 4 years ago

@tdas is this PR sufficient scope for a point release?

tdas commented 4 years ago

By "point release", do you mean patch release ... as in 0.4.1? The current plan is to release it in 0.5.0 which is expected to be some time in december.

seddonm1 commented 4 years ago

Hi. Sorry I did not see the 7 Dec milestone for the 0.5.0 point release.

tdas commented 4 years ago

The core manifest generation code for symlink style manifest generation has been merged in https://github.com/delta-io/delta/commit/b18ffba990ad251dcb41b2f6a5684fb855b7c412 We are currently working on the Scala/Python/SQL APIs for the manifest generation.

hudsondba commented 4 years ago

If someone would like to compile the project to test into the AWS Athena, the command to export the Symlink is:

`import org.apache.spark.sql.delta._ import org.apache.spark.sql.delta.hooks.GenerateSymlinkManifest

val deltaLog = DeltaLog.forTable(spark, "s3a://....")

GenerateSymlinkManifest.generateFullManifest(spark, deltaLog)`

Table syntax: CREATE EXTERNAL TABLE.....( ....) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://.../_symlink_format_manifest/

tdas commented 4 years ago

This is the PR for public APIs for manifest generation - https://github.com/delta-io/delta/pull/262

joakibo commented 4 years ago

Great stuff, looking forward to getting this and #262 in 0.5.0!

tdas commented 4 years ago

The public APIs were committed in https://github.com/delta-io/delta/commit/5b3e3ebd68be9be7825e6b2b7018a0731ebba7b0