Closed marmbrus closed 4 years ago
@marmbrus is this feature available in 0.3.0 release? If not, will it be available anytime soon?
Let me know if I can be of any help developing any features for this project. Is this something that needs to be ported from Databricks?
@marmbrus Any updates on this issue?
@marmbrus if we generate the manifest based on the Delta transaction log of the last version, does that work with Athena? My understanding is that the last version should have all the files required to create the manifest. I'm banking on this approach until it's available in Delta Lake.
Yes, if you generate a manifest using the transaction log that should work with Athena.
@marmbrus thanks for confirming
We are working on pushing our existing manifest generation code and tests to OSS.
Right now, I'm developing the code to generate that manifest. I will probably use it temporarily until it's available in Delta Lake. Please let me know if you have an ETA.
Some of our Delta Lake tables have partition columns. Do you know if the manifest generation works with partition columns as well?
@kkr78 Yeah, our manifest generation will work with partition columns. For partitioned tables, the manifest files are themselves partitioned by the same columns, so a query (at least in Presto/Athena) will read manifests only for the partitions it queries.
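To illustrate the partitioned layout (a sketch, not the actual Delta implementation — the helper name and paths are hypothetical): the manifest directory mirrors the table's Hive-style partition directories, so each partition gets its own manifest file and engines can prune manifests the same way they prune data:

```python
import posixpath
from collections import defaultdict

def group_by_partition(data_file_paths, table_root):
    """Group data file paths by their Hive-style partition directory
    (e.g. 'date=2019-01-01') relative to the table root. Each group
    would become one per-partition manifest file."""
    manifests = defaultdict(list)
    for path in data_file_paths:
        rel = posixpath.relpath(path, table_root)
        partition_dir = posixpath.dirname(rel)  # '' for unpartitioned tables
        manifests[partition_dir].append(path)
    return dict(manifests)

files = [
    "s3://bucket/table/date=2019-01-01/part-0000.parquet",
    "s3://bucket/table/date=2019-01-02/part-0001.parquet",
]
grouped = group_by_partition(files, "s3://bucket/table")
# Each key maps to a manifest path such as
# s3://bucket/table/_symlink_format_manifest/date=2019-01-01/manifest
```

A query filtering on `date` then only opens the manifest under the matching partition directory, never touching the others.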
So today we can't query in Presto (PrestoDB or PrestoSQL), right?
When #232 is resolved, will Presto be able to query Delta Lake?
For anyone looking for a workaround, this is what I'm using for now: https://gist.github.com/FabioBatSilva/d6b168a01cf4ba991d9e77881c5ea1e5
@FabioBatSilva So we can use `SymlinkManifestWriter` right now to make Presto-compatible manifest files? ;)
@cozos Yes, that is what I'm using to get data into AWS Athena. But keep in mind this is something I've put together in a few hours to test out Delta Lake + Athena.
@FabioBatSilva Do you mind telling me about the manifest workflow? Do you generate a new manifest file every time the Delta transaction log is modified?
@cozos See the sample Python code below that I developed to generate the manifest based on the Delta transaction log. Basically, the code traverses all the JSON files in the _delta_log directory, identifies the list of parquet files, and writes them to a symlink text file. The manifest is regenerated whenever the transaction log is modified. This code only works on S3.
Disclaimer: This code is not production-ready, not fully tested, this is just to give an idea of how it can be implemented.
https://github.com/kkr78/delta-util
The Athena table points to the _symlink directory where the symlink.txt files are located. For partitioned tables, generate a symlink.txt file for each partition.
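A minimal sketch of that workflow (using local filesystem paths rather than S3, and ignoring checkpoint files, which a production implementation would also have to read — the function names and the `_symlink` directory are just illustrative):

```python
import glob
import json
import os

def live_files_from_delta_log(table_path):
    """Replay the Delta transaction log: each *.json commit in
    _delta_log holds one JSON action per line. The live file set
    is the 'add' paths minus any later 'remove' paths."""
    live = set()
    log_files = sorted(glob.glob(os.path.join(table_path, "_delta_log", "*.json")))
    for log_file in log_files:
        with open(log_file) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

def write_symlink_manifest(table_path, manifest_dir="_symlink"):
    """Write the live parquet paths to symlink.txt, one absolute
    path per line, for SymlinkTextInputFormat readers like Athena."""
    out_dir = os.path.join(table_path, manifest_dir)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "symlink.txt"), "w") as f:
        for rel_path in sorted(live_files_from_delta_log(table_path)):
            f.write(os.path.join(table_path, rel_path) + "\n")
```

Rerunning `write_symlink_manifest` after every commit keeps the manifest in sync with the transaction log, which is the workflow described above.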
Hey all, we are open-sourcing Databricks Delta's manifest generation code in this PR: https://github.com/delta-io/delta/pull/250 This PR contains the core functionality to generate the manifest, but not the public APIs yet. Future PRs will add Scala, Python, and SQL APIs to generate the manifest.
Thank you, everyone, for showing so much interest in Delta Lake and its compatibility with other engines.
@tdas is this PR sufficient scope for a point release?
By "point release", do you mean a patch release, as in 0.4.1? The current plan is to release it in 0.5.0, which is expected some time in December.
Hi. Sorry I did not see the 7 Dec milestone for the 0.5.0 point release.
The core manifest generation code for symlink-style manifests has been merged in https://github.com/delta-io/delta/commit/b18ffba990ad251dcb41b2f6a5684fb855b7c412 We are currently working on the Scala/Python/SQL APIs for manifest generation.
If someone would like to compile the project to test it with AWS Athena, the commands to generate the symlink manifest are:
```scala
import org.apache.spark.sql.delta._
import org.apache.spark.sql.delta.hooks.GenerateSymlinkManifest

val deltaLog = DeltaLog.forTable(spark, "s3a://....")
GenerateSymlinkManifest.generateFullManifest(spark, deltaLog)
```
Table syntax:
```sql
CREATE EXTERNAL TABLE .....( ....)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://.../_symlink_format_manifest/'
```
This is the PR for public APIs for manifest generation - https://github.com/delta-io/delta/pull/262
Great stuff, looking forward to getting this and #262 in 0.5.0!
The public APIs were committed in https://github.com/delta-io/delta/commit/5b3e3ebd68be9be7825e6b2b7018a0731ebba7b0
Delta's use of MVCC causes external readers (Presto, Hive, etc.) to see inconsistent or duplicate data. The `SymlinkTextInputFormat` allows these systems to read a set of manifest files, each containing a list of file paths, in order to determine which data files to read (rather than listing the files present on the file system). This issue tracks adding support for generating these manifest files from the Delta transaction log, automatically after each commit.

This feature would give support for reading Delta tables from Presto. Hive would require some additional work on the Hive side, as Hive does not use the file extension to determine the final `InputFormat` to use to decode the data (and as such interprets the files incorrectly as text).
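To make the duplicate-data problem concrete (a hypothetical example with made-up file names, not real Delta output): after a compaction commit, the superseded small files and the new compacted file coexist on storage until they are vacuumed. A reader that lists the directory sees all of them; a manifest-driven reader sees only the committed snapshot:

```python
# Files physically present after a compaction (old files not yet vacuumed):
listed_files = {
    "part-0000.parquet",  # superseded by compaction
    "part-0001.parquet",  # superseded by compaction
    "part-0002.parquet",  # new compacted file
}

# The manifest, generated from the transaction log, pins the committed snapshot:
manifest = {"part-0002.parquet"}

# A listing-based reader would scan the superseded files too,
# returning the same rows twice; a SymlinkTextInputFormat reader
# only opens the paths listed in the manifest.
duplicated = listed_files - manifest
```

Here `duplicated` is exactly the set of files that would cause a listing-based reader to double-count rows, which is why the manifest must be regenerated from the log after each commit.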