AbsaOSS / pramen

Resilient data pipeline framework running on Apache Spark
Apache License 2.0

Add support of delta table for bookkeeper #475

Closed: lukas-zeman-ABSA closed 2 months ago

lukas-zeman-ABSA commented 3 months ago

Add Delta table support for the bookkeeper. It could be used to maintain the metastore in Databricks.

https://github.com/AbsaOSS/pramen/blob/c0bc31219fdcbfe9398cf6a2f0e414278712ec55/pramen/core/src/main/scala/za/co/absa/pramen/core/bookkeeper/Bookkeeper.scala#L110

yruslan commented 3 months ago

We had such an implementation, actually 😄. It was quite slow, so we removed it. But that was a couple of years ago. Maybe now is a good time to revive it.

yruslan commented 2 months ago

I found the classes for Delta and want to restore them in the next Pramen version. Currently, though, the implementation uses Delta paths, not tables, because it needs several different subpaths to store different kinds of data. Do you want Delta Lake table support added, or is a path fine?

lukas-zeman-ABSA commented 2 months ago

Well, maybe we could make it work on Databricks with just a path, but saveAsTable would be much better. (It would improve speed and also allow us to store this data in Databricks managed tables.)
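
For readers unfamiliar with the distinction discussed here, the following is a minimal sketch of the two ways Spark can write a Delta dataset: to a filesystem path, or as a table registered in the catalog. It is not Pramen code. The object and method names are hypothetical, and it assumes Spark SQL and the Delta Lake library are on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical illustration, not taken from Pramen.
object DeltaWriteSketch {
  // Path-based write: the data goes to a Delta directory identified only by its path.
  def writeToPath(df: DataFrame, path: String): Unit =
    df.write.format("delta").mode(SaveMode.Append).save(path)

  // Table-based write: the data goes to a table registered in the catalog,
  // so it can later be addressed by name (for example "my_db.my_table").
  def writeToTable(df: DataFrame, tableName: String): Unit =
    df.write.format("delta").mode(SaveMode.Append).saveAsTable(tableName)
}
```

With the table-based variant, no location has to be supplied by the caller, which is what makes it possible to keep the data in managed tables.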

yruslan commented 2 months ago

Got it, I will add support for tables.

yruslan commented 2 months ago

I also want to clarify that Pramen is going to use several tables for bookkeeping. So when this is implemented, you will be able to specify the database and table prefix in the Delta table configuration.

Something like:

pramen {
  bookkeeping.enabled = true
  bookkeeping.delta.database = "my_db"
  bookkeeping.delta.table.prefix = "bk_"
}

Let me know if this is okay for you.

lukas-zeman-ABSA commented 2 months ago

I checked the implementation. Yes, this would work totally fine, thanks. Theoretically, "database" here means "catalog.schema", but it will work :)
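
The database-plus-prefix scheme discussed above can be sketched as a small name-building helper. This is only an illustration of the idea, not Pramen's implementation: the object, the method names, and the logical table names ("bookkeeping", "schemas") are all assumptions made up for the example. It also shows why a "catalog.schema" value works in the database setting: the value is simply prepended to the table name.

```scala
// Hypothetical helper, not part of Pramen. It only shows how a database
// setting and a table prefix could be combined into qualified table names.
object BookkeepingTableNames {
  // Logical table names are invented for this sketch.
  val logicalTables: Seq[String] = Seq("bookkeeping", "schemas")

  // Builds "<database>.<prefix><logicalName>", or just "<prefix><logicalName>"
  // when no database is configured.
  def fullTableName(database: String, prefix: String, logicalName: String): String = {
    val table = s"$prefix$logicalName"
    val db = database.trim
    if (db.isEmpty) table else s"$db.$table"
  }

  // Qualified names for every logical bookkeeping table.
  def allTableNames(database: String, prefix: String): Seq[String] =
    logicalTables.map(name => fullTableName(database, prefix, name))
}
```

For example, `BookkeepingTableNames.fullTableName("my_db", "bk_", "bookkeeping")` gives `my_db.bk_bookkeeping`, and passing `"my_catalog.my_schema"` as the database gives `my_catalog.my_schema.bk_bookkeeping`.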