PicnicSupermarket / diepvries

The Picnic Data Vault framework.
https://diepvries.picnic.tech
MIT License
126 stars · 15 forks

Possible to extend to data vault 2.0 with Spark and/or BigQuery as target? #31

Open dkapitan opened 2 years ago

dkapitan commented 2 years ago

Seems like I am the first one to post an issue here :smile:. I am very interested in diepvries and am looking into it for one of my Dutch clients. We are considering the Data Vault 2.0 pattern for their data lake.

As described in this article, there are a couple of issues that need solving before Data Vault can be used on data lake technologies like Databricks/Delta Lake or Google BigQuery. The most fundamental one is limited or non-existent support for constraints and foreign keys.

The article proposes using deterministic hash keys, which I can see working. So my question is: do you think it is doable to extend diepvries so that it can generate DDL for the target schema of the Raw Vault (that's our current scope, so not full DV 2.0) for different target engines, most notably Databricks/Delta Lake and BigQuery?
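For context, the deterministic hash key idea can be sketched in a few lines of Python. This is only an illustration of the common Data Vault convention (MD5 over normalized, delimiter-joined business keys), not diepvries's actual implementation; the function and column names are made up:

```python
import hashlib


def hash_key(*business_keys: str, delimiter: str = "|") -> str:
    """Deterministic hash key from business key parts.

    Hypothetical helper: normalizes each part (trim + uppercase),
    joins them with a delimiter, and takes the MD5 hex digest.
    """
    normalized = delimiter.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# The same business key always yields the same hash, so hubs, links, and
# satellites can be joined on the key without database-enforced foreign keys.
k1 = hash_key("customer-42")
k2 = hash_key("  Customer-42 ")
assert k1 == k2  # normalization makes the key deterministic
```

Because the key is computed purely from the business key, loads stay idempotent and referential integrity becomes a matter of convention rather than engine-enforced constraints, which is exactly what engines like BigQuery and Delta Lake need.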

Any suggestions are welcome, and if this is a workable solution we would be willing and able to contribute to diepvries.


matthieucan commented 2 years ago

Hey @dkapitan, thanks for your interest in diepvries!

Correct me if I'm wrong, but what you'd like is the ability for diepvries to generate SQL loading statements for dialects other than Snowflake, e.g. BigQuery and Databricks? Do you also have other architectural changes in mind, e.g. structures in the raw vault that can't currently be loaded by diepvries?

To keep you up to date with our vision for the future of this framework: we are currently experimenting with SQLAlchemy for all generated SQL statements, in order to support other SQL dialects. SQLAlchemy doesn't automatically solve every problem, as some SQL statements differ between dialects and are not always fully supported (e.g. MERGE statements), but we believe it would be a good first step in that direction. Once we're there, generating DDL statements (on top of the DML statements for loading data) for target structures should be possible, as SQLAlchemy allows that.
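The dialect divergence mentioned here (e.g. MERGE support) is the crux of multi-engine loading. A minimal pure-Python sketch of what per-dialect statement generation could look like; all table/column names are hypothetical and the SQL is illustrative of the pattern, not the exact syntax diepvries or any engine uses (SQLAlchemy would handle this via its dialect system rather than string templates):

```python
def loading_statement(dialect: str, target: str, staging: str, key: str) -> str:
    """Return an idempotent loading statement for `target`, adapted to `dialect`.

    Illustrative sketch only: real generators must also handle column lists,
    quoting, and each engine's exact MERGE grammar.
    """
    if dialect in ("snowflake", "bigquery"):
        # Dialects with native MERGE: insert only keys not yet present.
        return (
            f"MERGE INTO {target} t USING {staging} s ON t.{key} = s.{key} "
            f"WHEN NOT MATCHED THEN INSERT VALUES (s.*)"
        )
    # Fallback for engines without MERGE: anti-join keeps the load idempotent.
    return (
        f"INSERT INTO {target} SELECT s.* FROM {staging} s "
        f"LEFT JOIN {target} t ON t.{key} = s.{key} WHERE t.{key} IS NULL"
    )


stmt = loading_statement(
    "bigquery", "hub_customer", "stg_customer", "customer_hashkey"
)
```

The design point is that the loading *semantics* (insert only unseen hash keys) stay constant across engines; only the surface syntax changes, which is what an abstraction layer like SQLAlchemy is meant to absorb.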

dkapitan commented 2 years ago

Hi @matthieucan, thanks for your response. Regarding your questions:

> Correct me if I'm wrong, but what you'd like is the ability for diepvries to generate SQL loading statements for dialects other than Snowflake, e.g. BigQuery and Databricks?

That is indeed what I am looking for.

> Do you also have other architectural changes in mind, e.g. some structures in the raw vault that can't be loaded by diepvries?

For now, I just want to follow along with diepvries's existing architecture. As far as I understand, it follows the current Data Vault 2.0 specification. Once we have a generic SQL back-end interface, we can start thinking about other features.

Going with SQLAlchemy makes perfect sense to me. Please let me know if there is any way I can help. For example, generating an overview of the key differences between dialects seems useful for refining our needs. I have worked a lot with BigQuery and am getting into Apache Spark more and more, so I could help with those.

matthieucan commented 2 years ago

> Correct me if I'm wrong, but what you'd like is the ability for diepvries to generate SQL loading statements for dialects other than Snowflake, e.g. BigQuery and Databricks?

> That is indeed what I am looking for.

Understood!

> Going with SQLAlchemy makes perfect sense to me. Please let me know if there is any way I can help. For example, generating an overview of the key differences between dialects seems useful for refining our needs. I have worked a lot with BigQuery and am getting into Apache Spark more and more, so I could help with those.

While we can't commit to any timeline for this implementation, you may find inspiration in this file from a proof-of-concept branch: https://github.com/PicnicSupermarket/diepvries/blob/matthieucan/sqlalchemy/src/diepvries/test_sqlalchemy.py We intend to implement this for Snowflake first. I think playing with those queries to estimate the effort of making them compatible with BigQuery and Spark would be very valuable!

isoctcolo commented 3 months ago

Hello all, one question: how is the porting to SQLAlchemy going? I wanted to test it with a PostgreSQL database.

matthieucan commented 3 months ago

> Hello all, one question: how is the porting to SQLAlchemy going? I wanted to test it with a PostgreSQL database.

Hi, unfortunately this has not been implemented.