Azure / spark-cdm-connector


Delta lake / lakehouse support #64

Open TissonMathew opened 3 years ago

TissonMathew commented 3 years ago

Any plans to support Delta Lake? The idea: keep the CDM-specific manifests / metadata in ADLS Gen2 and the data itself in Delta. This would also remove a lot of operational burden, including partitioning.

I like the CDM standard schemas and approach, but operationalizing CDM data for interactive queries is costly (e.g. copying data into Cosmos DB, Azure Search, etc.). Delta's compute and storage optimizations make interactive queries cost effective without sacrificing performance, e.g. for Power BI or a React app.

CDM + Delta could be an excellent, cost-effective alternative to Snowflake.

For example, a write that creates the CDM manifest and adds the entity to it, with Delta as the data format and both physical and logical entity definitions:

    df.write.format("com.microsoft.cdm")
      .option("storage", storageAccountName)
      .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
      .option("entity", "TestEntity")
      .option("format", "delta")
      .save()

TissonMathew commented 3 years ago

We are currently in production with Delta and CDM. We had to work around a few things to make it work, but the perf and scale are incredible with Delta Lake (lakehouse architecture, both streaming/real-time and batch). The best of both worlds: CDM adds meaning to the data in Delta.
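
A minimal sketch of one way to pair the two today, assuming PySpark and the connector's documented storage / manifestPath / entity / format options (illustrative only, with placeholder account names and paths; auth options are omitted, and this is not necessarily the exact production setup described above):

    # Write the data twice: a Delta copy for interactive query performance,
    # and a CDM copy so the manifest/metadata live in ADLS Gen2.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/staging/TestEntity")  # placeholder source

    # 1) Delta copy: the one actually served to Power BI / apps.
    (df.write.format("delta")
        .mode("overwrite")
        .save("abfss://data@myaccount.dfs.core.windows.net/delta/TestEntity"))

    # 2) CDM copy: carries the manifest and entity definitions.
    #    The connector writes csv/parquet today; native "delta" is this issue's ask.
    (df.write.format("com.microsoft.cdm")
        .option("storage", "myaccount.dfs.core.windows.net")
        .option("manifestPath", "data/cdm/default.manifest.cdm.json")
        .option("entity", "TestEntity")
        .option("format", "parquet")
        .mode("overwrite")
        .save())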

stevenwilliamsmis commented 3 years ago

@TissonMathew would you mind sharing what you had to do to make this work with delta?

SQLArchitect commented 3 years ago

Please share, @TissonMathew.


bissont commented 3 years ago

Hi @TissonMathew,

Yes, @euangms and I are also interested in your use case. Would you be interested in getting on a call to discuss?

TissonMathew commented 3 years ago

@bissont Sure. Happy to connect!

don4of4 commented 3 years ago

@TissonMathew do you mind describing your approach? Did you write your own connector?

bitsofinfo commented 3 years ago

@TissonMathew please share?

colinrippeyfinarne commented 3 years ago

Hi @TissonMathew, I am currently building out an Azure Synapse PoC and we are planning on using Delta Lake as our primary storage format. I would love to know how you've been able to get the ADLS folders and files containing the Delta Lake tables to "co-exist" with the CDM folders and files (if that is in fact what you've done).

Can you provide any details please?
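
For what it's worth, one hypothetical layout that would let the two co-exist in a single ADLS Gen2 container (paths invented for illustration; the only hard requirement is that the manifest can resolve the partition locations it points at):

    container/
      cdm/
        default.manifest.cdm.json     -- CDM manifest
        TestEntity/
          TestEntity.cdm.json         -- entity (schema) definition
      delta/
        TestEntity/
          _delta_log/                 -- Delta transaction log
          part-00000-....snappy.parquet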

ralphke commented 3 years ago

Would be a great addition to have native CDM support for Delta files. Is anybody actively working on this?

drinkingbird commented 2 years ago

Hi @TissonMathew, I'm wondering what you achieve by writing delta files using the CDM structure. While there is an overlap in features (such as data typing in a data lake, partitioning, history), they are quite different in implementation. The benefit of CDM is its accessibility and numerous readers. Pretty much everything can read CSVs, and CDM lowers the barrier to entry for interacting with the data, even with custom code.

A mixed approach would create additional metadata files to manage and limit readers to only those which are compatible with delta, and the only CDM + delta reader/writer would be this one.

I feel it's quite simple to build a delta lake using CDM as one of the sources, to which you can apply your own business's governance and requirements, as in the sketch below. In this case, your readers would not require this connector.
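
Something like the following, using this connector once on ingest and then serving everything from Delta (a sketch only: account, container, and entity names are placeholders, and auth options such as appId/appKey/tenantId are omitted):

    # Read a CDM entity with this connector, land it as Delta, and let all
    # downstream readers use plain Delta with no CDM-aware connector needed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    cdm_df = (spark.read.format("com.microsoft.cdm")
        .option("storage", "myaccount.dfs.core.windows.net")
        .option("manifestPath", "container/cdm/default.manifest.cdm.json")
        .option("entity", "TestEntity")
        .load())

    (cdm_df.write.format("delta")
        .mode("overwrite")
        .save("abfss://container@myaccount.dfs.core.windows.net/delta/TestEntity"))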

I'm sorry but I'm missing the advantage of this. Can you please elaborate?

ralphke commented 2 years ago

@drinkingbird CDM serves a different purpose than what Delta is designed for. One of the key aspects of CDM is the ability to express relationships between different Delta files, as well as rich descriptions of the context of each file and its columns as they exist within the Delta format. Also, having well-documented industry-specific data models in the Synapse workbench would allow easy editing and maintenance of CDM models. This is possible today with parquet files, but not yet with Delta files.

drinkingbird commented 2 years ago

@ralphke Excellent. Thank you for the response! That clears it up.

don4of4 commented 2 years ago

We hear that Delta is the number one requested feature -- so I know there must be irons in this fire.

That said, Delta support for CDM/SyMS is a must-have for our company's big data plans on Azure Data Services.

Speaking on behalf of Hitachi Solutions: the current limitations around Delta impair serious use of directly exported Dynamics CE / FO data. We are finding that our need to ingest it into Delta (or a dedicated pool) to bring that data onto the same plane as the rest of the enterprise sources materially erodes the value proposition of the feature.

Similarly, we can't seriously consider building models with the 3NF Industry Models without the performance that we currently get from Delta. This is especially true given that the industry models still must be transformed/converted to Kimball/dimensional marts.

bissont commented 2 years ago

I need to investigate, but it looks like there is native support for writing to the Delta format now: https://github.com/delta-io/connectors/issues/85

ralphke commented 2 years ago

I need to investigate, but it looks like there is native support for writing to the Delta format now: delta-io/connectors#85

This looks like the team is in a very early design phase for this connector.

NitinSingh12 commented 2 years ago

We are working on a PoC to integrate Delta Lake as the primary storage layer, but we are looking for options to read the Dynamics 365 data that is stored in ADLS. We want to use Auto Loader functions to create entity tables and store them in Delta Lake.

Please help if you have any resources. Thank you.

NitinSingh12 commented 2 years ago

@TissonMathew Hi, would you mind sharing your approach, please? Thank you.

don4of4 commented 2 years ago

@NitinSingh12 My firm, Hitachi Solutions, has a commercial connector in final development and testing with a very large customer -- a trickle loader for Finance and Operations, and Dataverse. You can reach out to me at dscott@hitachisolutions.com if you are interested.

TissonMathew commented 2 years ago

@NitinSingh12 please contact suresh.velga@skypointcloud.com

NitinSingh12 commented 2 years ago

@TissonMathew @don4of4 Thank you for providing the info. However, my team was able to build a solution and perform CDC and data loads from Dynamics 365 into the Delta format with Auto Loader functions. Thank you.
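
For reference, an Auto Loader pipeline along those lines might look roughly like this (a sketch only: container, paths, entity, and checkpoint locations are placeholders; the real entity schema lives in the exported model.json/manifest, which you would still need to parse or hard-code, since the exported CSVs themselves are headerless):

    # Stream Dynamics 365 CSV exports from ADLS into a Delta table
    # with Databricks Auto Loader (cloudFiles).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    stream = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation",
                "abfss://lake@myaccount.dfs.core.windows.net/_schemas/account")
        .option("header", "false")
        .load("abfss://d365@myaccount.dfs.core.windows.net/account/"))

    (stream.writeStream
        .option("checkpointLocation",
                "abfss://lake@myaccount.dfs.core.windows.net/_checkpoints/account")
        .trigger(availableNow=True)
        .toTable("bronze_account"))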