delta-io / delta-examples

Delta Lake examples
Apache License 2.0

A draft to make Dominique's parser more generic #2

Closed danp11 closed 4 years ago

danp11 commented 4 years ago

Hi Matthew

I read some examples about Medellin in your book :-) I guess you have lived there? I also did in 2003-2004. I went through there on my motorbike and loved the people, the city and the country so much that I stayed there almost a year. When I get close to my retirement in 25 years or so I'm moving back :-)

Being a newbie in Scala/Spark, I find it a bit hard to know how to organize the code. I've taken almost all the code from Dominique's parser and added some code that hopefully can make it even more generic.

In this first PR I just want to ask if you have some time to spare for a quick look, to see if there is something I can work on further to educate myself. It is no problem if you don't have the time or simply think it doesn't bring any value to the example repo haha. My hope, however, is that we can get some code that is easy to follow and maintain, and above all it should be easy to plug in new "event types". In this example one can easily plug in, for example, an "Order handler" whose data could come from a source other than a file.
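To make the "plug in new event types" idea concrete, here is a minimal sketch of a handler registry. It is plain Python rather than Scala/Spark so that it runs on its own, and every name in it (`Event`, `Pipeline`, `register`, `order_handler`) is hypothetical and not taken from the PR code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    """Hypothetical record type: one parsed event, whatever its source."""
    event_type: str
    payload: dict


# A handler turns raw input records into Event objects.
Handler = Callable[[Iterable[dict]], List[Event]]


class Pipeline:
    """Keeps a registry of handlers keyed by event type (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, event_type: str, handler: Handler) -> None:
        # Adding a new event type is one call; the pipeline itself is unchanged.
        self._handlers[event_type] = handler

    def run(self, event_type: str, records: Iterable[dict]) -> List[Event]:
        if event_type not in self._handlers:
            raise KeyError(f"no handler registered for {event_type!r}")
        return self._handlers[event_type](records)


def order_handler(records: Iterable[dict]) -> List[Event]:
    # Hypothetical "Order handler": the records could come from a file,
    # a queue, or an API; the pipeline only sees an iterable of dicts.
    return [
        Event("order", {"order_id": r["id"], "amount": float(r["amount"])})
        for r in records
    ]


pipeline = Pipeline()
pipeline.register("order", order_handler)
events = pipeline.run("order", [{"id": "A1", "amount": "9.50"}])
```

The point of the registry is that a new source only needs its own handler function and one `register` call, so the existing handlers stay untouched.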

There are a lot of tests missing etc., but I just want to get a first opinion from you.
With the current code there is no need for a "bronze table". I might be missing something, but it feels a little overkill if you have all the data close to you and in known locations?

Hopefully the code is easy to follow, and any input from you on what to change or how to better structure it would be very much appreciated. But no worries if you can't!

Take care,

/Dan

danp11 commented 4 years ago

Hi again, doing some weekend coding and have made quite a lot of changes. I'll close this PR and open a new one once I have the code more structured. /Dan

MrPowers commented 4 years ago

@danp11 - Sounds great, I'll be looking out for the new pull request. Looking forward to checking out the code!

MrPowers commented 4 years ago

@danp11 - just took a look at the code and looks like you're off to a great start!!

danp11 commented 4 years ago

Hi again

I have made a first "best effort" :-) at a generic Delta Lake parser. Let me know if you think it could be something of interest for others and I'll send a PR to your Delta Lake examples repo. Enjoy your weekend! /Dan

https://github.com/danp11/spark-delta-pipeline

MrPowers commented 4 years ago

@danp11 - Yea, this would make a great pull request. Hopefully we can collaborate on a blog post after working out all the code details!

It's probably better to use compactFiles instead of compressFiles. Compress typically refers to the file compression format (e.g. snappy), whereas compaction means merging many small files into fewer larger ones. We'll be able to add some more tests once the code is in the delta-examples repo. We'll also be able to add some additional examples for some of the other advanced stuff Dominique covered in his talk (I still don't understand a lot of that yet).
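A small sketch of why the two names mean different things. This is plain Python on local text files, not Delta Lake or Spark code, and the function names `compact_files` and `compress_file` are made up for illustration: compaction reduces the number of files, while compression re-encodes a file and leaves the count unchanged.

```python
import gzip
import tempfile
from pathlib import Path


def compact_files(src_dir: Path, dest: Path) -> int:
    """Merge every *.txt part file in src_dir into one file; return how many were merged."""
    parts = sorted(src_dir.glob("*.txt"))
    with dest.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return len(parts)


def compress_file(src: Path, dest: Path) -> None:
    """Re-encode a single file with gzip; the number of files does not change."""
    dest.write_bytes(gzip.compress(src.read_bytes()))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    parts_dir = root / "parts"
    parts_dir.mkdir()

    # Three small "part" files, standing in for many small data files.
    for i in range(3):
        (parts_dir / f"part-{i}.txt").write_bytes(f"row {i}\n".encode())

    merged = root / "merged.txt"
    merged_count = compact_files(parts_dir, merged)  # 3 files become 1
    merged_bytes = merged.read_bytes()

    compressed = root / "merged.txt.gz"
    compress_file(merged, compressed)                # 1 file stays 1 file
    roundtrip_bytes = gzip.decompress(compressed.read_bytes())
```

In a real table the same distinction applies: a method that rewrites many small files into fewer larger ones is compacting, regardless of which compression codec those files use.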

Keep up the great work!

danp11 commented 4 years ago

@MrPowers

OK, nice. I'll be adding more tests and will go through the code in more detail to see if there is more that can be abstracted away. In a week or two I should have it ready for a PR.