data-derp / small-exercises

[Content] Update Delta Lake Walkthrough #37

Open kelseymok opened 1 year ago

kelseymok commented 1 year ago

There are two new amazing notebooks from Databricks which will fit in very well here. The first one is similar to our demo that already exists, and the second is a new notebook which can be used as bonus material.

NOTE: In this issue, we'll update the OLD delta-lake-walkthrough


OLD CONTENT

Let's make sure that our Delta Lake exercise is working and up to date.

Context: this ticket is a revamp of the old one. We don't know what state the "updated" Delta Lake exercise is in, so we'll first check it, then update it, and then add a new Delta Lake exercise.

This is no longer relevant because the notebook is no longer at this URL.

kelseymok commented 1 year ago

Decide if we want to create this in a new repo. If we change this in the small-exercises repo, let's make sure not to interrupt the currently running Tour.

kelseymok commented 1 year ago

From Syed:

  1. CMD 19 and CMD 23 in our notebook show errors even though the data is available at the S3 location. The latest Databricks notebook shows no errors.
  2. The latest Databricks notebook shows visualizations. Our notebook does not.
  3. The ACID transactions section just shows the logs / describe history output from the loans_delta table, both in our notebook and in the Databricks notebook. It should demo some ACID transactions. If the loans_delta table already does that, we should add notes explaining what is happening in the table.

kelseymok commented 1 year ago

@syed-tw do you have the Databricks notebook that was supposed to be at https://www.databricks.com/notebooks/Demo_Hub-Delta_Lake_Notebook.html?utm_source=youtube&utm_medium=web&utm_campaign=7013f000000cVKYAA2? It redirects to the DBX homepage now.

syed-tw commented 1 year ago

I have the following 4 files in my workspace (compressed in a .zip archive).

Does this work? @kelseymok

Archive 2.zip

kelseymok commented 1 year ago

@syed-tw no, these are our Delta Lake exercises. I'm referring to the Delta Lake demo in the link above. It seems like you did some research already, and since that file no longer exists, I was hoping you had a version that you had already downloaded. If we don't have one, we can't fix this.

kelseymok commented 1 year ago

This has now been updated with new content and new notebooks.

syed-tw commented 1 year ago

Comments / observations while comparing delta-lake-walkthrough (in the small-exercises repo) with 00-Delta-Lake-Introduction:

  1. Delta Lake explanation in CMD 3 in 00-Delta-Lake-Introduction is much more detailed
  2. 00-Delta-Lake-Introduction needs ML Runtime (13.2 ML, Scala 2.12, Spark 3.4.0)
  3. 00-Delta-Lake-Introduction uses a constraint for the quality check, whereas delta-lake-walkthrough adds a new column to check it
  4. Clone Delta Tables in 00-Delta-Lake-Introduction is new

Everything else looks much the same in both notebooks.

kelseymok commented 1 year ago

@syed-tw great to see that there's more content. Let's fold in those points (i-iv) from 00-Delta-Lake-Introduction (the new notebook takes precedence) to our notebook. We'll also need to update the cluster-creation process to use the right runtime (will create a task for that).

kelseymok commented 1 year ago

https://github.com/data-derp/documentation/issues/3 -> task for updating runtime.

syed-tw commented 1 year ago

Updating the list of changes as the Spark ML Runtime is not needed anymore.

  1. Delta Lake explanation in CMD 3 in 00-Delta-Lake-Introduction is much more detailed
  2. 00-Delta-Lake-Introduction uses a constraint for the quality check, whereas delta-lake-walkthrough adds a new column to check it
  3. Clone Delta Tables in 00-Delta-Lake-Introduction is new

syed-tw commented 1 year ago

Integrated the new notebooks (Delta Lake introduction and performance) into data-derp (small-exercises repo).