data-derp / small-exercises


Cluster Configuration - specifically for the Delta Lake Optimization notebook to work #43

Closed syed-tw closed 1 year ago

syed-tw commented 1 year ago

Here are the cluster configs I used to run the Delta Lake Optimizations notebook:

[screenshot: cluster configuration]

kelseymok commented 1 year ago

The image issue should be fixed - I had to change the default machines too. @syed-tw can you check if this suits your needs?

kelseymok commented 1 year ago

We had to do a few things to get this to work.

  1. The Delta Lake Optimisation notebook requires access to Unity Catalog (UC), which is only available on Shared and Single User clusters. We originally tried "Shared", which would allow pairing, but we got an error:
    
    py4j.security.Py4JSecurityException: Method public scala.collection.immutable.Map com.databricks.backend.common.rpc.CommandContext.tags() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext

... which is apparently typical for "high-throughput" or shared clusters (as a safety precaution). After a lot of poking around and trying to override `spark.databricks.pyspark.enablePy4JSecurity` to false (which didn't work: the value is static and cannot be changed), we decided to stick with the "Single User" cluster.
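For reference, the override we attempted was the following cluster-level Spark config entry (which, as noted, had no effect because the value is static):

```
spark.databricks.pyspark.enablePy4JSecurity false
```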

As a result, we have created a new Single User policy just for this exercise, and the trainers have made it clear that they are ok with facilitating the switch between the two clusters. New documentation will be created listing the settings required for this exercise.
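As a rough sketch, a Single User policy in the Databricks cluster policy JSON format might pin the access mode and runtime like this. The field names follow the cluster policy API, but the specific values (and the runtime string) are assumptions for illustration, not the actual policy we created:

```json
{
  "data_security_mode": {
    "type": "fixed",
    "value": "SINGLE_USER"
  },
  "spark_version": {
    "type": "fixed",
    "value": "11.3.x-scala2.12"
  }
}
```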

  2. The ML runtime requirement came from a global init script that installs a number of (in our view unnecessary) ML libraries. We commented those out, and we can now re-use our 11.3 LTS runtime.
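To illustrate what that change looks like, here is a hypothetical global init script with the ML installs commented out. The library names are made up for the example; the thread does not list which libraries the real script installed:

```shell
#!/bin/bash
# Hypothetical global init script (library names are assumptions,
# not taken from the actual script).

# Installs like these were what forced the ML runtime; commenting
# them out lets the notebook run on the standard 11.3 LTS runtime.
# pip install mlflow scikit-learn xgboost

msg="init script: ML library installs skipped"
echo "$msg"
```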