Mimetis / ProjectY

Project Y is a straightforward Landing Zones automated deployment tool dedicated to data processing.

Synapse Option #7

Mimetis opened this issue 3 years ago (status: open)

Mimetis commented 3 years ago

Idea

Add the option to deploy an engine using Synapse instead of Databricks / ADF.

Today

For now, we only have the option to deploy an engine using Databricks:

[screenshot: engine deployment options, currently showing Databricks as the only choice]

Expectation

Have the same level of integration as with Databricks, but using Synapse.

mariekekortsmit commented 3 years ago

I spent some time looking into a manual deployment of Synapse in this context. Here are a few findings that might be useful for porting the notebooks used in Databricks (main.ipynb and common.ipynb) over to Synapse:

Keyvault connection:

In Databricks/common.ipynb you fetch the secret for the service principal with `client_secret = dbutils.secrets.get(keyvault, "clientsecret")`. In Synapse, you first need to add your Key Vault as a linked service; afterwards, in Synapse/common.ipynb, you can do the same with `client_secret = TokenLibrary.getSecret("kvengzxq4fl", "clientsecret")`.
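For reference, a minimal side-by-side sketch of the two secret lookups. The Key Vault name `kvengzxq4fl` is the one from the example above, and `mssparkutils.credentials.getSecret` is shown as a PySpark-friendly alternative to TokenLibrary (assuming the Key Vault is reachable through the linked service / workspace identity):

```python
# Databricks (common.ipynb): the secret comes from a secret scope
# backed by the Key Vault.
client_secret = dbutils.secrets.get(keyvault, "clientsecret")

# Synapse (common.ipynb): the Key Vault must be registered as a linked
# service first; the secret can then be read with TokenLibrary or,
# equivalently, with mssparkutils ("kvengzxq4fl" is the vault name from
# this example deployment).
from notebookutils import mssparkutils
client_secret = mssparkutils.credentials.getSecret("kvengzxq4fl", "clientsecret")
```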

Connecting to the ADLS:

The following lines in Databricks/common.ipynb should be obsolete in Synapse/common.ipynb, because in the Synapse case you want to use the ADLS account that is the Synapse workspace's default storage.

```python
accountName = engine["storageName"]            # from engine.storageName
accountKey = "dsLake-" + engine["storageName"] # name of the secret holding the account key

# Get the secret value
accountKeyValue = dbutils.secrets.get(keyvault, accountKey)

# Set the token for accessing the input and output paths
spark.conf.set("fs.azure.account.key." + accountName + ".dfs.core.windows.net", accountKeyValue)
```
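By contrast, a minimal sketch of what reading from the workspace default storage in Synapse could look like: no account key is set in the Spark conf, since access to the primary ADLS Gen2 account is resolved through the workspace itself (the container, account, and path below are placeholders):

```python
# Synapse resolves access to the workspace's default ADLS Gen2 account
# through its linked service / managed identity, so data can be addressed
# directly by path (placeholders: <container>, <storageaccount>).
df = spark.read.parquet(
    "abfss://<container>@<storageaccount>.dfs.core.windows.net/input/data.parquet"
)
```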

Running Synapse/common.ipynb from Synapse/main.ipynb

In Databricks/main.ipynb you run the common notebook with `%run "./common"`. According to the documentation (https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks?tabs=preview#notebook-reference), you should be able to use `%run` in Synapse as well; however, the documentation also shows a caveat: [screenshot of a documentation note on %run limitations], which leads me to believe this will not work when the notebook is used in a pipeline (which is desired here).

Running a notebook from another notebook in Synapse does work when you use `mssparkutils.notebook.run("common")`. To check that it ran, you could use the following in Synapse/main.ipynb:

```python
# Returns the value passed to mssparkutils.notebook.exit() in common.ipynb
exitVal = mssparkutils.notebook.run("common")
print(exitVal)
```

When you add something like `mssparkutils.notebook.exit("Execution of common notebook is finished")` to the last cell of Synapse/common.ipynb, you can confirm that the notebook was executed. However, the functions defined in Synapse/common.ipynb are not available from Synapse/main.ipynb afterwards, so it seems we don't get the context of that notebook back; see the sketch below.
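A minimal sketch of the behavior described above (the function name `helper` is hypothetical, used only to illustrate the point):

```python
# --- last cell of Synapse/common.ipynb ---
def helper():
    return 42

mssparkutils.notebook.exit("Execution of common notebook is finished")

# --- in Synapse/main.ipynb ---
exitVal = mssparkutils.notebook.run("common")
print(exitVal)   # prints the exit message, so common did run
helper()         # NameError: definitions from common are not available here
```

Presumably this is because `mssparkutils.notebook.run` executes the referenced notebook as a separate run rather than inlining its cells into the calling notebook the way `%run` does.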

Proposed workaround: