databricks-industry-solutions / hls-payer-mrf-sparkstreaming

Spark Structured Streaming for Payer MRF use case

abfss via unity catalog storage credential & external storage support #14

Closed. balbarka closed this issue 1 month ago.

balbarka commented 1 year ago

Unable to use a Unity Catalog external location path. Failing path example:

source_data = "abfss://team-hls-ssa-es@hlsfieldexternal.dfs.core.windows.net/pt/raw/mth=202211/inr/2022_11_01_cigna_health_life_insurance_company_index.json"
source_cp = "abfss://team-hls-ssa-es@hlsfieldexternal.dfs.core.windows.net/pt/raw/mth=202211/inr/_checkpoint"

# Read the MRF JSON file with the custom streaming source from this repo
df = spark.readStream.format("com.databricks.labs.sparkstreaming.jsonmrf.JsonMRFSourceProvider").load(source_data)

# Reset the target table and checkpoint between runs
spark.sql("DROP TABLE IF EXISTS main.pt_stage.inr")
dbutils.fs.rm(source_cp, True)

# Stream into a Unity Catalog Delta table, checkpointing to the external location path
query = df.writeStream.outputMode("append") \
                      .format("delta") \
                      .option("truncate", "false") \
                      .option("checkpointLocation", source_cp) \
                      .table("main.pt_stage.inr")

cluster: https://adb-8590162618558854.14.azuredatabricks.net/?o=8590162618558854#setting/clusters/1021-151431-hs1h9l81/configuration

exception: Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key

zavoraad commented 1 year ago

It looks like going against Azure directly requires additional permission configurations in Spark.

https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage#access-adls-gen2-directly
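
For reference, the direct-access pattern described on that page is to set the storage account's OAuth properties in Spark, roughly as in the sketch below. Placeholder values only; <storage-account>, <application-id>, <directory-id>, and the secret scope/key names are not taken from this issue.

# Sketch of direct ADLS Gen2 access with a service principal (OAuth), per the linked docs.
# All bracketed values and the secret scope/key names are placeholders.
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")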

tadtenacious commented 1 year ago

> It looks like going against Azure directly requires additional permission configurations in Spark.
>
> https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage#access-adls-gen2-directly

This issue appears to be related to Unity Catalog. Per the link above:

> Unity Catalog ignores Spark configuration settings when accessing data managed by external locations.

A lot of teams working in the healthcare space use Unity Catalog to stay compliant with various security requirements and regulations related to PHI. It would be great to be able to process MRF files in the same Databricks workspace as the rest of your protected healthcare data. I anticipate this will come up in the upcoming webinar, Building a Lakehouse for Healthcare: Unlocking Price Transparency.
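
For context on the Unity Catalog route: instead of cluster-level fs.azure.* configs, access to an abfss path is granted through a storage credential plus an external location. A minimal sketch, assuming a storage credential already exists (the names hls_field_cred, hls_field_raw, and the `data engineers` group are hypothetical):

# Register the container path as a UC external location backed by an existing storage credential,
# then grant file access to a group. Object names here are hypothetical examples.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS hls_field_raw
  URL 'abfss://team-hls-ssa-es@hlsfieldexternal.dfs.core.windows.net/pt/raw'
  WITH (STORAGE CREDENTIAL hls_field_cred)
""")

spark.sql("GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION hls_field_raw TO `data engineers`")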

zavoraad commented 1 year ago

Hi @tadtenacious, I agree with your assessment that this is related to infrastructure setup with Databricks and Unity Catalog. Having multiple environments available, we have the code working fine against dbfs, s3a, abfss...

The plan for the workshop is to focus on the specific technical and functional challenges related to price transparency.

We'll revisit this right after the workshop to round out the issue and provide a resolution in case other folks run into it. Stay tuned!

zavoraad commented 1 year ago

Doing some further research with UC + external locations, it appears that structured streaming is supported on single user clusters in DBR 11.3 LTS.

I will tag this as needing a version upgrade to Spark 3.3.0 for DBR 11.3 LTS.
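
As a quick sanity check on such a cluster, listing the path through UC should succeed without any fs.azure.* settings once the external location grant is in place (a hypothetical check, reusing the path from the original report):

# Hypothetical sanity check on a DBR 11.3 LTS single user cluster:
# with the external location granted, this listing works with no
# fs.azure.account.* configuration on the cluster.
display(spark.sql("SHOW EXTERNAL LOCATIONS"))
dbutils.fs.ls("abfss://team-hls-ssa-es@hlsfieldexternal.dfs.core.windows.net/pt/raw/mth=202211/inr/")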

zavoraad commented 1 month ago

UC does not seem to support custom Spark streaming sources.