Azure / spark-cdm-connector

MIT License

Does this connector work for open source Spark? #88

Closed RedwanAlkurdi closed 2 years ago

RedwanAlkurdi commented 2 years ago

I am using PySpark 3.2 and used the following code to install the dependencies. However, it doesn't work and I keep getting the error below:

CODE:

```python
spark = (SparkSession.builder
         .appName("NewApp")
         .master("local[3]")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1,com.microsoft.azure:spark-cdm-connector:0.19.0")
         .config("fs.azure.account.auth.type.abfswales1.dfs.core.windows.net", "SharedKey")
         .config("fs.azure.account.key..dfs.core.windows.net", "SharedKeyFromAzurePortal")
         .getOrCreate())
```

```python
readDf = (spark.read.format("com.microsoft.cdm")
          .option("storage", storageAccountName)
          .option("manifestPath", "Dataverse-storage" + "/model.json")
          .option("entity", "account")
          .load())
```

and I am getting the following error:

```
Py4JJavaError: An error occurred while calling o59.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
```

srichetar commented 2 years ago

Hi @RedwanAlkurdi. Are you using the connector on Azure Synapse or Azure Databricks?

RedwanAlkurdi commented 2 years ago

Hi @srichetar. Nope, I am using an on-prem dev environment.

srichetar commented 2 years ago

The code is open source and works with Spark 3. Can you try building it locally and using that jar instead?
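For anyone else hitting this: once you have a locally built jar, you can point a local PySpark session at the file via `spark.jars` instead of pulling the published artifact through `spark.jars.packages`. A minimal sketch; the jar path is hypothetical and depends on what your build produces:

```python
# Sketch: swap the published Maven coordinate for a locally built connector jar.
# The jar path below is a placeholder; use whatever your build actually produced.
def session_configs(local_jar_path):
    """Spark configs that load the local connector jar plus hadoop-azure."""
    return {
        "spark.jars": local_jar_path,  # local file instead of a Maven coordinate
        "spark.jars.packages": "org.apache.hadoop:hadoop-azure:3.3.1",
    }

configs = session_configs("/path/to/spark-cdm-connector.jar")

# Applying them (not run here, needs a Spark installation):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("NewApp").master("local[3]")
# for k, v in configs.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```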

RedwanAlkurdi commented 2 years ago

Sure, I’ll try it out and let you know

RedwanAlkurdi commented 2 years ago

@srichetar

Well, it worked after I built it. However, it still doesn't work in an on-prem dev environment, which is quite strange actually, because read-only access shouldn't depend on managed identities. Moreover, it reads from the snapshots anyway, which will not interfere with any Dataverse dataflows/links.

```
Py4JJavaError: An error occurred while calling o60.load.
: java.lang.Exception: Managed identities only supported on Synapse or Databricks
	at com.microsoft.cdm.utils.CDMOptions.<init>(CDMOptions.scala:49)
	at com.microsoft.cdm.CDMIdentifier.<init>(CDMIdentifier.scala:10)
```

Is there any way to change this behaviour? It seems to be internal to the connector.

srichetar commented 2 years ago

Please use credential-based authentication to work with an on-prem dev environment.
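For reference, here is a sketch of what a credential-based (AAD service principal) read can look like with this connector, avoiding the managed-identity path that is only supported on Synapse/Databricks. The `appId`/`appKey`/`tenantId` option names follow the connector's documented read options; all values below are placeholders:

```python
# Sketch: credential-based (service principal) auth options for com.microsoft.cdm.
# Option names follow the connector docs; every value here is a placeholder.
def cdm_read_options(storage, manifest_path, entity, app_id, app_key, tenant_id):
    """Reader options for a credential-based CDM entity read."""
    return {
        "storage": storage,             # e.g. "<account>.dfs.core.windows.net"
        "manifestPath": manifest_path,  # e.g. "Dataverse-storage/model.json"
        "entity": entity,               # e.g. "account"
        "appId": app_id,                # AAD app registration client id
        "appKey": app_key,              # client secret
        "tenantId": tenant_id,          # AAD tenant id
    }

opts = cdm_read_options("myaccount.dfs.core.windows.net",
                        "Dataverse-storage/model.json",
                        "account", "<app-id>", "<app-key>", "<tenant-id>")

# Applying them (not run here, needs a live cluster and real credentials):
# df = spark.read.format("com.microsoft.cdm").options(**opts).load()
```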

RedwanAlkurdi commented 2 years ago

Yeah, I figured that out; I just forgot to close the issue. Closing it now. Thank you :)