Azure / spark-cdm-connector

MIT License

Spark-cdm-connector 0.19.1 - java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport #74

Closed by islamtg 2 years ago

islamtg commented 3 years ago

I am following this example here and get the following error when I run this portion:

    # Creates the CDM manifest and adds the entity to it with gzip'd parquet partitions
    # with both physical and logical entity definitions
    (df.write.format("com.microsoft.cdm")
      .option("storage", StorageAccount)
      .option("manifestPath", "/powerbi/adlsgen2isleghaz/covid19datasetmlDataset/default.manifest.cdm.json")
      .option("entity", "TestEntity")
      .option("format", "parquet")
      .option("compression", "gzip")
      .save())

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport

Py4JJavaError Traceback (most recent call last)

in
      1 # Creates the CDM manifest and adds the entity to it with gzip'd parquet partitions
      2 # with both physical and logical entity definitions
----> 3 (df.write.format("com.microsoft.cdm")
      4   .option("storage", StorageAccount)
      5   .option("manifestPath", "/powerbi/adlsgen2isleghaz/covid19datasetmlDataset/default.manifest.cdm.json")

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
   1132             self.format(format)
   1133         if path is None:
-> 1134             self._jwrite.save()
   1135         else:
   1136             self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)

islamtg commented 3 years ago

After switching the cluster to Spark 2.4, I still cannot create a new manifest.cdm.json or read an existing manifest.cdm.json. I now get this error:

-1 error code: null error message: InvalidAbfsRestOperationExceptionjava.net.UnknownHostException: https

islamtg commented 3 years ago

I removed the https:// portion and now get this error:

HEAD https://dlacopdemocomm02.dfs.core.windows.net/power-bi-cdm/powerbi-dataflow/WideWorldImporters/model.json?timeout=90

Py4JJavaError Traceback (most recent call last)

in
      2   .option("storage", storageAccountName)
      3   .option("manifestPath", "power-bi-cdm/powerbi-dataflow/WideWorldImporters/model.json")
----> 4   .option("entity", "Sales Customers")
      5   #.option("appId", appid)
      6   #.option("appKey", appkey)

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    170             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    171         else:
--> 172             return self._df(self._jreader.load())
    173
    174     @since(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258

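
For anyone hitting the same `UnknownHostException: https`: the message suggests the value passed to the `storage` option was treated as a host name, so a leading `https://` ends up being resolved as the host. Below is a minimal sketch of a helper that strips the scheme before the value is handed to the connector. The function name `normalize_storage_host` is hypothetical and not part of the connector; it only cleans the string and does not address the later HEAD failure.

```python
def normalize_storage_host(value: str) -> str:
    """Return a bare host name such as 'account.dfs.core.windows.net'.

    Hypothetical helper: assumes the connector's "storage" option expects
    a host name without a URL scheme, as the error in this thread suggests.
    """
    host = value.strip()
    # Drop a leading scheme if one was pasted in along with the host.
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    # Drop a trailing slash or any path that follows the host.
    return host.split("/", 1)[0]
```

The cleaned value can then be passed as `.option("storage", normalize_storage_host(storageAccountName))`.
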
srichetar commented 3 years ago

Hi, you need to give the identity Storage Blob Data Contributor access.

islamtg commented 3 years ago

@srichetar The account already has Storage Blob Data Contributor access.

srichetar commented 3 years ago

Please email asksparkcdm@microsoft.com if you are still facing the issue.

absognety commented 3 years ago

I faced this issue when using spark-cdm-connector 0.19.1 with Databricks Runtime 8.x; the two are incompatible. Switching to Databricks Runtime 6.4 fixed the issue.
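
A quick way to catch this before calling the connector is to check the cluster's Spark version first. The missing class `org/apache/spark/sql/sources/v2/ReadSupport` belongs to the Data Source V2 API of Spark 2.x and is not present in Spark 3.x, which is what Databricks Runtime 8.x ships. The sketch below assumes, based only on this thread, that connector 0.19.1 needs the Spark 2.4 line; the function names are hypothetical.

```python
def connector_supported(spark_version: str) -> bool:
    """Return True if the version string is on the Spark 2.4 line.

    Assumption taken from this thread: spark-cdm-connector 0.19.1 works
    on Spark 2.4 (Databricks Runtime 6.4) and fails on Spark 3.x.
    """
    return spark_version.split(".")[:2] == ["2", "4"]


def require_supported_spark(spark_version: str) -> None:
    """Raise a readable error instead of a late NoClassDefFoundError."""
    if not connector_supported(spark_version):
        raise RuntimeError(
            "spark-cdm-connector 0.19.1 is assumed to need Spark 2.4, "
            "but this cluster runs Spark " + spark_version
        )
```

In a notebook, call `require_supported_spark(spark.version)` before `df.write.format("com.microsoft.cdm")`.
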

srichetar commented 2 years ago

Databricks 6.4 fixed this issue.