Azure / spark-cdm-connector

MIT License

Databricks 10.5/Spark 3.2.1: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport #94

Closed NitinSingh12 closed 2 years ago

NitinSingh12 commented 2 years ago

While running the code below on Databricks (Databricks Runtime Version = 10.5, which includes Apache Spark 3.2.1 and Scala 2.12), the library we installed does not work with the current cluster, although it works with the 6.4 runtime version.

Error: `java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport`
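For context, `ReadSupport` belonged to the Spark 2.x DataSource V2 package `org.apache.spark.sql.sources.v2`, which Spark 3.0 removed when the API was redesigned under `org.apache.spark.sql.connector`. A connector jar built against Spark 2.4 (the Databricks 6.4 runtime) therefore cannot load on Spark 3.2.1. The version mismatch can be sketched with a small illustrative helper (hypothetical, not part of the connector):

```python
# Illustrative only: the installed connector build must target the same Spark
# major.minor line as the runtime. Databricks 6.4 ships Spark 2.4; Databricks
# 10.5 ships Spark 3.2, where org.apache.spark.sql.sources.v2 no longer exists.

def connector_matches_runtime(spark_version, supported_lines):
    """Return True if the runtime's Spark major.minor line is one the
    installed connector build was compiled against."""
    major_minor = ".".join(spark_version.split(".")[:2])
    return major_minor in supported_lines

# A build that only supports the 2.4 line will not match Databricks 10.5:
print(connector_matches_runtime("3.2.1", {"2.4"}))  # False
print(connector_matches_runtime("2.4.5", {"2.4"}))  # True
```

Loading a 2.4-only build on a 3.2 runtime is exactly the situation that surfaces as `NoClassDefFoundError` for the removed interface.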

Write a CDM entity with Parquet data files; the entity definition is derived from the DataFrame schema:

```python
from datetime import datetime
from decimal import Decimal

from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               BooleanType, DoubleType, LongType, DateType,
                               TimestampType, DecimalType)

d = datetime.strptime("2015-03-31", '%Y-%m-%d')
ts = datetime.now()
data = [
    ["a", 1, True, 12.34, 6, d, ts, Decimal(1.4337879), Decimal(999.00), Decimal(18.8)],
    ["b", 1, True, 12.34, 6, d, ts, Decimal(1.4337879), Decimal(999.00), Decimal(18.8)],
]

schema = (StructType()
    .add(StructField("name", StringType(), True))
    .add(StructField("id", IntegerType(), True))
    .add(StructField("flag", BooleanType(), True))
    .add(StructField("salary", DoubleType(), True))
    .add(StructField("phone", LongType(), True))
    .add(StructField("dob", DateType(), True))
    .add(StructField("time", TimestampType(), True))
    .add(StructField("decimal1", DecimalType(15, 3), True))
    .add(StructField("decimal2", DecimalType(38, 7), True))
    .add(StructField("decimal3", DecimalType(5, 2), True))
)

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
```
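An aside on the sample data, unrelated to the reported error: constructing a `Decimal` from a float literal (as in `Decimal(1.4337879)`) captures the float's binary rounding, while a string argument preserves the literal digits exactly. This can matter when values are written into a `DecimalType` column:

```python
from decimal import Decimal

# Decimal(float) inherits the float's inexact binary representation;
# Decimal(str) stores the digits exactly as written.
print(Decimal(1.4337879))    # a long, inexact expansion of the float
print(Decimal("1.4337879"))  # exactly 1.4337879
```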

Create the CDM manifest and add the entity to it with gzip'd Parquet partitions, with both physical and logical entity definitions:

```python
(df.write.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .option("format", "parquet")
    .option("compression", "gzip")
    .save())
```

Append the same DataFrame content to the entity in the default CSV format:

```python
(df.write.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .mode("append")
    .save())
```

```python
readDf = (spark.read.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/implicitTest/default.manifest.cdm.json")
    .option("entity", "TestEntity")
    .load())

readDf.select("*").show()
```

We need your help pointing us to the right library build so that we can create entity tables in Databricks.

kcheeeung commented 2 years ago

See issue #92.