delta-io / delta-sharing

An open protocol for secure data sharing
https://delta.io/sharing
Apache License 2.0

Getting Exception in thread "main" org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: deltaSharing. Please find packages at `https://spark.apache.org/third-party-projects.html`. while trying to read table as dataframe from a share. #428

Open mohika-knoldus opened 1 year ago

mohika-knoldus commented 1 year ago

import io.delta.sharing.client
import org.apache.spark.sql.SparkSession

object ReadSharedData extends App {

  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("Read Shared Data")
    .getOrCreate()

  val profilePath = "/home/knoldus/Desktop/Delta Open Sharing/resources/config.share"

  // This works fine and lists all the tables in the share provided by the data provider.
  val sharedFiles = client.DeltaSharingRestClient(profilePath).listAllTables()
  sharedFiles.foreach(println)

  // This read is the one that fails with DATA_SOURCE_NOT_FOUND.
  val popular_products_df = spark.read.format("deltaSharing").load("/home/knoldus/Desktop/Delta Open Sharing/resources/config.share#checkout_data_products.data_products.popular_products_data")
  popular_products_df.show()
}

oliverangelil commented 8 months ago

@mohika-knoldus did you resolve this? I'm having the same issue.

mohika-knoldus commented 7 months ago

No, @oliverangelil.

oliverangelil commented 7 months ago

@mohika-knoldus

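The solution was to install Apache Hadoop. If you add the following config to your Spark session, it will download Hadoop automatically, along with the delta-sharing-spark connector that provides the deltaSharing data source: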

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    # Maven coordinates that Spark resolves and downloads at startup
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1,io.delta:delta-core_2.12:2.2.0,io.delta:delta-sharing-spark_2.12:0.6.2')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

Or you can download it from the website.

Then you can read the table like this:

delta_sharing.load_as_spark(table_url).show()

or like this:

spark.read.format("deltasharing").load(table_url).limit(100)

Alternatively, you can read the table without Hadoop by using delta_sharing.load_as_pandas(table_url, limit=10).
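The snippets above use table_url without defining it. Judging by the load call in the original post, it is the profile file path followed by `#` and `<share>.<schema>.<table>`. A minimal sketch of building it, where the helper name `build_table_url` and the validation rules are assumptions for illustration and not part of the delta_sharing library:

```python
import os


def build_table_url(profile_path, share, schema, table):
    """Return '<profile_path>#<share>.<schema>.<table>'.

    Hypothetical helper, not a delta_sharing API. It rejects empty names
    and names containing '.' or '#', on the assumption that those
    characters act as separators when the URL is parsed.
    """
    for label, value in (("share", share), ("schema", schema), ("table", table)):
        if not value:
            raise ValueError(f"{label} must not be empty")
        if "." in value or "#" in value:
            raise ValueError(f"{label} must not contain '.' or '#': {value!r}")
    # os.fspath accepts both str and pathlib.Path and leaves str untouched
    return f"{os.fspath(profile_path)}#{share}.{schema}.{table}"


if __name__ == "__main__":
    table_url = build_table_url(
        "/home/knoldus/Desktop/Delta Open Sharing/resources/config.share",
        "checkout_data_products",
        "data_products",
        "popular_products_data",
    )
    print(table_url)
```

The resulting string is what would be passed as table_url to delta_sharing.load_as_pandas, delta_sharing.load_as_spark, or spark.read.format("deltasharing").load.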

mohika-knoldus commented 7 months ago

So in the end there is a dependency on either the Python library or Apache Hadoop?

mohika-knoldus commented 7 months ago

Thank you for the solution, @oliverangelil.