databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

dbx execute fails for a Python wheel task that uses a jar dependency (spark-redis) #836

Open igorgatis opened 1 year ago

igorgatis commented 1 year ago

Expected Behavior

Data should be written to redis.

Current Behavior

Fails with a `ClassNotFoundException` for the spark-redis data source.

Steps to Reproduce (for bugs)

Task code:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("redis-df")
        .config("spark.redis.host", "[redacted]")
        .config("spark.redis.port", "[redacted]")
        .config("spark.redis.auth", "[redacted]")
        .getOrCreate()
    )
    (
        spark.read.table("sometable")
        .write.format("org.apache.spark.sql.redis")
        .option("table", "condo_compound")
        .option("key.column", "rediskey")
        .save()
    )

The deployment.yml file:

environments:
  default:
    workflows:
      - name: "myworkflow"
        tasks:
          - task_key: "maintask"
            libraries:
              - maven:
                coordinates: "com.redislabs:spark-redis_2.12:3.1.0"          
            python_wheel_task:
              package_name: "mylib"
              entry_point: "myentrypoint"
              parameters: []

It fails with the following exception:

Py4JJavaError: An error occurred while calling o412.save.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: org.apache.spark.sql.redis. Please find packages at 
`https://spark.apache.org/third-party-projects.html`.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:892)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:735)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:785)
        at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:960)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
        at py4j.Gateway.invoke(Gateway.java:306)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.redis.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:721)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:721)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:721)
        ... 16 more
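For context on the trace: Spark's `DataSource.lookupDataSource` tries to load the provider string as a class name and then retries with `.DefaultSource` appended, which is why the root cause names `org.apache.spark.sql.redis.DefaultSource` even though the code asked for `org.apache.spark.sql.redis`. A rough plain-Python sketch of that candidate list (illustration only, not Spark internals verbatim):

```python
def candidate_classes(provider: str) -> list[str]:
    # Mirrors the fallback visible in the stack trace: Spark tries the
    # provider string itself, then provider + ".DefaultSource".
    return [provider, provider + ".DefaultSource"]

print(candidate_classes("org.apache.spark.sql.redis"))
# → ['org.apache.spark.sql.redis', 'org.apache.spark.sql.redis.DefaultSource']
```

Either candidate resolving would have satisfied the lookup; the failure means the spark-redis jar was not on the driver classpath when the write ran, i.e. the Maven library from the deployment file was never actually installed on the cluster executing the task.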

Context

Your Environment