databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

dbx execute fails for a python wheel task that uses a jar dependency (spark-redis) #836

Open igorgatis opened 11 months ago

igorgatis commented 11 months ago

Expected Behavior

Data should be written to Redis via the spark-redis connector.

Current Behavior

The write fails with a ClassNotFoundException for the spark-redis data source.

Steps to Reproduce (for bugs)

Task code:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("redis-df")
        .config("spark.redis.host", "[redacted]")
        .config("spark.redis.port", "[redacted]")
        .config("spark.redis.auth", "[redacted]")
        .getOrCreate()
    )
    (
        spark.read.table("sometable")
        .write.format("org.apache.spark.sql.redis")
        .option("table", "condo_compound")
        .option("key.column", "rediskey")
        .save()
    )

The deployment.yml file:

environments:
  default:
    workflows:
      - name: "myworkflow"
        tasks:
          - task_key: "maintask"
            libraries:
              - maven:
                coordinates: "com.redislabs:spark-redis_2.12:3.1.0"          
            python_wheel_task:
              package_name: "mylib"
              entry_point: "myentrypoint"
              parameters: []

It fails with the following exception:

Py4JJavaError: An error occurred while calling o412.save.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: org.apache.spark.sql.redis. Please find packages at 
`https://spark.apache.org/third-party-projects.html`.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:892)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:735)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:785)
        at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:960)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:288)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
        at py4j.Gateway.invoke(Gateway.java:306)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.redis.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:721)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:721)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:721)
        ... 16 more
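The `Caused by` line shows how the format string maps to the missing class: when a format name is neither a registered short name nor itself a loadable class, Spark falls back to trying `<name>.DefaultSource`. A rough sketch of that candidate list (a simplification of `DataSource.lookupDataSource` for illustration, not Spark's exact code):

```python
def candidate_classes(fmt: str) -> list[str]:
    """Class names Spark would try, in order, for a data source format
    string (simplified sketch of DataSource.lookupDataSource)."""
    # 1) the format string itself as a fully qualified class name,
    # 2) the fallback tried when that class is missing: <fmt>.DefaultSource
    return [fmt, fmt + ".DefaultSource"]

# For the failing write in this issue, the second candidate is exactly
# the class named in the ClassNotFoundException above:
print(candidate_classes("org.apache.spark.sql.redis"))
```

So the trace confirms the spark-redis jar was simply not on the driver's classpath when the write ran.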
