dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-2110] [Bug] Running Python models on GCP returns 409 Google Cloud Dataproc Agent reports job failure #531

Closed dchess closed 1 year ago

dchess commented 1 year ago

Is this a new bug in dbt-bigquery?

Current Behavior

As of the latest (Feb 10, 2023) release of Dataproc, Python models are failing to execute and throwing "409 Google Cloud Dataproc Agent reports job failure" as an error.

Expected Behavior

Python models should execute without throwing a 409.

Steps To Reproduce

Run dbt run when there is a .py model somewhere in the project's dependency graph.
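For example, a minimal Python model along these lines (file name and contents are illustrative, not taken from the failing project) is enough to exercise the Dataproc submission path:

```python
# models/simple_python_model.py -- hypothetical minimal reproduction
import pandas as pd


def model(dbt, session):
    # Any Python model makes dbt-bigquery submit a PySpark job to Dataproc;
    # submission_method="serverless" routes it to Dataproc Serverless,
    # where the 409 is raised.
    dbt.config(materialized="table", submission_method="serverless")
    return pd.DataFrame({"id": [1, 2, 3]})
```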

Relevant log output

The specific PySpark error is:
> java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider Unable to get public no-arg constructor

Environment

- OS: macOS Monterey 12.5.1
- Python: 3.10.10
- dbt-core: 1.3.2
- dbt-bigquery: 1.3.0

Additional Context

The latest release of Dataproc has shifted from v1.1.2 to v2.0.10. The Spark runtime versions for 2.0 indicate that Scala is now on v2.13.

The python_submissions.py code calls dataproc_v1 from google.cloud, but it doesn't look like that package has a v2 yet. However, I did see this line with a hardcoded jar version:

https://github.com/dbt-labs/dbt-bigquery/blob/34b0a77891f5f470389f45674e31bcf0478792fa/dbt/adapters/bigquery/python_submissions.py#L130

Perhaps this could be updated to the latest Scala 2.13 build: com.google.cloud.spark:spark-bigquery-with-dependencies_2.13:0.28.0
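If the connector jar has to track the runtime, a small helper along these lines (the function and mapping are hypothetical, not adapter code) could pick the connector build that matches the runtime's Scala version instead of hardcoding one:

```python
# Dataproc Serverless runtime 1.1 ships Spark built against Scala 2.12;
# runtime 2.0 ships Spark built against Scala 2.13. The spark-bigquery
# connector publishes a separate artifact per Scala version.
SCALA_BY_RUNTIME = {"1.1": "2.12", "2.0": "2.13"}


def connector_package(runtime_version: str, connector_version: str = "0.28.0") -> str:
    """Return the Maven coordinate of the spark-bigquery connector build
    matching the given Dataproc Serverless runtime."""
    scala = SCALA_BY_RUNTIME[runtime_version]
    return (
        "com.google.cloud.spark:"
        f"spark-bigquery-with-dependencies_{scala}:{connector_version}"
    )
```

Loading a 2.12 build on a 2.13 runtime is exactly what produces the NoClassDefFoundError for scala.Serializable seen in the stack trace.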

Fleid commented 1 year ago

@ChenyuLInx @lostmygithubaccount would you mind having a look at this? Just in terms of triaging the urgency.

colin-rogers-dbt commented 1 year ago

@dchess would you be able to share the full stacktrace?

colin-rogers-dbt commented 1 year ago

Reproduced the exception:

Using the default container image
Waiting for container log creation
PYSPARK_PYTHON=/opt/dataproc/conda/bin/python
JAVA_HOME=/usr/lib/jvm/temurin-17-jdk-amd64
SPARK_EXTRA_CLASSPATH=
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
Traceback (most recent call last):
  File "/tmp/srvls-batch-c6db9d5a-e21a-4ddf-b46b-901f6e579bdf/simple_python_model.py", line 124, in <module>
    .save("dbt-test-env.test16763396294271220665_test_python_model.simple_python_model")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 968, in save
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o94.save.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider Unable to get public no-arg constructor
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
    at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:679)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1240)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1273)
    at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1309)
    at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1393)
    at scala.collection.convert.JavaCollectionWrappers$JIteratorWrapper.hasNext(JavaCollectionWrappers.scala:37)
    at scala.collection.StrictOptimizedIterableOps.filterImpl(StrictOptimizedIterableOps.scala:225)
    at scala.collection.StrictOptimizedIterableOps.filterImpl$(StrictOptimizedIterableOps.scala:222)
    at scala.collection.convert.JavaCollectionWrappers$JIterableWrapper.filterImpl(JavaCollectionWrappers.scala:58)
    at scala.collection.StrictOptimizedIterableOps.filter(StrictOptimizedIterableOps.scala:218)
    at scala.collection.StrictOptimizedIterableOps.filter$(StrictOptimizedIterableOps.scala:218)
    at scala.collection.convert.JavaCollectionWrappers$JIterableWrapper.filter(JavaCollectionWrappers.scala:58)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
    at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:864)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NoClassDefFoundError: scala/Serializable
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getConstructor(Class.java:2271)
    at java.base/java.util.ServiceLoader$1.run(ServiceLoader.java:666)
    at java.base/java.util.ServiceLoader$1.run(ServiceLoader.java:663)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:569)
    at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:674)
    ... 28 more
Caused by: java.lang.ClassNotFoundException: scala.Serializable
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
    ... 46 more

I'm pretty sure this is caused by the change to default to the 2.0 runtime environment. Exploring whether specifying 1.1 addresses this.
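For reference, pinning the runtime on the batch request would look roughly like this. This is a sketch of the JSON-shaped request body the Dataproc Batches API accepts (snake_case proto field names, as used by the Python client), not the adapter's actual code:

```python
def batch_request(main_python_file_uri: str, runtime_version: str = "1.1") -> dict:
    """Build a Dataproc Serverless batch request body that pins the
    runtime version, keeping Spark on the Scala 2.12 toolchain."""
    return {
        "pyspark_batch": {"main_python_file_uri": main_python_file_uri},
        "runtime_config": {"version": runtime_version},
    }
```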

dchess commented 1 year ago

@colin-rogers-dbt Sorry for the delay. The stacktrace you shared matches what we are seeing in our errors as well.

colin-rogers-dbt commented 1 year ago

Automation closed this; keeping it open until the 1.4 / 1.3 patches are out.

nathaniel-may commented 1 year ago

dbt-bigquery 1.4.1 and 1.3.1 have now been released with the fix. @dchess, can you confirm that this fixes the issue for you?

dchess commented 1 year ago

@nathaniel-may I upgraded to the latest release and reran our Python model locally, and it ran successfully. In dbt Cloud, however, the error persists. I am wondering if that is because dbt Cloud is still using v1.2?

nathaniel-may commented 1 year ago

Thanks for confirming that this fixes the issue locally. We will need to make this release available in cloud before it will be fixed there as well. I'll report back here when you can expect the patch to be available.

nathaniel-may commented 1 year ago

@dchess the patch should be available in cloud some time today. Please let us know if you still have any trouble connecting after that.

dchess commented 1 year ago

We saw a successful run in dbt cloud so it looks like this is resolved for now. Thanks so much!

aleloi commented 10 months ago

Hi, I just got this issue when running with the latest docker image. My dockerfile is:

# downgrading to 1.6.6 fixes the issue
FROM ghcr.io/dbt-labs/dbt-bigquery:latest
WORKDIR /usr/src/app
COPY . .
ENV PROJECT_ID=<my-project-id>
CMD ["run", "--profiles-dir", "dbt_profile_dir", "--project-dir", "<my-project-dir>/"]

My dbt profile is

<my-project-name>:
  target: prod
  outputs:
    prod:
      dataset: <my-dataset>
      job_execution_timeout_seconds: 600
      job_retries: 1
      location: europe-west1
      method: oauth
      priority: interactive
      project: "{{ env_var('PROJECT_ID', '') }}"
      threads: 1
      type: bigquery
      gcs_bucket: <my-gcs-bucket>
      dataproc_region: europe-west1
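One possible workaround, assuming a dbt-bigquery version that supports the dataproc_batch profile key, is to pin the serverless runtime in the profile. The field names below follow the Dataproc batch schema; verify support against your adapter version before relying on it:

```yaml
      # appended under the same "prod" output; pins Dataproc Serverless
      # to the 1.1 runtime (Spark built against Scala 2.12)
      dataproc_batch:
        runtime_config:
          version: "1.1"
```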

With this setup, PySpark crashes when running dbt.ref. The JVM stack trace matches the one above in https://github.com/dbt-labs/dbt-bigquery/issues/531#issuecomment-1428994813

Using the default container image
Waiting for container log creation
PYSPARK_PYTHON=/opt/dataproc/conda/bin/python
Generating /home/spark/.pip/pip.conf
Configuring index-url as 
Configuring target as /mnt/dataproc/python/site-packages
JAVA_HOME=/usr/lib/jvm/temurin-17-jdk-amd64
SPARK_EXTRA_CLASSPATH=
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
Initialized pyspark model calculation
Traceback (most recent call last):
  File "/var/dataproc/tmp/srvls-batch-28872ee4-d955-4ff0-bc8d-8361c1a46df0/my_model.py", line 249, in <module>
    df = model(dbt, spark)
  File "/var/dataproc/tmp/srvls-batch-28872ee4-d955-4ff0-bc8d-8361c1a46df0/my_model.py", line 56, in model
    df_pyspark = dbt.ref("my_table")
  File "/var/dataproc/tmp/srvls-batch-28872ee4-d955-4ff0-bc8d-8361c1a46df0/my_model.py", line 239, in <lambda>
    self.ref = lambda *args: ref(*args, dbt_load_df_function=load_df_function)
  File "/var/dataproc/tmp/srvls-batch-28872ee4-d955-4ff0-bc8d-8361c1a46df0/my_model.py", line 207, in ref
    return dbt_load_df_function(refs[key])
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 177, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: com.google.cloud.spark.bigquery.BigQueryRelationProvider Unable to get public no-arg constructor
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:586)
    at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:679)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1240)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1273)
    at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1309)
    at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1393)
    at scala.collection.convert.JavaCollectionWrappers$JIteratorWrapper.hasNext(JavaCollectionWrappers.scala:37)
    at scala.collection.StrictOptimizedIterableOps.filterImpl(StrictOptimizedIterableOps.scala:225)
    at scala.collection.StrictOptimizedIterableOps.filterImpl$(StrictOptimizedIterableOps.scala:222)
    at scala.collection.convert.JavaCollectionWrappers$JIterableWrapper.filterImpl(JavaCollectionWrappers.scala:58)
    at scala.collection.StrictOptimizedIterableOps.filter(StrictOptimizedIterableOps.scala:218)
    at scala.collection.StrictOptimizedIterableOps.filter$(StrictOptimizedIterableOps.scala:218)
    at scala.collection.convert.JavaCollectionWrappers$JIterableWrapper.filter(JavaCollectionWrappers.scala:58)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.NoClassDefFoundError: scala/Serializable
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
    at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
    at java.base/java.lang.Class.getConstructor0(Class.java:3578)
    at java.base/java.lang.Class.getConstructor(Class.java:2271)
    at java.base/java.util.ServiceLoader$1.run(ServiceLoader.java:666)
    at java.base/java.util.ServiceLoader$1.run(ServiceLoader.java:663)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:569)
    at java.base/java.util.ServiceLoader.getConstructor(ServiceLoader.java:674)
    ... 27 more
Caused by: java.lang.ClassNotFoundException: scala.Serializable
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    ... 45 more