Closed: lorenh516 closed this issue 1 month ago.
Are you running this on Dataproc serverless? What is the runtime version?
Yes, running on Dataproc serverless with the default runtime version. Based on the gcloud documentation, it looks like Spark runtime 2.2 LTS. Update: the failed batch details subpage lists Version: 2.2.25.
Also noting that the error persisted when I tried using the .jar for the latest connector version: https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.40.0.jar
If you are running on Dataproc serverless, there's no need to set the connector in the spark.jars property, as it is built into the image. You can use the latest connector by setting --properties dataproc.sparkBqConnector.uri=gs://spark-lib/bigquery/spark-3.5-bigquery-0.41.0.jar on batch creation.
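For example, on batch creation (the script name and region below are placeholders):

```sh
gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1 \
    --properties=dataproc.sparkBqConnector.uri=gs://spark-lib/bigquery/spark-3.5-bigquery-0.41.0.jar
```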
Thanks! I've switched to doing that, though it didn't resolve my issue.
Can you please share the full stack trace from the 0.41.0 connector?
I've made several changes to my code since last week and have since been able to write the joined table to BQ; I have not been able to reproduce the error. I'm going to close this issue for now and will open a new one if I run into it again.
I've recently been running into an issue when I try to write a PySpark df to an existing, partitioned BigQuery table via Dataproc. I'm getting an internal error from Spark related to a `java.lang.NullPointerException`. The error only seems to occur when I am writing a ~~table~~ df that is the result of a join. I've verified that I don't have empty values in the columns that are required in the BQ table. How do I get around this error to write the table to BigQuery?
This is the failing call:
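A minimal sketch of that write (the project, dataset, table, and staging bucket names here are placeholders, not the real values):

```python
# Write the joined DataFrame to the existing, partitioned BigQuery table.
# All names below are placeholders for the actual values.
(
    joined_df.write.format("bigquery")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("append")
    .save("my-project.my_dataset.my_table")
)
```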
I initialized the spark session with:
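Roughly along these lines, assuming the connector jar referenced earlier is passed via spark.jars and using a placeholder app name:

```python
from pyspark.sql import SparkSession

# App name is a placeholder; the jar is the connector build referenced above.
spark = (
    SparkSession.builder
    .appName("bq-write-job")
    .config(
        "spark.jars",
        "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.40.0.jar",
    )
    .getOrCreate()
)
```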
Full stack trace below: