GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Support for spark version 3.5.0 - support required for UNTYPED_SCALA_UDF version #1297

Open VIKCT001 opened 1 month ago

VIKCT001 commented 1 month ago

With Spark 3.4 and Scala 2.12.16 on Dataproc image 2, we were able to run our jobs by setting the following property: --properties=spark.sql.legacy.allowUntypedScalaUDF=true

But since we migrated to Spark 3.5 and Scala 2.12.18 on Dataproc image 2, we are getting the error message below.

"exception": "AnalysisException: [UNTYPED_SCALA_UDF] You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. udf((x: Int) => x, IntegerType), the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. udf((x: Int) => x).
2. use Java UDF APIs, e.g. udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType), if input types are all non primitive.
3. set "spark.sql.legacy.allowUntypedScalaUDF" to "true" and use this API with caution."

Is this property deprecated in Spark 3.5?

davidrabinowitz commented 1 month ago

Is it related to the Spark BigQuery connector? I fail to see the relation. For general Dataproc support, please see https://cloud.google.com/dataproc/docs/support/getting-support

VIKCT001 commented 1 month ago

Yes, similar code was working fine with Spark 3.3 and Scala 2.12.16 on Dataproc image 2.0. We only needed to specify spark.sql.legacy.allowUntypedScalaUDF=true when submitting the Spark job to the Dataproc cluster with the old BigQuery connector.

Since then, we have migrated to image 2.2 with Scala 2.12.18 and Spark 3.5 (the versions compatible with the new Dataproc image), and we are facing this issue.
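
For reference, option 1 from the error message (a typed Scala UDF) might look like the sketch below. It is not taken from the job in this issue: the object name, the column names, the sample rows, and the local master setting are all assumptions made for illustration. It also prints the value the session sees for spark.sql.legacy.allowUntypedScalaUDF, which may help confirm whether the property passed at submit time actually reached the job. It does not settle whether Spark 3.5 still honors that flag.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical example object; names and data are illustrative only.
object TypedUdfSketch {
  // Plain Scala logic. Returning Option[Int] lets a null input become a
  // null output instead of a default value such as 0.
  def safeLength(s: String): Option[Int] = Option(s).map(_.length)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-udf-sketch")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()
    import spark.implicits._

    // Show the value of the legacy flag as seen by this session
    // ("false" is only the fallback used if the key is not set).
    println(spark.conf.get("spark.sql.legacy.allowUntypedScalaUDF", "false"))

    // Untyped form that raises [UNTYPED_SCALA_UDF] when the flag is not in effect:
    //   udf((s: String) => s.length, IntegerType)

    // Typed form: input and return types are inferred from the closure,
    // so no return-type argument is passed.
    val nameLength = udf((s: String) => safeLength(s))

    val df = Seq((1, "spark"), (2, null)).toDF("id", "name")
    df.withColumn("name_len", nameLength(col("name"))).show()

    spark.stop()
  }
}
```

Because the typed form carries its own type information, it should not depend on the legacy flag at all, which may make it the more durable fix across Spark upgrades.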