GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Disable Schema Changes #1244

Open gmiano opened 3 weeks ago

gmiano commented 3 weeks ago

Hi,

I am using the BigQuery connector on a Dataproc cluster to create a DataFrame and write it to a BigQuery table. The table is created via Terraform with a hardcoded schema in which all fields are set to REQUIRED.

This table is refreshed every day, and the results are overwritten. The code used to write to this table is as follows:

data.write.format("bigquery") \
    .mode("overwrite") \
    .option("table", table_fullname) \
    .option("createDisposition", "CREATE_NEVER") \
    .option("allowFieldAddition", "false") \
    .option("allowFieldRelaxation", "false") \
    .option("temporaryGcsBucket", self._config.get_config("temp_gcs_bucket")) \
    .save()

Unfortunately, this configuration does not work as expected: when the table is written, its schema is updated and many fields switch from REQUIRED to NULLABLE. Since the connector version wasn't specified explicitly, it should be the cluster's default, spark-3.5-bigquery-0.39.0.jar.
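For reference, the mode flip can be confirmed with the BigQuery client library after a write (a minimal sketch; the fully qualified table name below is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
# Placeholder table reference; substitute the actual project.dataset.table
table = client.get_table("my-project.my_dataset.my_table")
for field in table.schema:
    print(field.name, field.field_type, field.mode)  # mode is REQUIRED or NULLABLE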

Any guidance on resolving this issue would be greatly appreciated. Thank you!
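In case it is relevant, one workaround I could try is forcing the Spark schema itself to non-nullable before the write, on the assumption (not confirmed) that the connector derives the BigQuery field mode from the DataFrame's nullability; data and spark refer to the DataFrame and SparkSession from the snippet above:

from pyspark.sql.types import StructField, StructType

# Assumption: the connector may map Spark nullability to the BigQuery mode
# (NULLABLE vs REQUIRED), so rebuild the DataFrame with every field marked
# non-nullable before writing. Untested against the connector's behavior.
required_schema = StructType(
    [StructField(f.name, f.dataType, nullable=False) for f in data.schema.fields]
)
data = spark.createDataFrame(data.rdd, required_schema)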