GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.

"spark.sql.sources.partitionOverwriteMode": "DYNAMIC" - creates additional tables #1314

MichalBogoryja commented 1 week ago

When writing a Spark dataframe to an existing partitioned BQ table, the table itself is modified in the expected way (partitions are added/overwritten). However, an additional table is also saved; it contains exactly the data of the dataframe I was writing. To reproduce:

database state: empty

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
    .config("enableReadSessionCaching", "false")
    .getOrCreate()
)

sdf.write.format("bigquery") \
    .option("partitionField", "curdate") \
    .option("partitionType", "DAY") \
    .mode("overwrite") \
    .save(f"{gcp_project_id}.{db}.{table_name}")
```
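For a self-contained repro, sdf could be built like this (hypothetical sample data; only the curdate DATE column matters for the partitioning options above):

```python
from datetime import date

# hypothetical sample data -- any schema with a 'curdate' DATE column works
sdf = spark.createDataFrame(
    [(1, "a", date(2024, 1, 1)), (2, "b", date(2024, 1, 2))],
    ["id", "value", "curdate"],
)
```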

database state: one table named {table_name} - data as in sdf

```python
sdf_2.write.format("bigquery").mode("overwrite").save(f"{gcp_project_id}.{db}.{table_name}")
```

database state: one table named {table_name} - data as in sdf plus the new data from sdf_2 (or, if sdf_2 contains the same partitions as sdf, the original partitions are overwritten), AND an additional table named {table_name}<random digits> (e.g. table_name4467706876500)

Can you modify the save logic so that this additional table is not created (or is dropped once the save completes)?
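In the meantime, a possible workaround is to sweep the dataset after each write with the google-cloud-bigquery client. A minimal sketch, assuming the leftover tables always follow the <table_name><digits> naming seen above (this is not connector API, just a dataset cleanup):

```python
import re
from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)

# leftover tables appear to be named <table_name> followed by digits
leftover = re.compile(rf"^{re.escape(table_name)}\d+$")

for table in client.list_tables(db):  # db is the dataset id
    if leftover.match(table.table_id):
        # TableListItem.reference identifies the table to delete
        client.delete_table(table.reference)
```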