When writing a Spark DataFrame to an existing partitioned BigQuery table, the table is modified as expected (partitions are added or overwritten). However, an additional table is also created, containing exactly the data of the DataFrame that was being written to the target table.
To reproduce:
database state: empty
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").config("enableReadSessionCaching", "false").getOrCreate()
spark
sdf.write.format("bigquery").option('partitionField', 'curdate').option('partitionType', 'DAY').mode('overwrite').save(f"{gcp_project_id}.{db}.{table_name}")
database state: one table named {table_name} - data as in sdf
sdf_2.write.format("bigquery").mode('overwrite').save(f"{gcp_project_id}.{db}.{table_name}")
database state:
one table named {table_name} - data as in sdf plus the new data from sdf_2 (or, if sdf_2 contains the same partitions as sdf, the original partitions are overwritten)
an ADDITIONAL table named {table_name}_randomnumbers (e.g. table_name4467706876500)
Can you modify the save function so that it does not leave this additional table behind (or drops it once the save has completed)?
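In the meantime, a cleanup along these lines could serve as a stopgap on my side. It is only a sketch under assumptions: the helper names `find_leftover_tables` and `drop_leftover_tables` are made up for illustration, the name pattern (target table name followed by an optional underscore and digits) is inferred from the single example above, and it relies on the `google-cloud-bigquery` client rather than anything in the connector itself.

```python
import re


def find_leftover_tables(table_ids, table_name):
    """Return table ids that look like `table_name` plus a numeric suffix.

    Assumption: leftover tables are named like table_name4467706876500,
    optionally with an underscore before the digits.
    """
    pattern = re.compile(rf"^{re.escape(table_name)}_?\d+$")
    return [t for t in table_ids if pattern.match(t)]


def drop_leftover_tables(project_id, dataset, table_name):
    """Delete tables in the dataset whose names match the leftover pattern."""
    # Imported lazily so the name-matching helper has no external dependency.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    table_ids = [t.table_id for t in client.list_tables(f"{project_id}.{dataset}")]
    for table_id in find_leftover_tables(table_ids, table_name):
        client.delete_table(f"{project_id}.{dataset}.{table_id}", not_found_ok=True)
```

Calling `drop_leftover_tables(gcp_project_id, db, table_name)` right after the second write would remove the extra table. The pattern would also match a legitimate table such as `{table_name}2023`, so it is only safe in a dataset where no real table is named that way.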