Open sangeethsasidharan opened 10 months ago
I am also running into this issue, but using a DeltaCatalog.
Copy/Paste code to reproduce:
```python
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import concurrent.futures


def write_table(spark, table_name):
    spark.sql("SELECT 'def' as name").write.mode("append").saveAsTable(table_name)


def test_simple(spark: SparkSession):
    spark.sql("DROP TABLE IF EXISTS test_table")
    spark.sql("DROP TABLE IF EXISTS test_table_2")
    empty_df = spark.createDataFrame([], StructType([StructField("name", StringType())]))
    empty_df.write.saveAsTable("test_table")
    empty_df.write.saveAsTable("test_table_2")

    # OK: appending from the main thread succeeds
    write_table(spark, "test_table")

    # Not OK: the same append from a worker thread fails
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(write_table, spark, "test_table_2")
        future.result()


test_simple(spark.getActiveSession())
```
Relevant Spark Config:
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalogImplementation", "in-memory")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
I have tried variations of this config without success.
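For reference, a minimal local session assembling these options might look roughly like the sketch below; the delta-spark package coordinates and the `local[*]` master are illustrative and may differ from the failing setup:

```python
from pyspark.sql import SparkSession

# Minimal local session sketch for the repro above; the package version and
# local[*] master are illustrative, not taken from the failing environment.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("delta-saveAsTable-repro")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalogImplementation", "in-memory")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```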
Environment info:
The code actually runs ok on Databricks 14.3 LTS.
The issue can be worked around by using the DataFrameWriterV2 methods instead:
spark.sql("SELECT 'def' as name").writeTo(table_name).append()
Bug
Describe the problem
My simplified use case is to read data from one location and append it in batches to a Delta Lake table registered in a Hive Metastore. I have to do this for a couple of tables concurrently, so I use a Python ThreadPoolExecutor, with each thread running the append operation for a different table.
But when I use append mode with saveAsTable, the second batch append fails with this error:
```
The column number of the existing table spark_catalog.test_schema.delta_dummy (struct<>) doesn't match the data schema (struct<id:int,_c1:string>).
```
Somehow it is not able to resolve the target table's current schema; the existing table comes back as an empty struct.
Steps to reproduce
```python
from pyspark.sql import SparkSession
from delta.tables import *
import concurrent.futures

AWS_ACCESS_KEY_ID = "AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "AWS_SECRET_ACCESS_KEY"

dep_packages = 'io.delta:delta-spark_2.12:3.0.0,' \
    'org.apache.spark:spark-avro_2.12:3.5.0,' \
    'org.apache.hadoop:hadoop-aws:3.3.1,' \
    'com.amazonaws:aws-java-sdk-bundle:1.11.901'

spark = SparkSession \
    .builder \
    .appName("pyspark-notebook") \
    .config("spark.jars.packages", dep_packages) \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("hive.metastore.uris", "thrift://localhost:9085") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID) \
    .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) \
    .config("spark.sql.warehouse.dir", "s3a://dw_path/delta_db") \
    .enableHiveSupport() \
    .getOrCreate()


def dummy_check_for_new_changes(logic_func):
    # Run the append logic on a worker thread
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        executor.submit(logic_func)


def insert_data():
    try:
        df = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .load('test.csv')
        df.show()
        df \
            .write \
            .saveAsTable("test_schema.delta_dummy", format='delta', mode='append')
        print("inserted data")
    except Exception as e:
        print(e)
        raise e


def execute_code():
    insert_data()
    # The second append is the one that fails
    insert_data()


dummy_check_for_new_changes(execute_code)
```
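For comparison, the same append written against the DataFrameWriterV2 API (the workaround mentioned in the comment above) would look roughly like the sketch below; `insert_data_v2` is an illustrative name and the table is assumed to already exist:

```python
def insert_data_v2():
    # Sketch only: same CSV read as above, but appending through the
    # DataFrameWriterV2 API instead of DataFrameWriter.saveAsTable().
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("test.csv")
    )
    df.writeTo("test_schema.delta_dummy").append()
```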
Observed results
I get the below error on the second append operation:
```
The column number of the existing table spark_catalog.rbl_aura_ledger.hold_history_delta_dummy1 (struct<>) doesn't match the data schema (struct<id:int,_c1:string>).
```
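A quick way to check what schema the catalog reports for the table at that point (diagnostic sketch only):

```python
# Diagnostic sketch: inspect the schema the metastore returns for the table
# right before the second append; the error suggests it resolves to struct<>.
spark.sql("DESCRIBE TABLE test_schema.delta_dummy").show(truncate=False)
```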
Expected results
It is supposed to append the data to the table.
Further details
Environment information