delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, along with APIs for multiple languages
https://delta.io
Apache License 2.0

[BUG] Incorrect error message when using INSERT SELECT * and the source table has fewer columns than the target table #3701

Open · felipepessoto opened 2 months ago

felipepessoto commented 2 months ago

Bug

Which Delta project/connector is this regarding?

Spark

Describe the problem

The error message is misleading: [DELTA_DUPLICATE_COLUMNS_FOUND] Found duplicate column(s) in the data to save: name

Steps to reproduce

DROP TABLE IF EXISTS MySourceTable;
DROP TABLE IF EXISTS MyTargetTable;
CREATE TABLE MySourceTable USING DELTA AS SELECT 1 as Id, 30 as Age, 'John' as Name;
CREATE TABLE MyTargetTable (Id INT, Name STRING) USING DELTA;
INSERT INTO MyTargetTable SELECT * FROM MySourceTable;

Observed results

[DELTA_DUPLICATE_COLUMNS_FOUND] Found duplicate column(s) in the data to save: name
org.apache.spark.sql.delta.schema.SchemaMergingUtils$.checkColumnNameDuplication(SchemaMergingUtils.scala:123)
org.apache.spark.sql.delta.schema.SchemaMergingUtils$.mergeSchemas(SchemaMergingUtils.scala:168)
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation$.mergeSchema(ImplicitMetadataOperation.scala:219)
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata(ImplicitMetadataOperation.scala:84)
org.apache.spark.sql.delta.schema.ImplicitMetadataOperation.updateMetadata$(ImplicitMetadataOperation.scala:66)
org.apache.spark.sql.delta.commands.WriteIntoDelta.updateMetadata(WriteIntoDelta.scala:77)
org.apache.spark.sql.delta.commands.WriteIntoDelta.writeAndReturnCommitData(WriteIntoDelta.scala:162)
org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:106)
org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:101)
org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:227)
org.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:101)
org.apache.spark.sql.delta.catalog.WriteIntoDeltaBuilder$$anon$1$$anon$2.insert(DeltaTableV2.scala:432)
org.apache.spark.sql.execution.datasources.v2.SupportsV1Write.writeWithV1(V1FallbackWriters.scala:79)

Expected results

A message saying the data source schema doesn't match the target table's columns, for example that the number of columns in the source query doesn't match the number of columns in the target table.
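As an aside (not part of the original report), a workaround sketch using the tables from the repro above: naming the target columns explicitly avoids the positional * expansion entirely:

INSERT INTO MyTargetTable (Id, Name) SELECT Id, Name FROM MySourceTable;

With an explicit column list, a mismatched column count should fail analysis with a clear arity error rather than the duplicate-column message.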

Environment information

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

felipepessoto commented 2 months ago

This happens because Spark expands the * in INSERT INTO MyTargetTable SELECT * FROM MySourceTable into: INSERT INTO MyTargetTable SELECT Id as Id, Age as Name, Name FROM MySourceTable. The aliasing itself makes sense, since the second column of the target is Name, but the unmatched third source column (Name) then collides with the aliased one, which is what trips the duplicate-column check. I think we need a column count validation first. The analyzed plan shows the expansion:

Project [Id#724 AS Id#760, cast(Age#725 as string) AS Name#761, Name#726]
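To make the positional matching concrete, a small illustrative example (the values are made up, not from the report): the SELECT-list aliases are ignored and outputs are bound to target columns purely by position:

-- 5 lands in Id and 'Mary' in Name, despite the aliases.
INSERT INTO MyTargetTable SELECT 5 AS Age, 'Mary' AS Whatever;

A column count check before SchemaMergingUtils.mergeSchemas would catch the three-versus-two mismatch up front, before the expanded plan ever produces two Name columns.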