delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs for multiple languages
https://delta.io
Apache License 2.0

[Feature Request] Support data type changes for schema evolution #1111

Open frankyangdev opened 2 years ago

frankyangdev commented 2 years ago

Feature request

Overview

Currently, only a limited set of schema changes is eligible for schema evolution during table appends or overwrites.

Is it possible to support more data type changes during append operations when schema evolution is enabled?
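
For reference, a minimal sketch of the append-time evolution that already works, adding a brand-new column with mergeSchema (assuming PySpark with the Delta Lake package configured; the path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Initial table with a single long column.
spark.createDataFrame([(1,)], "id: long") \
    .write.format("delta").save("/tmp/delta/events")

# Appending a dataframe with an extra column succeeds once mergeSchema is set;
# the new column is added to the table schema and back-filled with nulls.
spark.createDataFrame([(2, "click")], "id: long, event: string") \
    .write.format("delta").option("mergeSchema", "true") \
    .mode("append").save("/tmp/delta/events")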

Motivation

For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type has changed at the source and schema evolution is enabled.

For example, suppose the scale of a decimal column changes from 2 to 4 while the precision stays the same:

// i.e. the data type of one column changes from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)

The error is: "Failed to merge decimal types with incompatible scale 2 and 4"
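
A minimal, self-contained reproduction along those lines might look like this (a sketch, assuming PySpark with Delta configured; the path and column name are hypothetical):

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# Initial write with decimal(38,2).
schema_v1 = StructType([StructField("amount", DecimalType(38, 2))])
spark.createDataFrame([(Decimal("10.25"),)], schema_v1) \
    .write.format("delta").save("/tmp/delta/history")

# The source schema changes to decimal(38,4); the append below fails with
# "Failed to merge decimal types with incompatible scale 2 and 4"
# even though mergeSchema is enabled.
schema_v2 = StructType([StructField("amount", DecimalType(38, 4))])
spark.createDataFrame([(Decimal("10.2534"),)], schema_v2) \
    .write.format("delta").option("mergeSchema", "true") \
    .mode("append").save("/tmp/delta/history")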

Can this decimal scale change be supported by Delta schema evolution during table appends? Please review other data types as well.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

dennyglee commented 2 years ago

Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional ones, thanks to @zsxwing.

That said, please chime in if anyone has other thoughts; perhaps I am missing something here or over-complicating things?

frankyangdev commented 2 years ago

Thanks @dennyglee for checking and for the explanation. I hope this decimal data type change can be supported in schema evolution in the future. Cheers

zmeir commented 2 years ago

We currently don’t support IntegerType -> LongType

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

It just so happens that this is the exact error that led me to find this issue:

# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType

What's interesting is that the following workaround actually works:

from delta.tables import DeltaTable

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]

zsxwing commented 2 years ago

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

That's not the same, unfortunately. ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64.
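
One way to see this for yourself (a sketch, assuming PySpark plus pyarrow; the output path is hypothetical) is to write the four logical types to parquet and inspect the physical schema of the resulting file:

import glob

import pyarrow.parquet as pq
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# One row with each of the four logical integer types.
df = spark.range(1).select(
    col("id").cast("byte").alias("b"),    # ByteType
    col("id").cast("short").alias("s"),   # ShortType
    col("id").cast("int").alias("i"),     # IntegerType
    col("id").cast("long").alias("l"),    # LongType
)
df.coalesce(1).write.mode("overwrite").parquet("/tmp/int_types")

# The parquet footer should show b, s and i as physical int32 (with bit-width
# annotations) and l as physical int64.
part_file = glob.glob("/tmp/int_types/*.parquet")[0]
print(pq.ParquetFile(part_file).schema)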

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()

This sounds like a bug to me. Could you open a new issue with a small reproduction?

liquidaty commented 1 year ago

I don't understand these two comments, especially taken together:

We currently don’t support IntegerType -> LongType

ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64

Why exactly would conversion from int32 to int64 be unsupported? Is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.

zsxwing commented 1 year ago

Why exactly would conversion from int32 to int64 be unsupported? Is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.

Supporting int32 to int64 will require a non-trivial change because this is not supported by parquet, and we would have to build a layer on top of parquet in order to support it. We consider this a feature request rather than a bug.
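
Until such a layer exists, one possible manual workaround (a sketch only, assuming PySpark with Delta configured and a hypothetical table path) is to rewrite the table once with the wider type, after which LongType appends merge cleanly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta/events"   # hypothetical table path

# Read the current table, widen int -> long explicitly, and overwrite both the
# data and the schema in one transaction. Note that this rewrites every data
# file, which is exactly the cost that built-in support would need to manage.
current = spark.read.format("delta").load(path)
current.withColumn("id", col("id").cast("long")) \
    .write.format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite").save(path)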

liquidaty commented 1 year ago

Thanks for the update. I don't follow the logic ending with not-a-bug, but I'm not looking to argue that and appreciate the response.

This issue appears possibly related to https://github.com/databricks/databricks-sql-cli/issues/39, in which it seems it is not possible to import data from a CSV file into a bigint column.

  1. Does anyone know if that other issue can be solved without solving this issue?
  2. Is there a generally recommended utility or approach for simply importing a CSV file into a Spark cluster without running into this IntegerType -> LongType problem when loading into a bigint column?

Perhaps these questions should be directed elsewhere; in any case any suggestions would be much appreciated!
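
For what it's worth, one way to sidestep the IntegerType -> LongType mismatch at CSV load time (a sketch, assuming PySpark; the paths, header option, and column names are hypothetical) is to give the CSV reader an explicit schema so the column is loaded as LongType from the start:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit schema: 'id' comes in as LongType instead of an inferred IntegerType,
# so the later write into a bigint column does not hit the merge error.
df = (spark.read
      .option("header", "true")
      .schema("id LONG, name STRING")
      .csv("/tmp/input.csv"))

df.write.format("delta").mode("append").save("/tmp/delta/target")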

zsxwing commented 1 year ago

@liquidaty what you are describing is not this issue. This issue is about automatically changing a column's type, such as from int32 to int64. Your issue is about writing an int32 value into a column whose type is int64. Could you create a separate issue to discuss it instead?

liquidaty commented 1 year ago

@zsxwing thank you. OK, I will open a separate issue. It seemed related because the "Failed to merge incompatible data types LongType and IntegerType" error noted in this thread by @zmeir is the same one described in the databricks-sql-cli issue.

zsxwing commented 1 year ago

@liquidaty actually, is the error you hit coming from COPY INTO? If so, you may need to talk to your Databricks support. That is a data ingestion feature provided by the Databricks Runtime; it is not part of the Delta Lake project. We are discussing building a similar feature inside Delta Lake (https://github.com/delta-io/delta/issues/1354), but that's a different story.

liquidaty commented 1 year ago

@zsxwing I believe so. It's from a COPY INTO that is invoked using databricks-sql-cli, and I would assume the CLI is simply passing the COPY INTO on to Databricks, which in turn passes it on to Spark. Good to know about #1354, thank you.

zsxwing commented 1 year ago

@liquidaty Thanks for confirming it. The COPY INTO command is not part of the Delta Lake project. Please talk to your Databricks support, as the Delta Lake project has nothing to do with that.