delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs for multiple languages
https://delta.io
Apache License 2.0

[Feature Request] Support data type changes for schema evolution #1111

Open frankyangdev opened 2 years ago

frankyangdev commented 2 years ago

Feature request

Overview

Currently, only a limited set of schema changes is eligible for schema evolution during table appends or overwrites.

Is it possible to support more data type changes during append operations when schema evolution is enabled?
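
For reference, a minimal sketch of the append-time evolution that already works, adding a brand-new column with mergeSchema (assuming PySpark with the Delta Lake package configured; the path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Initial table with a single long column.
spark.createDataFrame([(1,)], "id: long") \
    .write.format("delta").save("/tmp/delta/events")

# Appending a dataframe with an extra column succeeds once mergeSchema is set;
# the new column is added to the table schema and back-filled with nulls.
spark.createDataFrame([(2, "click")], "id: long, event: string") \
    .write.format("delta").option("mergeSchema", "true") \
    .mode("append").save("/tmp/delta/events")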

Motivation

For history table updates, we want to keep every changed record as a newly appended record, rather than overwriting the schema, when a data type has changed at the source and schema evolution is enabled.

For example, suppose the scale of a decimal column changes from 2 to 4 while the precision stays the same:

// i.e. the data type of one column changes from decimal(38,2) to decimal(38,4)
df.write.format("delta").option("mergeSchema", "true").mode("append").save(targetPath)

The error is: "Failed to merge decimal types with incompatible scale 2 and 4"
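
A minimal, self-contained reproduction along those lines might look like this (a sketch, assuming PySpark with Delta configured; the path and column name are hypothetical):

from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# Initial write with decimal(38,2).
schema_v1 = StructType([StructField("amount", DecimalType(38, 2))])
spark.createDataFrame([(Decimal("10.25"),)], schema_v1) \
    .write.format("delta").save("/tmp/delta/history")

# The source schema changes to decimal(38,4); the append below fails with
# "Failed to merge decimal types with incompatible scale 2 and 4"
# even though mergeSchema is enabled.
schema_v2 = StructType([StructField("amount", DecimalType(38, 4))])
spark.createDataFrame([(Decimal("10.2534"),)], schema_v2) \
    .write.format("delta").option("mergeSchema", "true") \
    .mode("append").save("/tmp/delta/history")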

Can this decimal scale change be supported by Delta schema evolution during table appends? Please review other data types as well.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

dennyglee commented 2 years ago

Thanks @frankyangdev. As per our Slack conversations, this one may be a little tricky. Let me summarize some key points and add some additional ones, thanks to @zsxwing.

That said, please chime in if anyone has other thoughts; perhaps I am missing something here or over-complicating things?

frankyangdev commented 2 years ago

Thanks @dennyglee for checking and for the explanation. I hope this decimal data type change can be supported in schema evolution in the future. Cheers

zmeir commented 2 years ago

We currently don’t support IntegerType -> LongType

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

It just so happens that this is the exact error that led me to find this issue:

# df schema is [id: bigint]
df.write.format("delta").save(path)
# new_df schema is [id: int, name: string]
new_df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
# error: Failed to merge fields 'id' and 'id'. Failed to merge incompatible data types LongType and IntegerType

What's interesting is that the following workaround actually works:

from delta.tables import DeltaTable

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()
# schema after this command: [id: bigint, name: string]

zsxwing commented 2 years ago

Huh... Why is that really? I naively assumed it should be just the same as ByteType -> ShortType -> IntegerType.

That's not the same, unfortunately. ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64.
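
One way to see this for yourself (a sketch, assuming PySpark plus pyarrow; the output path is hypothetical) is to write the four logical types to parquet and inspect the physical schema of the resulting file:

import glob

import pyarrow.parquet as pq
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# One row with each of the four logical integer types.
df = spark.range(1).select(
    col("id").cast("byte").alias("b"),    # ByteType
    col("id").cast("short").alias("s"),   # ShortType
    col("id").cast("int").alias("i"),     # IntegerType
    col("id").cast("long").alias("l"),    # LongType
)
df.coalesce(1).write.mode("overwrite").parquet("/tmp/int_types")

# The parquet footer should show b, s and i as physical int32 (with bit-width
# annotations) and l as physical int64.
part_file = glob.glob("/tmp/int_types/*.parquet")[0]
print(pq.ParquetFile(part_file).schema)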

DeltaTable.forPath(spark, path).merge(new_df, "false").whenNotMatchedInsertAll().execute()

This sounds like a bug to me. Could you open a new issue with a small reproduction?

liquidaty commented 1 year ago

I don't understand these two comments, especially taken together:

We currently don’t support IntegerType -> LongType

ByteType, ShortType and IntegerType are all stored using int32 in the physical parquet files, but LongType is stored using int64

Why exactly would conversion from int32 to int64 be unsupported? Is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.

zsxwing commented 1 year ago

Why exactly would conversion from int32 to int64 be unsupported? Is that not a bug? Even conversion from int64 to int32 should not be a problem if the original value does not exceed int32 limits.

Supporting int32 to int64 will require a non-trivial change because this is not supported by parquet, and we would have to build a layer on top of parquet in order to support it. We consider this a feature request rather than a bug.
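
Until such a layer exists, one possible manual workaround (a sketch only, assuming PySpark with Delta configured and a hypothetical table path) is to rewrite the table once with the wider type, after which LongType appends merge cleanly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
path = "/tmp/delta/events"   # hypothetical table path

# Read the current table, widen int -> long explicitly, and overwrite both the
# data and the schema in one transaction. Note that this rewrites every data
# file, which is exactly the cost that built-in support would need to manage.
current = spark.read.format("delta").load(path)
current.withColumn("id", col("id").cast("long")) \
    .write.format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite").save(path)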

liquidaty commented 1 year ago

Thanks for the update. I don't follow the logic ending with not-a-bug, but I'm not looking to argue that and appreciate the response.

This issue appears possibly related to https://github.com/databricks/databricks-sql-cli/issues/39, in which it seems it is not possible to import data from a CSV file into a bigint column.

  1. Does anyone know if that other issue can be solved without solving this issue?
  2. Is there a generally recommended utility or approach for simply importing a CSV file into a Spark cluster without running into this IntegerType -> LongType problem when loading into a bigint column?

Perhaps these questions should be directed elsewhere; in any case any suggestions would be much appreciated!
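
For what it's worth, one way to sidestep the IntegerType -> LongType mismatch at CSV load time (a sketch, assuming PySpark; the paths, header option, and column names are hypothetical) is to give the CSV reader an explicit schema so the column is loaded as LongType from the start:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit schema: 'id' comes in as LongType instead of an inferred IntegerType,
# so the later write into a bigint column does not hit the merge error.
df = (spark.read
      .option("header", "true")
      .schema("id LONG, name STRING")
      .csv("/tmp/input.csv"))

df.write.format("delta").mode("append").save("/tmp/delta/target")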

zsxwing commented 1 year ago

@liquidaty what you are describing is not this issue. This issue is about automatically changing a column's type, such as from int32 to int64. Your issue is about writing an int32 value into a column whose type is int64. Could you create a separate issue to discuss it instead?

liquidaty commented 1 year ago

@zsxwing thank you. OK, I will open a separate issue. It seemed related because the "Failed to merge incompatible data types LongType and IntegerType" error noted in this thread by @zmeir is the same one described in the databricks-sql-cli issue.

zsxwing commented 1 year ago

@liquidaty actually, is the error you hit coming from COPY INTO? If so, you may need to talk to your Databricks support. That is a data ingestion feature provided by the Databricks Runtime; it is not part of the Delta Lake project. We are discussing building a similar feature inside Delta Lake (https://github.com/delta-io/delta/issues/1354), but that's a different story.

liquidaty commented 1 year ago

@zsxwing I believe so. It's from a COPY INTO that is invoked using databricks-sql-cli, and I would assume the CLI is simply passing the COPY INTO on to Databricks, which in turn passes it on to Spark. Good to know about #1354, thank you.

zsxwing commented 1 year ago

@liquidaty Thanks for confirming it. The COPY INTO command is not part of the Delta Lake project. Please talk to your Databricks support, as the Delta Lake project has nothing to do with that.