delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs for multiple languages.
https://delta.io
Apache License 2.0

[WIP][Spark] Allow type widening for all supported type changes #3024

Open · johanl-db opened 2 weeks ago

johanl-db commented 2 weeks ago

The type changes added in this PR only work with Spark 4.0 / master, which contains the changes to the Parquet readers required to read the data after these type changes are applied.
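To illustrate why the new reader behavior matters, here is a minimal standalone sketch (not from this PR; the path is made up) of reading data written as INT32 back with a wider LongType read schema, the kind of upcast Spark 4.0 / master supports in its Parquet readers and older versions generally reject:

```scala
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Write a small Parquet file with an INT32 column.
spark.range(10).selectExpr("CAST(id AS INT) AS id")
  .write.mode("overwrite").parquet("/tmp/type_widening_demo")

// Read it back with a wider read schema (LongType). This conversion happens
// inside the Parquet reader and is what a widened Delta table relies on.
val widened = StructType(Seq(StructField("id", LongType)))
spark.read.schema(widened).parquet("/tmp/type_widening_demo").show()
```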

Description

Extend the list of supported type changes for type widening to include changes that can be supported with Spark 4.0.

How was this patch tested?

Added test cases for the new type changes to the existing type widening test suites.

Does this PR introduce any user-facing changes?

Yes: the listed type changes can now be used with type widening, either via ALTER TABLE CHANGE COLUMN TYPE or during schema evolution in MERGE and INSERT.
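For illustration, a minimal sketch of how such a widening might be applied (the table name is made up, and the `delta.enableTypeWidening` table property name is an assumption on my part rather than something stated in this PR):

```scala
// Hypothetical sketch: widening an INT column to BIGINT on a Delta table.
// Assumes Spark 4.0 / master with this change, and that type widening is
// gated behind the 'delta.enableTypeWidening' table property.
spark.sql("""
  CREATE TABLE events (id INT, value INT) USING delta
  TBLPROPERTIES ('delta.enableTypeWidening' = 'true')
""")

// Explicit widening via ALTER TABLE ... CHANGE COLUMN ... TYPE
// (a metadata-only change; existing Parquet files are not rewritten):
spark.sql("ALTER TABLE events CHANGE COLUMN value TYPE BIGINT")

// The same widening can also happen implicitly through schema evolution
// when a MERGE or INSERT brings in a wider source type.
```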

KamilKandzia commented 2 weeks ago

Will there be an option in the future to change a table column's type from int to string without overwriting the entire table? Unless such an option is already available (but I don't remember one).

johanl-db commented 2 weeks ago

> Will there be an option in the future to change a table column's type from int to string without overwriting the entire table? Unless such an option is already available (but I don't remember one).

There are currently no plans to support type changes other than the ones mentioned in the PR description.

Converting values when reading from a table that has had one of these widening type changes applied can easily be done directly in the Parquet reader, but other type changes are harder, either because: