confluentinc / kafka-connect-storage-common

Shared software among connectors that target distributed filesystems and cloud storage.
Other
5 stars 155 forks source link

CCLOG-2401: Parse logical types correctly for hive schemas. #299

Closed snehashisp closed 1 year ago

snehashisp commented 1 year ago

Problem

Some logical connect data types are not correctly inferred to their hive counterpart. These include Date and Timestamp datatypes which are inferred as int and long previously. However in the orc writer these are inserted as Timestamp and Dates respectively which causes data with such datatypes to fail when queried in hive. Time logical type can also either be int or long.

There is also a bug in the recursive parsing logic where if logical types are not handled if nested inside a struct, array or map. This is because in case of struct the convertPrimitiveMaybeLogical is never reached as the recursive calls always goes to the not logical convert method.

Solution

Assign the correct hive data types for the logical connect types. The hive logical conversion is not used currently in any of the downstream hdfs connectors which use this package. The current orc writer does not correctly infer logical types hence there should not be any backwards compatibility issues. Regardless this will be released as a minor version bump.

The recursive logic is fixed and some tests are added.

Does this solution apply anywhere else?
If yes, where?

Test Strategy

Testing done:

Release Plan

sudeshwasnik commented 1 year ago

QQ - is this a major release? Since it is backward-incompatible fix ? (schema is going to change).

snehashisp commented 1 year ago

QQ - is this a major release? Since it is backward-incompatible fix ? (schema is going to change).

Most places downstream, that I know of, use the primitive approach to conversion and hence will not be affected. Orc will move to the new schemas. The existing issues with logical types means most likely they are not being used, hence compatibility should be fine. Thinking of putting this under a minor version bump.

sudeshwasnik commented 1 year ago

@snehashisp gotcha. imo if schema can change with upgrade, we should term it backward-incompatible. If it is, ideally we should make a major release. @confluentinc/connect-team1 wdyt ? I don't have a strong opinion, but lets get one additional +1 for minor release.

sudeshwasnik commented 1 year ago

The changes LGTM! Nice work @snehashisp !!

pbadani commented 1 year ago

@snehashisp gotcha. imo if schema can change with upgrade, we should term it backward-incompatible. If it is, ideally we should make a major release. @confluentinc/connect-team1 wdyt ? I don't have a strong opinion, but lets get one additional +1 for minor release.

ok with minor release here. more important question is whether it will be a patch or a new version for the connector.