apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Task]: Switch to portable Date and Time type for JdbcIO #28359

Open Abacn opened 1 year ago

Abacn commented 1 year ago

What needs to happen?

Currently there are still a couple of non-portable logical types defined in https://github.com/apache/beam/blob/926774dd02be5eacbe899ee5eceab23afb30abca/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/LogicalTypes.java

This prevents cross-language JdbcIO from reading or writing rows with these types. The most commonly used are the Date and Time types. We should migrate them to portable logical types and also support them on the Python side.

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

Abacn commented 1 year ago

Note: as a workaround for columns of Date (or Time) type, you can register your own logical type:


import datetime

from apache_beam.typehints.schemas import LogicalType
from apache_beam.typehints.schemas import MillisInstant
from apache_beam.utils.timestamp import Timestamp


# Workaround: handle Date columns as a timestamp-backed logical type.
@LogicalType.register_logical_type
class DateType(LogicalType[datetime.date, MillisInstant, str]):
  def __init__(self, unused=""):
    pass

  @classmethod
  def representation_type(cls):
    # type: () -> type
    return Timestamp

  @classmethod
  def urn(cls):
    return "beam:logical_type:javasdk:v1"

  @classmethod
  def language_type(cls):
    return datetime.date

  def to_representation_type(self, value):
    # type: (datetime.date) -> Timestamp
    # Encode the date as midnight UTC of that day.
    return Timestamp.from_utc_datetime(
        datetime.datetime.combine(
            value, datetime.time.min, tzinfo=datetime.timezone.utc))

  def to_language_type(self, value):
    # type: (Timestamp) -> datetime.date
    # Drop the time of day; only the UTC calendar date is kept.
    return value.to_utc_datetime().date()

  @classmethod
  def argument_type(cls):
    return str

  def argument(self):
    return ""

  @classmethod
  def _from_typing(cls, typ):
    return cls()
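
To see what the two conversion methods in the workaround do, here is a minimal sketch of the same round trip using only the standard library. It assumes the instant representation can be modeled as a plain integer count of milliseconds since the Unix epoch. The names `date_to_millis` and `millis_to_date` are illustrative and not part of Beam.

```python
import datetime

UTC = datetime.timezone.utc
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=UTC)


def date_to_millis(value):
    # type: (datetime.date) -> int
    """Map a date to epoch milliseconds at UTC midnight of that day."""
    midnight = datetime.datetime.combine(value, datetime.time.min, tzinfo=UTC)
    return (midnight - EPOCH) // datetime.timedelta(milliseconds=1)


def millis_to_date(millis):
    # type: (int) -> datetime.date
    """Map epoch milliseconds back to the UTC calendar date."""
    return (EPOCH + datetime.timedelta(milliseconds=millis)).date()


d = datetime.date(2023, 9, 8)
# The round trip is lossless because the time of day is always midnight UTC.
assert millis_to_date(date_to_millis(d)) == d
```

Any time-of-day information in the stored value is discarded on the way back, so this only suits columns that hold pure dates.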

I recall why the Date type support was incomplete. The Java JdbcIO implemented Date and Time with non-portable logical types backed by joda Instant, while the Beam portable Date and Time logical types are backed by the more modern java.time.LocalDate and java.time.LocalTime. We cannot simply change the Java JdbcIO because that would be a breaking change. This involves two difficult compatibility issues.
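
The mismatch is easiest to see by writing the same value in both styles of representation. The sketch below uses only the Python standard library. It assumes the instant-backed type stores milliseconds since the epoch, and that the local date and local time types are stored as a count of days since the epoch and nanoseconds since midnight, respectively. Those storage details and the helper names are assumptions for illustration, not taken from this thread.

```python
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)


def date_as_instant_millis(value):
    """Instant-style encoding: UTC midnight as epoch milliseconds."""
    return (value - EPOCH_DATE).days * 86_400_000


def date_as_epoch_days(value):
    """Local-date-style encoding: whole days since 1970-01-01."""
    return (value - EPOCH_DATE).days


def time_as_nanos_of_day(value):
    """Local-time-style encoding: nanoseconds since midnight."""
    seconds = value.hour * 3600 + value.minute * 60 + value.second
    return seconds * 1_000_000_000 + value.microsecond * 1_000


d = datetime.date(1970, 1, 11)
print(date_as_instant_millis(d))  # 864000000
print(date_as_epoch_days(d))      # 10
print(time_as_nanos_of_day(datetime.time(0, 0, 1)))  # 1000000000
```

An instant pins a point on the UTC timeline, while a local date or time carries no zone, so converting between the two requires choosing a zone (UTC midnight in the workaround above). That choice is part of why switching the backing type is not a drop-in change.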

One possible fix for the transition is to add a flag to the IOs that directs them to produce java.time results and use portable logical types. This would enable a Python implementation, and would also eliminate the need to call LogicalType.register_logical_type(MillisInstant) to override the logical type mapping.

lazarillo commented 1 month ago

The link provided under "migrating from joda time to java.time" points to a PR that is behind Google's corp firewall.

It would help if the issues were exposed to the OS community, since it's an OS issue.

liferoad commented 1 month ago

Thanks, just updated the comment.