
[Task]: Support more Beam portable schema types as Python types #25946

Open ahmedabu98 opened 1 year ago

ahmedabu98 commented 1 year ago

What needs to happen?

Beam portable schemas include primitive types as well as more complex types, which are represented as logical types. Some of these are already supported in the Python SDK: https://github.com/apache/beam/blob/99202b237e364bf77f40b6da0ec22cb7b17c37d0/sdks/python/apache_beam/typehints/schemas.py#L23-L41

When necessary, Python classes are created to represent a portable type. For example, see Timestamp below: https://github.com/apache/beam/blob/99202b237e364bf77f40b6da0ec22cb7b17c37d0/sdks/python/apache_beam/utils/timestamp.py#L45

There are some portable types still missing from the Python SDK (e.g. Date, DateTime, Time) that we should add support for to make the cross-language experience smoother.
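
For illustration, here is a hedged sketch of how one of these missing types could be registered, following the same pattern MicrosInstant uses in schemas.py. The URN and the days-since-epoch representation below are assumptions made for the sketch, not the portable spec's canonical definitions:

```python
# Sketch only: the URN and int-days representation are illustrative
# assumptions, not Beam's canonical Date definition.
import datetime

from apache_beam.typehints.schemas import LogicalType, NoArgumentLogicalType

_EPOCH = datetime.date(1970, 1, 1)


@LogicalType.register_logical_type
class DateLogicalType(NoArgumentLogicalType[datetime.date, int]):
  """Maps datetime.date to days since the epoch."""
  @classmethod
  def urn(cls):
    return "beam:logical_type:date:v1"  # hypothetical URN for this sketch

  @classmethod
  def representation_type(cls):
    return int

  @classmethod
  def language_type(cls):
    return datetime.date

  def to_representation_type(self, value):
    # Encode a date as the number of days since 1970-01-01.
    return (value - _EPOCH).days

  def to_language_type(self, value):
    # Decode days-since-epoch back into a datetime.date.
    return _EPOCH + datetime.timedelta(days=value)
```

With a registration along these lines, the schema machinery could in principle round-trip Date fields between SDKs the same way it does for Timestamp.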

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)


pavleec commented 6 months ago

The JSON type is also missing in the Python SDK 😕

unography commented 4 months ago

Hi @ahmedabu98, currently the GEOGRAPHY data type isn't supported, and it throws an error here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L110-L123

Are there plans to add support for it?
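
Until it's supported natively, one possible workaround (a sketch, assuming the data is read via a query; the project and table names are made up) is to cast GEOGRAPHY columns to their WKT text form in SQL, so the Python SDK only ever sees STRING fields:

```python
# Workaround sketch: ST_ASTEXT converts a GEOGRAPHY column to WKT text, so
# only a STRING field reaches the Python SDK. Names are illustrative.
import apache_beam as beam

with beam.Pipeline() as p:
  rows = p | beam.io.ReadFromBigQuery(
      query="""
          SELECT id, ST_ASTEXT(geo) AS geo_wkt
          FROM `my-project.my_dataset.my_table`
      """,
      use_standard_sql=True)
```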

jd185367 commented 1 month ago

I'd also like to bump this as needed for using WriteToBigQuery in Python:

https://github.com/apache/beam/blob/fa9eb2fe17f5f96b40275fe7b0a3981f4a52e0df/sdks/python/apache_beam/io/gcp/bigquery.py#L1887

Google recommends the STORAGE_WRITE_API method in their Dataflow best practices, and that method requires passing this transform a schema argument for the table. But since many of our BigQuery tables have DATE or DATETIME columns, which these Python schemas don't support yet, we aren't able to use it.
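
For reference, the blocked call pattern looks roughly like this (a sketch; the table and field names are made up). The DATE field in the schema is what hits the unsupported-type path:

```python
# Sketch of the blocked usage: STORAGE_WRITE_API requires an explicit schema,
# and a DATE field in that schema runs into the missing portable-type support.
import apache_beam as beam

with beam.Pipeline() as p:
  (p
   | beam.Create([{"id": 1, "event_date": "2024-01-01"}])
   | beam.io.WriteToBigQuery(
       table="my-project:my_dataset.events",  # illustrative table
       schema={
           "fields": [
               {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
               {"name": "event_date", "type": "DATE", "mode": "NULLABLE"},
           ]
       },
       method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API))
```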

As of Beam 2.60.0, we haven't found a workaround: e.g., declaring our DATE columns as TIMESTAMP in the Python schema seems to fail either when Beam actually writes to BigQuery, or later, when the Java side runs its own conversion. If anyone knows a workaround for this, I'd appreciate it.

As a side note: why does STORAGE_WRITE_API require specifying a schema in advance, while STREAMING_INSERTS does not?