apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug] [BQ Streaming Inserts]: Whole number in float field is treated like an int #29622

Open ahmedabu98 opened 9 months ago

ahmedabu98 commented 9 months ago

What happened?

When writing to BQ with streaming inserts, we serialize rows to JSON, and int values are limited to a maximum of < 2^64. Writing ints above this limit results in a TypeError: Integer exceeds 64-bit range (floats, by contrast, can represent much larger values).

When writing a whole number to a float field, the IO treats it as an int during serialization. This becomes a problem when trying to write a large whole number to a FLOAT field, because it triggers the error above.
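
The limit comes from the JSON serialization step itself. A minimal sketch of the failure outside of Beam, assuming the library behind fast_json_dumps is orjson (an assumption based on the error text, not confirmed here):

import orjson

# 2**64 does not fit in 64 bits, so orjson refuses to serialize it:
# TypeError: Integer exceeds 64-bit range
try:
  orjson.dumps({"float": 18446744073709551616})
except TypeError as e:
  print(e)

# The same value written as a Python float serializes fine.
print(orjson.dumps({"float": 18446744073709551616.0}))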

Reproduce with this code:

import datetime

import apache_beam as beam
import pytz

with beam.Pipeline() as p:
  (p | beam.Create([
      {
        "str": "hello",
        "ts": datetime.datetime(1970, 1, 1, tzinfo=pytz.utc),
        "float": 18446744073709551616
      }]) | beam.io.WriteToBigQuery(
    table="google.com:clouddfe:ahmedabualsaud_test.actual_repro",
    schema="str:STRING,ts:TIMESTAMP,float:FLOAT",
    method="STREAMING_INSERTS"))

Instead of that float value, either of the following works:

"float": 18446744073709551615 (just below the 2^64 limit)
"float": 18446744073709551616.0 (explicitly written as a float)
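
Until the IO handles this, one possible workaround is to coerce values destined for FLOAT columns into Python floats before the write. A minimal sketch (the coerce_floats helper and FLOAT_FIELDS set are hypothetical, not part of the Beam API):

FLOAT_FIELDS = {"float"}  # hypothetical: names of the FLOAT columns in the schema

def coerce_floats(row):
  # Cast FLOAT-column values to Python float so the JSON encoder never
  # sees a large int for those fields.
  return {
      k: float(v) if k in FLOAT_FIELDS and v is not None else v
      for k, v in row.items()
  }

Apply it with beam.Map(coerce_floats) just before beam.io.WriteToBigQuery in the pipeline above.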

Stacktrace:

  File "apache_beam/runners/common.py", line 1493, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 574, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle
    def invoke_finish_bundle(self):
  File "apache_beam/runners/common.py", line 580, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle
    self.signature.finish_bundle_method.method_value())
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery.py", line 1602, in finish_bundle
    return self._flush_all_batches()
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery.py", line 1610, in _flush_all_batches
    *[
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery.py", line 1611, in <listcomp>
    self._flush_batch(destination)
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery.py", line 1639, in _flush_batch
    passed, errors = self.bigquery_wrapper.insert_rows(
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py", line 1264, in insert_rows
    rows = [
  File "/Users/ahmedabualsaud/github/apache/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py", line 1265, in <listcomp>
    fast_json_loads(fast_json_dumps(r, default=default_encoder))
TypeError: Integer exceeds 64-bit range [while running 'WriteToBigQuery/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

johnjcasey commented 9 months ago

We may want to simply not work on this, since Streaming Inserts is discouraged in favor of the Storage Write API.

johnjcasey commented 9 months ago

In general, streaming inserts issues will not be prioritized, as the Storage Write API is preferred for streaming writes to BQ.