apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Errors when switching from streaming_inserts to Storage_Write_API when using repeated Struct or Record fields #32155

Open TobiasBredow opened 3 months ago

TobiasBredow commented 3 months ago

What happened?

I noticed some differences when switching ingestion from Streaming_Inserts to the Storage_Write_API in the WriteToBigQuery transform, using the Python API.

Namely, with the old ingestion method it is possible to pass in empty repeated fields, and they will default to an empty list. However, this fails as soon as the newer Storage_Write_API is used.

It seems that because the input is converted to a Beam Row before being sent to the Java API, it runs into an error in beam_row_from_dict: fields that are not present are converted to None, but if such a field is a repeated struct or record, the conversion then fails when trying to access the field (line 1596). Is this intended new behavior? It forces us to always add an empty list to the dict before sending it to the WriteToBigQuery transform. Especially if you have multiple of these fields in a high-frequency source, they add to the data-processed costs in Dataflow.

I would also be happy to adjust this behavior myself, since it looks like a small and easy fix to me, if it is not by design that the transform fails this way on empty repeated fields.

Reproducible by providing no entry for a repeated struct or record field to the WriteToBigQuery transform while using the Storage_Write_API.
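Until this is fixed, the workaround described above (always adding an empty list to the dict) can be done in one place before WriteToBigQuery. The helper below is a hypothetical sketch, not Beam code: it takes a row dict plus a BigQuery-style JSON schema (a list of field dicts with `name` and `mode` keys) and fills in `[]` for every REPEATED field that is absent.

```python
def fill_missing_repeated_fields(row, schema_fields):
    """Return a copy of `row` with absent REPEATED fields set to [].

    `schema_fields` follows the BigQuery JSON schema convention:
    a list of dicts with at least "name" and "mode" keys.
    """
    out = dict(row)
    for field in schema_fields:
        if field.get("mode") == "REPEATED" and field["name"] not in out:
            out[field["name"]] = []
    return out


schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "events", "type": "RECORD", "mode": "REPEATED",
     "fields": [{"name": "kind", "type": "STRING", "mode": "NULLABLE"}]},
]

# A row with no "events" entry gets an explicit empty list.
fixed = fill_missing_repeated_fields({"id": 1}, schema)
```

In a pipeline this would run as a Map step before the write, e.g. `beam.Map(fill_missing_repeated_fields, schema_fields=schema)` (assuming the schema is available to the caller).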

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

TobiasBredow commented 3 months ago

Would be happy to also fix it if this is not wanted behavior.

TobiasBredow commented 3 months ago

.take-issue

liferoad commented 3 months ago

cc @ahmedabu98

ahmedabu98 commented 3 months ago

Thanks for opening this bug @TobiasBredow. I did a quick search, and the expected behavior you mention matches BQ's docs: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_nulls. Namely:

When you write a NULL array to a table, it is converted to an empty array.
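The quoted rule can be expressed directly. This is a sketch of the semantics the docs imply, which a fix would presumably follow (a hypothetical helper, not Beam's actual beam_row_from_dict implementation):

```python
def coerce_repeated_fields(row, modes):
    """Apply BigQuery's NULL-array rule: a NULL (or absent) value for a
    REPEATED field becomes an empty array.

    `modes` maps field name -> BigQuery mode string ("REQUIRED",
    "NULLABLE", or "REPEATED").
    """
    return {
        name: [] if mode == "REPEATED" and row.get(name) is None
        else row.get(name)
        for name, mode in modes.items()
    }
```

Under this rule a row like `{"id": 1, "tags": None}` would be written as `{"id": 1, "tags": []}` when `tags` is REPEATED, instead of failing during conversion.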

Please feel free to apply a fix and tag me for review.

TobiasBredow commented 3 months ago

Cool, thanks for the quick reply! I will need a bit of time to set up the environment, since it is my first time changing Beam. I will tag you for review once I have implemented the fix.

liferoad commented 3 months ago

> Cool, thanks for the quick reply! I will need a bit of time to set up the environment, since it is my first time changing Beam. I will tag you for review once I have implemented the fix.

Couple of useful links if you haven't checked them:

https://github.com/apache/beam/blob/master/contributor-docs/code-change-guide.md

https://github.com/apache/beam/blob/master/CONTRIBUTING.md

https://beam.apache.org/contribute/

Welcome to Beam!