Open TobiasBredow opened 3 months ago
Would be happy to also fix it if this is not the intended behavior.
.take-issue
cc @ahmedabu98
Thanks for opening this bug @TobiasBredow. I did a quick search and the expected behavior you mention matches with BQ's docs: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_nulls. Namely:
When you write a NULL array to a table, it is converted to an empty array.
Please feel free to apply a fix and tag me to review
Cool, thanks for the quick reply! I will need a bit of time to set up the environment since it's my first time changing Beam. I will tag you for the review once I have implemented the fix.
Couple of useful links if you haven't checked them:
https://github.com/apache/beam/blob/master/contributor-docs/code-change-guide.md
https://github.com/apache/beam/blob/master/CONTRIBUTING.md
https://beam.apache.org/contribute/
Welcome to Beam!
What happened?
I noticed some differences when switching ingestion from Streaming_Inserts to the Storage_Write_API in the WriteToBigQuery transform, using the Python API.
Namely, with the old ingestion method it is possible to omit repeated fields and they will default to an empty list. However, that fails as soon as the newer Storage_Write_API is used.
It seems that, since the transform converts the input dicts to Beam rows to send them to the Java API, it runs into an error in beam_row_from_dict: fields that are not present are converted to None, but if such a field is a repeated struct or record the conversion then fails when trying to access the field (line 1596). Is that new behavior intended? It forces us to always add an empty list to the dict before sending it to the WriteToBigQuery transform (roughly as sketched below), and if you have multiple of these fields in a high-frequency source they add to the data_processed costs in Dataflow.
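For context, this is roughly the workaround we have to apply today. A minimal sketch, assuming a hypothetical pad_repeated_fields helper and a hand-maintained REPEATED_FIELDS list of the repeated record fields in the schema:

```python
# Hypothetical: names of the repeated RECORD fields in our table schema.
REPEATED_FIELDS = ["attributes", "tags"]

def pad_repeated_fields(row):
    """Return a copy of the row with an explicit empty list for every
    repeated field the source omitted, so the Storage Write API path
    never sees None for them."""
    padded = dict(row)
    for field in REPEATED_FIELDS:
        padded.setdefault(field, [])
    return padded

# Applied right before the write, e.g.:
#   ... | beam.Map(pad_repeated_fields) | beam.io.WriteToBigQuery(...)
```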
I would also be happy to adjust this behavior myself, since it looks like a small and easy fix to me, if it is not by design that the transform fails in this way on missing repeated fields.
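If it helps, this is the direction I had in mind. It is only an illustration of the expected behavior (missing repeated fields become empty arrays, matching how BigQuery treats NULL arrays); the value_for_repeated_field helper is hypothetical and not the actual beam_row_from_dict code:

```python
# Hypothetical sketch of the proposed behavior: when a schema field is
# repeated (an array) and the incoming dict has no value for it, fall back
# to an empty list instead of None.
def value_for_repeated_field(row_dict, field_name, is_repeated):
    value = row_dict.get(field_name)
    if value is None and is_repeated:
        return []
    return value
```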
Reproducible by providing no entry for a repeated Struct or Record field to the WriteToBigQuery transform while using the Storage_Write_API; see the minimal sketch below.
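A minimal reproduction sketch; the table name and schema are hypothetical, and the point is that the input dict carries no entry at all for the repeated "attributes" field:

```python
import apache_beam as beam

# Hypothetical table and schema for illustration; "attributes" is a
# repeated RECORD field that the input row below does not populate.
TABLE = "my-project:my_dataset.my_table"
SCHEMA = {
    "fields": [
        {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
        {
            "name": "attributes",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [{"name": "key", "type": "STRING", "mode": "NULLABLE"}],
        },
    ]
}

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"id": 1}])  # no "attributes" entry at all
        | beam.io.WriteToBigQuery(
            TABLE,
            schema=SCHEMA,
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        )
    )
```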
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components