apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: PYTHON SDK / WriteToBigquery with STORAGE_WRITE_API fails when NULLABLE nested RECORD #30753

Open jledrumics opened 5 months ago

jledrumics commented 5 months ago

What happened?

Apache Beam Python SDK version 2.44

When using WriteToBigQuery with the STORAGE_WRITE_API method and a schema that contains a NULLABLE nested RECORD field, the conversion still tries to resolve the record's inner fields even when the field is missing from the input dict or is None, and then fails. This happens in the beam_row_from_dict method.

A failing example:

from apache_beam.io.gcp.bigquery_tools import beam_row_from_dict

schema = {
    "fields": [
        {"name": "log_id", "type": "STRING", "mode": "REQUIRED"},
        {
            "name": "nested",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [
                {"name": "id", "type": "STRING", "mode": "REQUIRED"},
                {"name": "source", "type": "STRING", "mode": "REQUIRED"},
                {"name": "channel", "type": "STRING", "mode": "NULLABLE"},
            ],
        },
    ]
}

row = {
    "log_id": "727254-32022246-026",
    "nested": None,  # same when quoting the field
}

beam_row = beam_row_from_dict(row, schema)

print(beam_row)
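
For context, the failure pattern looks roughly like the sketch below. This is a simplified illustration of a recursive dict-to-row conversion, not Beam's actual implementation: descending into a RECORD field without first checking for None means the next level of lookups runs against a None value.

def naive_row_from_dict(row, schema):
    # Simplified illustration only: a recursive conversion that descends
    # into RECORD fields without a None check.
    result = {}
    for field in schema["fields"]:
        value = row.get(field["name"])
        if field["type"] == "RECORD":
            # When a NULLABLE RECORD is None or missing, the recursive
            # call receives None and then fails on row.get().
            result[field["name"]] = naive_row_from_dict(value, field)
        else:
            result[field["name"]] = value
    return result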

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

liferoad commented 5 months ago

I suggest you just do this:

row = {
    "log_id": "727254-32022246-026",
    "nested": {"id": None, "source": None, "channel":None},
}
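
If that workaround is applied inside a pipeline, one option is to expand missing NULLABLE RECORDs before the write step. The sketch below assumes a helper named fill_nullable_records and placeholder project/dataset/table names; neither is part of Beam's API.

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery

def fill_nullable_records(row, schema_fields):
    # Placeholder helper: expand a top-level NULLABLE RECORD that is
    # missing or None into a dict of None sub-fields, so the row
    # conversion never descends into a None value.
    for field in schema_fields:
        if field["type"] == "RECORD" and field["mode"] == "NULLABLE":
            if row.get(field["name"]) is None:
                row[field["name"]] = {sub["name"]: None for sub in field["fields"]}
    return row

# Usage sketch (project, dataset, and table names are placeholders):
# rows | beam.Map(fill_nullable_records, schema["fields"])
#      | WriteToBigQuery(
#          "my-project:my_dataset.my_table",
#          schema=schema,
#          method=WriteToBigQuery.Method.STORAGE_WRITE_API)

Note that this only covers the top level of the schema; deeply nested NULLABLE RECORDs would need the same treatment recursively.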