dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[Bug] Error when loading NULL result from JSON column #1101

Closed Gilby216 closed 2 weeks ago

Gilby216 commented 6 months ago

Is this a new bug in dbt-bigquery?

Current Behavior

When I run the following query in dbt Cloud I get an error: Select parse_json(NULL)

Error encountered: Runtime Error the JSON object must be str, bytes or bytearray, not NoneType

However that same query runs successfully when I run it directly in the BigQuery console. It outputs a SQL NULL value.

The query above is the simplest reproduction case. If it's useful, the actual situation I am trying to solve here is querying into a JSON column to extract certain fields, where those fields are sometimes missing. To illustrate this more concretely, the following SQL produces the same error in dbt Cloud:

Select parsed_json_data.ex1 from ( Select parse_json('{"ex2":"example"}') AS parsed_json_data ) AS t1

But again, that query above runs successfully directly in BigQuery console and outputs a NULL value.

If you change this query to extract the field that does exist in the JSON, then everything works fine in both dbt Cloud and the BigQuery console:

Select parsed_json_data.ex2 from ( Select parse_json('{"ex2":"example"}') AS parsed_json_data ) AS t1

Expected Behavior

NULL values returned from JSON columns should not raise an error; they should be output as NULL.

Steps To Reproduce

In dbt Cloud, preview the following query/model: Select parse_json(NULL)

Relevant log output

Log output from dbt Cloud:

Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77c97e4a00>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77a87e18e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77a87e1a90>]}
Running with dbt=1.6.9
running dbt with arguments {'printer_width': '80', 'indirect_selection': 'empty', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'version_check': 'True', 'log_path': '/usr/src/dbt-server-shared/working_dir/a5144474-69cd-4d3d-86cb-29bc7bb69e24', 'profiles_dir': '/usr/src/dbt-server-shared/.dbt', 'debug': 'False', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'invocation_command': 'dbt -A dbt_worker.app worker --loglevel=DEBUG --concurrency=2 --max-memory-per-child=500000', 'introspect': 'True', 'static_parser': 'True', 'target_path': 'None', 'log_format': 'json', 'send_anonymous_usage_stats': 'True'}
Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'f0da6056-b33d-4df6-bf67-b46b86b15ec0', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77a84d6a00>]}
Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'f0da6056-b33d-4df6-bf67-b46b86b15ec0', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77a84d63a0>]}
Registered adapter: bigquery=1.6.9
Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'f0da6056-b33d-4df6-bf67-b46b86b15ec0', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77e1bffac0>]}
Found 64 models, 2 tests, 9 analyses, 9 sources, 0 exposures, 0 metrics, 511 macros, 0 groups, 0 semantic models
Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'f0da6056-b33d-4df6-bf67-b46b86b15ec0', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77e1bff850>]}
Acquiring new bigquery connection 'master'
Concurrency: 4 threads (target='default')
Traceback (most recent call last):
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/task/base.py", line 372, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/task/base.py", line 323, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/task/base.py", line 419, in run
    return self.execute(compiled_node, manifest)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/task/show.py", line 38, in execute
    adapter_response, execute_result = self.adapter.execute(
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/adapters/base/impl.py", line 295, in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch, limit=limit)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 511, in execute
    table = self.get_table_from_response(iterator)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 439, in get_table_from_response
    return agate_helper.table_from_data_flat(resp, column_names)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/clients/agate_helper.py", line 121, in table_from_data_flat
    for _row in data:
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/google/api_core/page_iterator.py", line 209, in _items_iter
    for item in page:
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/google/api_core/page_iterator.py", line 131, in __next__
    result = self._item_to_value(self._parent, item)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/google/cloud/bigquery/table.py", line 3106, in _item_to_row
    _helpers._row_tuple_from_json(resource, iterator.schema),
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/google/cloud/bigquery/_helpers.py", line 313, in _row_tuple_from_json
    row_data.append(_field_from_json(cell["v"], field))
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/google/cloud/bigquery/_helpers.py", line 289, in _field_from_json
    return converter(resource, field)
  File "/venv/dbt-1.6.0-latest/lib/python3.8/site-packages/dbt/adapters/bigquery/connections.py", line 70, in _json_from_json
    return json.loads(value)
  File "/usr/lib/python3.8/json/__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
Connection 'master' was properly closed.
Connection 'sql_operation.playback_data_warehouse.inline_query' was properly closed.
Encountered an error:
Runtime Error
  the JSON object must be str, bytes or bytearray, not NoneType
Command `cli show` failed at 17:55:00.739276 after 1.01 seconds
Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77c97e4a00>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77a87e1970>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f77c8614610>]}
Flushing usage events
Runtime Error
  the JSON object must be str, bytes or bytearray, not NoneType

Environment

- Python: 3.8
- dbt-core: 1.6.9

Additional Context

No response

dbeatty10 commented 6 months ago

Thanks for reporting this @Gilby216!

TLDR

I think https://github.com/dbt-labs/dbt-bigquery/pull/1061 will fix this.

Root cause

We've had a couple of issues that look similar to this one, namely https://github.com/dbt-labs/dbt-bigquery/issues/972 and https://github.com/dbt-labs/dbt-bigquery/issues/1055.

It looks like the underlying issue here lies within the fix we added in https://github.com/dbt-labs/dbt-bigquery/pull/974.

When I tried the following monkey patch instead, it worked:

def _json_from_json(value, _):
    """NOOP string -> string coercion"""

    return None if value is None else json.loads(value)

Solution

The fix might be covered by https://github.com/dbt-labs/dbt-bigquery/pull/1061, which removes our monkey patch in favor of relying on https://github.com/googleapis/python-bigquery/pull/1587 instead.

We should add the reproduction case below as a test case within https://github.com/dbt-labs/dbt-bigquery/pull/1061 to confirm.

When I removed the monkey patch in my local environment, it worked.

Reprex

Run the following while using dbt-bigquery:

dbt show --inline 'select parse_json(null)'

Get this error:

01:36:38  Unhandled error while executing 
the JSON object must be str, bytes or bytearray, not NoneType
01:36:38  Encountered an error:
Runtime Error
  the JSON object must be str, bytes or bytearray, not NoneType

Workaround

In the meantime, @jeremyyeo mentioned within https://github.com/dbt-labs/dbt-bigquery/issues/1055 that you can coerce the NULL to to_json(null) to handle this edge case:

Select ifnull(parse_json(NULL), to_json(null))

github-actions[bot] commented 3 weeks ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 2 weeks ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.