GoogleCloudPlatform / bigquery-utils

Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.
https://cloud.google.com/bigquery/
Apache License 2.0
1.07k stars 269 forks source link

Request: Handling for objects in json_extract_values function #350

Closed sgoley closed 2 months ago

sgoley commented 1 year ago

Issue: bqutils.fn.json_extract_values returns unusable values [object Object] when children are objects / sub collections.

Example: image

Sample query:

-- (select body from`bigquery-public-data.geo_whos_on_first.geojson` limit 100 )

with geojson as 
(select """{"id": 1259392143, "type": "Feature", "properties": {"edtf:cessation": "uuuu", "edtf:inception": "uuuu", "geom:area": 0.0, "geom:area_square_m": 0.0, "geom:bbox": "116.97547,32.41412,116.97547,32.41412", "geom:latitude": 32.41412, "geom:longitude": 116.97547, "gn:admin1_code": "1.0", "gn:asciiname": "Liuyingcun", "gn:country_code": "CN", "gn:dem": 25, "gn:feature_class": "P", "gn:feature_code": "PPL", "gn:geonameid": 10417093, "gn:latitude": 32.41412, "gn:longitude": 116.97547, "gn:modification_date": "2015-08-08", "gn:name": "Liuyingcun", "gn:population": 0, "gn:timezone": "Asia/Shanghai", "iso:country": "CN", "mz:hierarchy_label": 1, "mz:is_current": -1, "name:zho_x_preferred": ["\u5218\u90e2\u6751"], "src:geom": "geonames", "wof:belongsto": [85669739, 102191569, 85632695, 136253041, 890516233], "wof:breaches": [], "wof:concordances": {"gn:id": 10417093}, "wof:country": "CN", "wof:geomhash": "cabdc55714fc98313d2246b1c4f7fb0b", "wof:hierarchy": [{"continent_id": 102191569, "country_id": 85632695, "county_id": 890516233, "empire_id": 136253041, "locality_id": 1259392143, "region_id": 85669739}], "wof:id": 1259392143, "wof:lastmodified": 1537613321, "wof:name": "Liuyingcun", "wof:parent_id": 890516233, "wof:placetype": "locality", "wof:repo": "whosonfirst-data", "wof:superseded_by": [], "wof:supersedes": [], "wof:tags": []}, "bbox": [116.97547, 32.41412, 116.97547, 32.41412], "geometry": {"coordinates": [116.97547, 32.41412], "type": "Point"}}""" as body)

SELECT bqutil.fn.json_extract_keys(body) as jkeys, bqutil.fn.json_extract_values(body) as jvalues
FROM geojson

I used a single row here because the public dataset is 14GB and unpartitioned.

Desired Output: bqutils.fn.json_extract_values returns a usable value (even just a json string) of the contents of those objects.

Thanks!

sgoley commented 1 year ago

Alternative might be to build something like this as a whole separate UDF:

-- extract all key value pairs as an array from a json dict
-- input: json string with a dictionary
-- returns:  list of struct <key, value>
CREATE TEMP  FUNCTION EXTRACT_KV_PAIRS(json_str STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
  try{ 
    const json_dict = JSON.parse(json_str); 
    const all_kv = Object.entries(json_dict).map(
        (r)=>Object.fromEntries([["key", r[0]],["value",  
                                   JSON.stringify(r[1])]]));
    return all_kv;
  } catch(e) { return [{"key": "error","value": e}];}
""";

Source: https://medium.com/google-cloud/extracting-json-key-value-pairs-in-bigquery-1bb9d0ec0b6d

afleisc commented 2 months ago

@plaflamme Just contributed the json_extract_key_value_pairs function in #408 which should help with this use case, please let us know if there are any issues