airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.56k stars 4.01k forks source link

BigQuery Struct destination nulls out fields in repeated structs #4487

Closed zestyping closed 3 years ago

zestyping commented 3 years ago

Environment

Current Behavior

In the Facebook Marketing data, there's a record that contains a field named targeting, whose value is a structure containing a field named exclusions, whose value is a structure containing a field named interests, whose value is an array of {"id": ..., "name": ...} structures.

I can confirm the data is present because it shows up when I use the original, "flattened" BigQuery destination. When I transfer the data using the flattened BigQuery destination, and then query for the record in BigQuery like this:

SELECT * FROM `facebook_raw.ads` WHERE id = '23847599448580278' ORDER BY _airbyte_emitted_at DESC LIMIT 1;

...BigQuery gives me this JSON result:

[
  {
    "id": "23847599448580278",
    "name": "Image4",
    "status": "PAUSED",
    "adlabels": [],
    "adset_id": "23847599448570278",
    "bid_info": null,
    "bid_type": "ABSOLUTE_OCPM",
    "creative": "{\"id\":\"23847605777030278\"}",
    "targeting": "{\"age_max\":35,\"age_min\":18,\"custom_audiences\":[{\"id\":\"23847589512650278\",\"name\":\"Lookalike (10%) - IG Profile Engagers\"},{\"id\":\"23847589512660278\",\"name\":\"Lookalike (10%) - FB Page Visitors\"}],\"exclusions\":{\"interests\":[{\"id\":\"6003012399881\",\"name\":\"Mark Levin\"},{\"id\":\"6003060009815\",\"name\":\"NRA's American Rifleman (Official)\"},{\"id\":\"6003068592093\",\"name\":\"Fox News Radio\"},{\"id\":\"6003089254597\",\"name\":\"The Mark Levin Show\"},{\"id\":\"6003108316584\",\"name\":\"Fox News Channel\"},{\"id\":\"6003265249086\",\"name\":\"Bill O'Reilly (political commentator)\"},{\"id\":\"6003270662648\",\"name\":\"Eric Trump\"},{\"id\":\"6003284135676\",\"name\":\"Rush Limbaugh\"},{\"id\":\"6003327648848\",\"name\":\"The Sean Hannity Show\"},{\"id\":\"6003331802697\",\"name\":\"Fox News Opinion\"},{\"id\":\"6003332337177\",\"name\":\"Fox News Sunday\"},{\"id\":\"6003354144727\",\"name\":\"Judicial Watch\"},{\"id\":\"6003393663563\",\"name\":\"Ivanka Trump\"},{\"id\":\"6003421411431\",\"name\":\"Tea party\"}]},\"geo_locations\":{\"countries\":[\"US\"],\"location_types\":[\"home\",\"recent\"]},\"locales\":[6],\"targeting_optimization\":\"none\",\"publisher_platforms\":[\"facebook\",\"instagram\"],\"facebook_positions\":[\"feed\",\"story\"],\"instagram_positions\":[\"stream\",\"story\"],\"device_platforms\":[\"mobile\",\"desktop\"]}",
    "account_id": "1211960785929066",
    "bid_amount": null,
    "campaign_id": "23847599448600278",
    "created_time": "2021-05-12T11:15:24-0400",
    "source_ad_id": "23847599387070278",
    "updated_time": "2021-06-16T17:49:15-0400",
    "tracking_specs": [
      "{\"action.type\":[\"post_engagement\"],\"page\":[\"107364421446709\"],\"post\":[\"131862565663561\"]}",
      "{\"action.type\":[\"offsite_conversion\"],\"fb_pixel\":[\"380363403227494\"]}"
    ],
    "recommendations": [
      "{\"title\":\"Ad isn't optimised for conversions\",\"message\":\"Your tracking pixel is active, but your ad isn't optimised for off-site conversions. You may get better results if you choose off-site conversions as your optimisation goal.\",\"code\":1942006,\"importance\":\"HIGH\",\"confidence\":\"HIGH\",\"blame_field\":\"tracking_specs\"}"
    ],
    "conversion_specs": [
      "{\"action.type\":[\"link_click\"],\"post\":[\"131862565663561\"],\"post.wall\":[\"107364421446709\"]}"
    ],
    "effective_status": "PAUSED",
    "last_updated_by_app_id": "119211728144504",
    "_airbyte_emitted_at": "2021-06-24 21:54:52 UTC",
    "_airbyte_ads_hashid": "e10832663f3bda00d7486722a78b3e01"
  }
]

Above you can see the contents of the targeting field represented as a JSON string, as you would expect from the normalized BigQuery destination.

I've repeated the transfer using the new BigQuery Struct destination, putting the structured data in a dataset named facebook_struct_raw. When I look at the results, the targeting.exclusion.interests array still contains many elements with the {"id": ..., "name": ...} shape, but they are full of nulls. If I do a similar query on the structured dataset:

SELECT * FROM `facebook_struct_raw.ads` WHERE id = '23847599448580278' ORDER BY _airbyte_emitted_at DESC LIMIT 1;

...BigQuery gives me this:

[
  {
    "id": "23847599448580278",
    "name": "Image4",
    "status": "PAUSED",
    "adlabels": null,
    "adset_id": "23847599448570278",
    "bid_info": null,
    "bid_type": "ABSOLUTE_OCPM",
    "creative": {
      "id": "23847605777030278",
      "creative_id": null
    },
    "targeting": {
      "moms": null,
      "income": null,
      "age_max": "35",
      "age_min": "18",
      "genders": null,
      "locales": {
        "value": [
          "6"
        ]
      },
      "user_os": null,
      "politics": null,
      "behaviors": null,
      "home_type": null,
      "interests": null,
      "net_worth": null,
      "exclusions": {
        "moms": null,
        "income": null,
        "politics": null,
        "behaviors": null,
        "home_type": null,
        "interests": {
          "value": [
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            }
          ]
        },
        "net_worth": null,
        "generation": null,
        "industries": null,
        "connections": null,
        "life_events": null,
        "home_ownership": null,
        "work_employers": null,
        "work_positions": null,
        "family_statuses": null,
        "user_adclusters": null,
        "custom_audiences": null,
        "education_majors": null,
        "excluded_connections": null,
        "household_composition": null,
        "relationship_statuses": null,
        "friends_of_connections": null,
        "excluded_custom_audiences": null
      },
      "generation": null,
      "industries": null,
      "connections": null,
      "office_type": null,
      "user_device": null,
      "flexible_spec": null,
      "geo_locations": {
        "zips": null,
        "cities": null,
        "regions": null,
        "countries": {
          "value": [
            "US"
          ]
        },
        "geo_markets": null,
        "country_groups": null,
        "location_types": {
          "value": [
            "home",
            "recent"
          ]
        },
        "custom_locations": null
      },
      "interested_in": null,
      "home_ownership": null,
      "work_positions": null,
      "family_statuses": null,
      "user_adclusters": null,
      "custom_audiences": {
        "value": [
          {
            "id": null,
            "name": null
          },
          {
            "id": null,
            "name": null
          }
        ]
      },
      "device_platforms": {
        "value": [
          "mobile",
          "desktop"
        ]
      },
      "app_install_state": null,
      "education_statuses": null,
      "facebook_positions": {
        "value": [
          "feed",
          "story"
        ]
      },
      "instagram_positions": {
        "value": [
          "stream",
          "story"
        ]
      },
      "messenger_positions": null,
      "publisher_platforms": {
        "value": [
          "facebook",
          "instagram"
        ]
      },
      "excluded_connections": null,
      "excluded_user_device": null,
      "relationship_statuses": null,
      "excluded_geo_locations": null,
      "friends_of_connections": null,
      "targeting_optimization": "none",
      "excluded_custom_audiences": null,
      "audience_network_positions": null,
      "excluded_publisher_categories": null
    },
    "account_id": "1211960785929066",
    "bid_amount": null,
    "campaign_id": "23847599448600278",
    "created_time": "2021-05-12T11:15:24-0400",
    "source_ad_id": "23847599387070278",
    "updated_time": "2021-06-16T17:49:15-0400",
    "tracking_specs": {
      "value": [
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        },
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        }
      ]
    },
    "recommendations": {
      "value": [
        {
          "code": null,
          "title": null,
          "message": null,
          "confidence": null,
          "importance": null,
          "blame_field": null
        }
      ]
    },
    "conversion_specs": {
      "value": [
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        }
      ]
    },
    "effective_status": "PAUSED",
    "last_updated_by_app_id": "119211728144504",
    "_airbyte_ab_id": "14fc0b01-9e47-4d96-b147-e4b9e15e8ea7",
    "_airbyte_emitted_at": "2021-07-01 18:57:20 UTC"
  }
]

So it looks like the structure is accurately preserved, but only the top-level fields contain values, and many (but not all!) of the rest contain nulls.

Expected Behavior

The data values should be preserved in the destination along with the original structure. In other words, the output from the above two queries should represent the same information.

If it helps to specify the correctness criterion in a precise way, I'd say:

Logs

logs-486-0.txt

Steps to Reproduce

  1. Set up a Facebook Marketing source.
  2. Create a connection from the source to a BigQuery flattened destination, and run a sync.
  3. Create a connection from the same source to a BigQuery structured destination, and run a sync.
  4. Compare the results by querying an individual record from each table in the BigQuery console (see example above) and selecting the JSON version of the output for easier comparison.

Are you willing to submit a PR?

I'm willing to help look into it, but I don't know where to start yet.

sherifnada commented 3 years ago

@ChristopheDuong to add any necessary descriptions/acceptance criteria then triage

ChristopheDuong commented 3 years ago

the correctness criterion as described by @zestyping:

From @zestyping's observations, it seems the conversion of the catalog is properly done and tables are well created by: https://github.com/airbytehq/airbyte/blob/937f85fc12648114fdf17461625bb1e53378e796/airbyte-integrations/connectors/destination-bigquery-denormalized/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDenormalizedDestination.java#L79

However, populating data into the table may not be done properly, so it should be verified if the formatting of the record data is properly done in: https://github.com/airbytehq/airbyte/blob/937f85fc12648114fdf17461625bb1e53378e796/airbyte-integrations/connectors/destination-bigquery-denormalized/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDenormalizedRecordConsumer.java#L85

DoNotPanicUA commented 3 years ago

Update. The current implementation of the BigQuery De-normalized destination doesn't cover complex objects properly. The main problem is that table creation can't properly identify Record structure. You can find warnings in the log like that: 2021-07-02 00:57:21 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-07-02 00:57:21 WARN i.a.i.d.b.BigQueryDenormalizedDestination(getTypes):157 - {} - Field home_type has no type defined, defaulting to STRING After wrongly created table, processing ignores all incoming values out of the created structure. Errors like that: 2021-07-02 00:57:31 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-07-02 00:57:31 WARN i.a.i.d.b.BigQueryDenormalizedRecordConsumer(lambda$formatData$0):92 - {} - Ignoring field action_type as it is not defined in catalog As summary, the destination requires significant rework. I will keep you posted about this process.

DoNotPanicUA commented 3 years ago

Update. There is a general issue with a destination that stores nested types. There are only a few sources that provide enough metadata for storing that data right. This problem will be fixed out of this issue scope. (Discussed with @sherifnada)

In the case of processing data from Facebook to BigQuery, we have a gap when we process Array of Objects.

sherifnada commented 3 years ago

@zestyping give the new version a shot and let us know how it goes!

zestyping commented 3 years ago

@sherifnada Awesome news. Thanks, will do!