abcxyz / github-metrics-aggregator

Apache License 2.0
11 stars 3 forks source link

Unique Events View produces incorrect values #130

Closed lock14 closed 9 months ago

lock14 commented 10 months ago

TL;DR

The Unique Events view produces incorrect results. It counts events as unique when it shouldn't.

Expected behavior

Through testing it was confirmed that delivery_id is a unique identifier for each event (chat disucssion: https://chat.google.com/room/AAAARN8nfyk/1FpSGdRWuu8)

Thus the number of unique events should be equal to the number of unique delivery_id. This is not the case though.

Observed behavior

Running the following query show that the unique_events view over counts the events:

SELECT
  first_count, second_count, first_count = second_count counts_equal
FROM (
  SELECT
    COUNT(*) first_count
  FROM
    `github-metrics-aggreg-p-109ae8.github_metrics.unique_events`)
CROSS JOIN (
  SELECT
    COUNT(DISTINCT delivery_id) second_count
  FROM
    `github-metrics-aggreg-p-109ae8.github_metrics.events`);

Environment Details

This was in the production instance of GMA, project `github-metrics-aggreg-p-109ae8`

Additional information

The following query will correctly de-duplicate events based on the delivery_id

SELECT
  delivery_id,
  event,
  signature,
  received,
  payload
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY delivery_id ORDER BY received DESC) AS row_id,
    *
  FROM
    `github-metrics-aggreg-p-109ae8.github_metrics.events` )
WHERE row_id = 1;

This produces the correct number of events which can be verified with the following query:

SELECT
  first_count,
  second_count,
  first_count = second_count counts_equal
FROM (
  SELECT
    COUNT(*) first_count
  FROM (
    SELECT
      delivery_id,
      event,
      signature,
      received,
      payload
    FROM (
      SELECT
        ROW_NUMBER() OVER (PARTITION BY delivery_id ORDER BY received DESC) AS row_id,
        *
      FROM
        `github-metrics-aggreg-p-109ae8.github_metrics.events` )
    WHERE
      row_id = 1 ))
CROSS JOIN (
  SELECT
    COUNT(DISTINCT delivery_id) second_count
  FROM
    `github-metrics-aggreg-p-109ae8.github_metrics.events`);
lock14 commented 10 months ago

This bug will be affected by the work @pdewilde is currently doing to optimize the BigQuery queries. Fixing this bug should be put on hold until he is finished as it will affect how we want to go about getting unique events.