HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP
Apache License 2.0

The new schema and cost concerns for users #149

Open tunetheweb opened 1 year ago

tunetheweb commented 1 year ago

So we have a new schema in the all dataset, which basically has two tables:

Previously we had separate schemas for each data type (pages, requests, lighthouse, response bodies, technologies), and also summary tables (summary_pages, summary_requests).

There is a LOT to like about the new schema, including:

And there are some good cost benefits:

So we want to migrate people to the new schema as, in general, it's easier to use, and costs less (in dollars AND time).

However, I have one concern with the new schema and everything being in one table, as opposed to being split out as before: it makes the (VERY!) expensive payload, lighthouse and response_bodies data much easier to query accidentally. I don't think the defaults below are good ones:

[screenshots: BigQuery's estimated bytes processed for SELECT * queries on these tables]

Coming from a more traditional RDBMS background, it's quite common in my experience to run a SELECT * on a table to view it, and I worry that BigQuery newbies could do this and end up with HUGE bills. BigQuery does have the Schema and Preview options on tables, which are much better than using SELECT *, but, as I say, not everyone is an experienced BigQuery user.

We can (and have) insisted on a partition field (date), but BigQuery does not allow us to require that columns be listed explicitly, so we cannot prevent people from running the sort of SELECT * queries above.
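As a back-of-the-envelope illustration of the risk (a Python sketch assuming BigQuery's on-demand pricing model, historically around $5 per TiB scanned; the table sizes here are made-up, not actual HTTP Archive figures):

```python
# Back-of-envelope cost of a full-table scan under BigQuery on-demand
# pricing. The $5/TiB rate and the table sizes are illustrative
# assumptions, not actual HTTP Archive or current GCP figures.

PRICE_PER_TIB_USD = 5.0  # assumed on-demand rate


def scan_cost_usd(bytes_scanned: int) -> float:
    """Cost of scanning `bytes_scanned` bytes at the assumed rate."""
    tib = bytes_scanned / 1024**4
    return tib * PRICE_PER_TIB_USD


# A hypothetical multi-terabyte table dominated by payload columns:
full_table_bytes = 300 * 1024**4  # SELECT * scans every column
pruned_bytes = 2 * 1024**4        # naming only the small columns

print(f"SELECT *      : ${scan_cost_usd(full_table_bytes):,.2f}")
print(f"named columns : ${scan_cost_usd(pruned_bytes):,.2f}")
```

Because BigQuery is columnar and bills only for the columns a query references, leaving the huge payload columns out of the select list (or out of the view entirely) is what makes the difference.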

We have a number of options to address this concern of mine:

The views can be created once and will automatically update, so I don't think maintenance is an issue.

For example, in the latest dataset we currently have three new views that automatically look at only the latest month's data and a subset of the columns:

CREATE OR REPLACE VIEW `httparchive.latest.pages` AS (
  SELECT
    * EXCEPT (lighthouse)
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
);
CREATE OR REPLACE VIEW `httparchive.latest.lighthouse` AS (
  SELECT
    date,
    client,
    page,
    is_root_page,
    root_page,
    rank,
    wptid,
    lighthouse AS report,
    metadata
  FROM
    `httparchive.all.pages`
  WHERE
    -- latest date only (this part not shown for brevity)
);
CREATE OR REPLACE VIEW `httparchive.latest.requests` AS (
  SELECT
    * EXCEPT (response_body)
  FROM
    `httparchive.all.requests`
  WHERE
    -- latest date only (this part not shown for brevity)
);

I'd be interested to hear views on this (@rviscomi I know you have some, as we've discussed), and whether we need to do anything here.

rviscomi commented 1 year ago

Let's use this issue to explore ways to minimize costs related to the summary field of the pages/requests datasets. Some ideas:

tunetheweb commented 1 year ago

Important to note that there are two summary fields: one for pages, one for requests.

pages summary column

Remove these columns, as they're covered by other columns (e.g. metadata):

Remove these, as they don't seem to be used anymore and are probably better covered by custom metrics:

There's an argument for removing more, but I think those are easy, non-controversial wins that will at least halve the size of this column.

I also wonder if the features and technologies should be JSON rather than arrays? Selecting two of these (e.g. websites that use WordPress AND jQuery) is more painful with arrays (it requires a join AFAIK?). But on the other hand, unnesting with JSON requires a JavaScript function AFAIK, so there are pluses and minuses. However, it does seem a little inconsistent to use JSON in some places and arrays in others, unless we have a good reason?

requests summary column

Remove these, as they're unlikely to be used and can be retrieved from payload if really needed:

Remove these, as they're covered by request_headers or response_headers:

Similar to my comment above, I wonder if we should use JSON rather than arrays for response_headers and request_headers?

rviscomi commented 1 year ago

> I also wonder if the features and technologies should be JSON rather than array? Selecting two of these (e.g. website that use Wordpress AND jQuery) is more painful with arrays (requires a join AFAIK?). But on the other hand unnesting with JSON requires a JavaScript function AFAIK so pluses and minuses. However does seem a little inconsistent to use JSON in some places and arrays in others, unless we have a good reason?

Might not be the best solution, but the WordPress/jQuery example is possible without joins:

WITH pages AS (
  SELECT
    page,
    ARRAY_AGG(t.technology) AS technologies
  FROM
    `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT),
    UNNEST(technologies) AS t
  WHERE
    date = '2022-10-01' AND
    client = 'mobile'
  GROUP BY
    page
)

SELECT
  page
FROM
  pages
WHERE
  'WordPress' IN UNNEST(technologies) AND
  'jQuery' IN UNNEST(technologies)

Also possible to process a JSON-encoded array of technologies without a UDF:

WITH json AS (
  SELECT
    page,
    TO_JSON(technologies) AS technologies
  FROM
    `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
  WHERE
    date = '2022-10-01' AND
    client = 'mobile'
),

pages AS (
  SELECT
    page,
    ARRAY_AGG(JSON_VALUE(t, '$.technology')) AS technologies
  FROM
    json,
    UNNEST(JSON_QUERY_ARRAY(technologies, '$')) AS t
  GROUP BY
    page
)

SELECT
  page
FROM
  pages
WHERE
  'WordPress' IN UNNEST(technologies) AND
  'jQuery' IN UNNEST(technologies)

IMO using arrays where possible is more semantic and avoids unnecessary decoding steps. There might be efficiency benefits too, but that's not my main motivation. I do think it improves the QX (query experience) in the general case: querying a single technology at a time, or querying all technologies.
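The tradeoff can be mirrored outside SQL. With a native array, a membership check is direct, while a JSON-encoded column needs a decode step first. This is a Python analogy of the two query shapes above, not BigQuery semantics; the sample values are made up:

```python
import json

# Native array: membership test is direct, like 'X' IN UNNEST(technologies).
technologies = ["WordPress", "jQuery", "PHP"]
uses_both = "WordPress" in technologies and "jQuery" in technologies

# JSON-encoded: must decode first, mirroring the extra
# TO_JSON/JSON_QUERY_ARRAY/JSON_VALUE step arrays avoid.
technologies_json = '[{"technology": "WordPress"}, {"technology": "jQuery"}]'
decoded = [t["technology"] for t in json.loads(technologies_json)]
uses_both_json = "WordPress" in decoded and "jQuery" in decoded

print(uses_both, uses_both_json)  # both True
```

Either shape answers the question; the array version simply has one fewer layer to peel off before the membership test.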

tunetheweb commented 1 year ago

Btw, another (admittedly small) grievance with arrays is that they make the preview more difficult to use, as you can't easily scroll down (e.g. I want a Lighthouse payload, but the first one doesn't have what I want, and scrolling down is painful due to arrays of "useless" data). I've noticed this sort of thing more and more as I use this, and it's kinda annoying. Any easy workaround I'm missing?

I tend to use AND rank = 1000 instead of preview because of this.

max-ostapenko commented 1 week ago

Great insights @tunetheweb! If that's actually not used, then we'll be able to go down to:

Here is an estimation query:

CREATE TEMP FUNCTION `prune_object`(
  json_str STRING,
  keys_to_remove ARRAY<STRING>
) RETURNS STRING
LANGUAGE js AS """
  try {
    var jsonObject = JSON.parse(json_str);
    keys_to_remove.forEach(function(key) {
      delete jsonObject[key];
    });
    return JSON.stringify(jsonObject);
  } catch (e) {
    return json_str;
  }
""";

SELECT
  -- exact byte sizes of the original vs pruned summary
  SUM(BYTE_LENGTH(summary))        / 1024 / 1024 / 1024 AS summary_gb,
  SUM(BYTE_LENGTH(summary_pruned)) / 1024 / 1024 / 1024 AS summary_pruned_gb,
  SUM(BYTE_LENGTH(summary_pruned)) / SUM(BYTE_LENGTH(summary)) AS share
FROM (
  SELECT
    summary,
    prune_object(
      summary,
      ["metadata", "pageid", "createDate", "startedDateTime", "archive", "label", "crawlid", "url", "urlhash", "urlShort", "wptid", "wptrun", "rank", "PageSpeed", "_adult_site", "avg_dom_depth", "doctype", "document_height", "document_width", "localstorage_size", "sessionstorage_size", "meta_viewport", "num_iframes", "num_scripts", "num_scripts_sync", "num_scripts_async", "usertiming"]) as summary_pruned
  FROM `all.pages` TABLESAMPLE SYSTEM (5 PERCENT)
  WHERE date = '2024-08-01'
)