HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP
Apache License 2.0
5 stars 0 forks source link

Convert the `latest` dataset to views #141

Closed rviscomi closed 2 weeks ago

rviscomi commented 2 years ago

We have scheduled queries that generate the tables in the httparchive.latest dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitioned all dataset to make this process more streamlined and maintenance-free?

tunetheweb commented 2 years ago

As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries.

My main concerns would be:

One possibility is to create a httparchive.all.latest_date table with one date (e.g. 2022-09-01) and then create the view using something like:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client
    ....
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.all.latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

And then could just update httparchive.all.latest_date at the end of each month's run to signal that data is now ready, and all the views referencing that would automatically switch to the new date.

I ran this, and it seemed to work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
  SELECT CAST('2022-08-01' AS DATE) AS date
)

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.scratchspace.test_latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

and I get these costs:

SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58MB
rviscomi commented 2 years ago

My main concerns would be:

  • The wildcard issue
  • The all datastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.

Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

tunetheweb commented 2 years ago

Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

No, the wildcard issue only applies if And httparchive.latest.summary_pages_desktop and httparchive.latest.summary_pages_mobile are views on the all dataset and you want to run this:

SELECT
  _TABLE_SUFFIX AS client,
  col1
FROM
  `httparchive.latest.summary_pages_*`

That will not work, as it's using a wildcard on two views.

But if you did this to query one table, it would work fine:

SELECT
  col1
FROM
  ``httparchive.latest.summary_pages_desktop

Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

Oh yeah keep forgetting this!

We'd still need to redefine the latest views each month as part of the batch after the data is loaded to look at the latest date.

I tried doing this, but it didn't work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    client = 'desktop' AND
    date = (SELECT max(date) from `httparchive.all.pages`)
);

SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;

The last SELECT complains of a missing required date param.

Where as hard coding the date (AND date = 2022-08-01') in the view definition, or joining to alatest_date` table, allows the last SELECT to run fine.

rviscomi commented 2 years ago

The all tables are clustered by client, so is there any reason not to have a single view for each legacy table type, for example latest.summary_pages? The user could filter to a client with WHERE client = 'desktop' if needed. Based on your other experiment, it seems like it should have the same performance gains as applying the filter on the clustered table itself.

Alternatively, users can UNION ALL both desktop/mobile views together if needed.

The last SELECT complains of a missing required date param.

Try something like

SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

SELECT
  MAX(partition_id)
FROM
  `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'pages' AND
  partition_id != '__NULL__'
tunetheweb commented 2 years ago

Alternatively, users can UNION ALL both desktop/mobile views together if needed.

Don't think that would work as need _TABLE_SUFFIX to get the client. Could add client column to latest views, but seems a little redundant and still would require the changes (change to UNION ALL and use client column). Not the worst, but a change from what's there currently...

Try something like

SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

Same error:

Query error: Cannot query over table 'httparchive.all.pages' without a filter over column(s) 'date' that can be used for partition elimination at [16:1]

But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

I'm not sure how to use this info to create the view though?

rviscomi commented 2 years ago

This validates:

  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(MAX(partition_id) AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'

date IS NOT NULL seems to satisfy the validator

Not tested

tunetheweb commented 2 years ago

Close! This works:


CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'
);

select client from `httparchive.scratchspace.test_latest_desktop2`;
image

Only 322MB processed!

tunetheweb commented 2 years ago

And adding rank is even quicker:

select date from `httparchive.scratchspace.test_latest_desktop2` where rank = 1000;
image
rviscomi commented 2 years ago

REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3')

Nit/tip: use r'' to avoid escaping the capture groups r'\1-\2-\3'

And adding rank is even quicker:

So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

tunetheweb commented 2 years ago

In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

Well that removes the wildcard issue!

Though it is a breaking change. But does anyone even use the latest tables? I never do as prefer to be explicit. Guess we’ll find out when we make this change…

rviscomi commented 2 years ago

Yeah I filed this issue in response to a DM from a Googler trying to use one of the latest tables. We could have some sort of deprecation period when we support old and new versions of the views, and announce the timeline on the forum/social.

romaincurutchet commented 2 years ago

Hi Rick and Barry, is there an update on enabling the .latest table view? Thank you.

rviscomi commented 2 years ago

@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works.

tunetheweb commented 2 years ago

I have created three new views in the meantime:

These will automatically point to the latest month of data.

These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the client column to restrict those) but hopefully should be easy to convert your queries to these new ones. Note there are some extra columns in these, which might make them easier (and cheaper and quicker!) to query if you use them.

We're still figuring our the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet

romaincurutchet commented 2 years ago

Thank you both!

max-ostapenko commented 2 weeks ago

I just checked that the views give pretty good estimation for queries with cluster filters. So the current solution is the best. Closing.