Closed rviscomi closed 2 weeks ago
As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries.
My main concerns would be:
all
datastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.One possibility is to create a httparchive.all.latest_date
table with one date (e.g. 2022-09-01
) and then create the view using something like:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client
....
FROM
`httparchive.all.pages`
JOIN
`httparchive.all.latest_date`
USING (date)
WHERE
client = 'desktop'
);
And then could just update httparchive.all.latest_date
at the end of each month's run to signal that data is now ready, and all the views referencing that would automatically switch to the new date.
I ran this, and it seemed to work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
SELECT CAST('2022-08-01' AS DATE) AS date
)
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
JOIN
`httparchive.scratchspace.test_latest_date`
USING (date)
WHERE
client = 'desktop'
);
and I get these costs:
SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58MB
My main concerns would be:
- The wildcard issue
- The
all
datastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.
Could you clarify if the wildcard issue applies to the all
dataset? Unlike the dated tables like pages.2022_09_01
that need wildcards like pages.*
to process multiple dates, the all.pages
table is partitioned by date.
Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.
Could you clarify if the wildcard issue applies to the
all
dataset? Unlike the dated tables likepages.2022_09_01
that need wildcards likepages.*
to process multiple dates, theall.pages
table is partitioned by date.
No, the wildcard issue only applies if And httparchive.latest.summary_pages_desktop
and httparchive.latest.summary_pages_mobile
are views on the all
dataset and you want to run this:
SELECT
_TABLE_SUFFIX AS client,
col1
FROM
`httparchive.latest.summary_pages_*`
That will not work, as it's using a wildcard on two views.
But if you did this to query one table, it would work fine:
SELECT
col1
FROM
``httparchive.latest.summary_pages_desktop
Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.
Oh yeah keep forgetting this!
We'd still need to redefine the latest views each month as part of the batch after the data is loaded to look at the latest date.
I tried doing this, but it didn't work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
client = 'desktop' AND
date = (SELECT max(date) from `httparchive.all.pages`)
);
SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;
The last SELECT
complains of a missing required date param.
Where as hard coding the date (AND date = 2022-08-01') in the view definition, or joining to a
latest_date` table, allows the last SELECT to run fine.
The all
tables are clustered by client, so is there any reason not to have a single view for each legacy table type, for example latest.summary_pages
? The user could filter to a client with WHERE client = 'desktop'
if needed. Based on your other experiment, it seems like it should have the same performance gains as applying the filter on the clustered table itself.
Alternatively, users can UNION ALL
both desktop/mobile views together if needed.
The last
SELECT
complains of a missing required date param.
Try something like
SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'
But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS
to query the metadata directly:
SELECT
MAX(partition_id)
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__'
Alternatively, users can UNION ALL both desktop/mobile views together if needed.
Don't think that would work as need _TABLE_SUFFIX
to get the client. Could add client
column to latest views, but seems a little redundant and still would require the changes (change to UNION ALL and use client
column). Not the worst, but a change from what's there currently...
Try something like
SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'
Same error:
Query error: Cannot query over table 'httparchive.all.pages' without a filter over column(s) 'date' that can be used for partition elimination at [16:1]
But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:
I'm not sure how to use this info to create the view though?
This validates:
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(MAX(partition_id) AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'
date IS NOT NULL
seems to satisfy the validator
Not tested
Close! This works:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'
);
select client from `httparchive.scratchspace.test_latest_desktop2`;
Only 322MB processed!
And adding rank is even quicker:
select date from `httparchive.scratchspace.test_latest_desktop2` where rank = 1000;
REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3')
Nit/tip: use r''
to avoid escaping the capture groups r'\1-\2-\3'
And adding rank is even quicker:
So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest
tables/views.
In that case I don't see a reason to continue distinguishing between desktop/mobile in the
latest
tables/views.
Well that removes the wildcard issue!
Though it is a breaking change. But does anyone even use the latest
tables? I never do as prefer to be explicit. Guess we’ll find out when we make this change…
Yeah I filed this issue in response to a DM from a Googler trying to use one of the latest
tables. We could have some sort of deprecation period when we support old and new versions of the views, and announce the timeline on the forum/social.
Hi Rick and Barry, is there an update on enabling the .latest table view? Thank you.
@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works.
I have created three new views in the meantime:
httparchive.latest.pages
httparchive.latest.requests
httparchive.latest.lighthouse
These will automatically point to the latest month of data.
These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the client
column to restrict those) but hopefully should be easy to convert your queries to these new ones. Note there are some extra columns in these, which might make them easier (and cheaper and quicker!) to query if you use them.
We're still figuring our the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet
Thank you both!
I just checked that the views give pretty good estimation for queries with cluster filters. So the current solution is the best. Closing.
We have scheduled queries that generate the tables in the
httparchive.latest
dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitionedall
dataset to make this process more streamlined and maintenance-free?