Open mukeshpanchal27 opened 1 year ago
Initial draft for the query.
SELECT
percentile,
client,
host,
APPROX_QUANTILES(requests, 1000)[
OFFSET
(percentile * 10)] AS num_requests
FROM (
SELECT
client,
page,
IF
(NET.HOST(requests.url) IN (
SELECT
domain
FROM
`httparchive.almanac.third_parties`
WHERE
date = '2022-06-01'
AND category != 'hosting' ), 'third party', 'first party') AS host,
COUNT(0) AS requests
FROM
`httparchive.almanac.requests` AS requests
JOIN
`httparchive.pages.2022_06_01_*` AS pages
ON
pages.url = requests.page
WHERE
date = '2022-06-01'
AND type = 'script'
AND JSON_EXTRACT(pages.payload, '$._detected_apps.WordPress') IS NOT NULL
AND CAST(JSON_EXTRACT( pages.payload, '$._renderBlockingJS') AS INT64) > 0
GROUP BY
client,
page,
host),
UNNEST([10, 25, 50, 75, 90, 100]) AS percentile
GROUP BY
percentile,
client,
host
ORDER BY
client,
percentile,
host
cc. @felixarntz
@mukeshpanchal27 It looks like this query is based on https://almanac.httparchive.org/en/2022/javascript#requests? Please provide such sources for reference :)
I'm not familiar with the nature of the CMS Almanac-specific tables httparchive.almanac.*
, so I'm not sure we can rely on those here. Also, please review the dates, as parts of your query are for October 2022, while others are for June 2022, so that would definitely give us incorrect data.
Last but not least, I see you formatted the query via the BigQuery formatter, which generally is good, however it sometimes formats some parts in a weird way, so in such cases we should manually fix this. For example, this is unnecessarily hard to read:
APPROX_QUANTILES(requests, 1000)[
OFFSET
(percentile * 10)] AS num_requests
Would be simpler as:
APPROX_QUANTILES(requests, 1000)[OFFSET(percentile * 10)] AS num_requests
Feel free to make those changes and open a draft pull request. Better to review in there, not in the issue.
Thanks @felixarntz, for the comment.
The httparchive.almanac.third_parties
has data till June 2022, so I have updated my query to use June 2022 data for httparchive.pages.2022_06_01_*
The query format was also corrected.
@felixarntz @mukeshpanchal27 there is a slight issue with this approach-
The goal here is to study third-party scripts that are render-blocking. Issue with initial query is:
httparchive.almanac.third_parties
in combination with httparchive.almanac.requests
do provide us with information on third-party scripts. However it doesn't record the render-blocking nature of the scripthttparchive.pages.2022_06_01_*
gives us the count of render-blocking scripts, but we can't tell if the third-party script is one among them. Eg.
Take a site where it has 2 third-party scripts in the footer. The theme's own JS file (call it theme.js) is render-blocking.
Then AND CAST(JSON_EXTRACT( pages.payload, '$._renderBlockingJS') AS INT64) > 0 would return true, sub query will return 2 third party, 1 first party
. Whereas, the expected output here was 1 first party, 0 third party
as none of the 3p were render-blocking.
We can get both the information from the lighthouse report but traversing it is a bit expensive.
I have come up with a draft query below:
CREATE TEMP FUNCTION
getCountRenderBlockingTPJS(lighthouseItems STRING,
fcp FLOAT64,
pageUrl STRING)
RETURNS INT64
LANGUAGE js AS '''
try {
const networkItems = JSON.parse(lighthouseItems);
let countRenderBlockingTPJS = 0
for(const item of networkItems ) {
// network-items are arranges in order of start time.
// render-blocking resources start before fcp
// network can be very large, faster we break faster is the query
if(item.startTime > fcp) break;
if( item.resourceType == "Script" &&
!item.url.startsWith(pageUrl) && // to determine third-party
(item.priority == "VeryHigh" || item.priority == "High") //render-blocking
)
{
countRenderBlockingTPJS++;
}
}
return countRenderBlockingTPJS
} catch (e) {
return -1;
}
''';
SELECT
client,
total_wp_sites,
with_3pJS_render_blocking,
(total_wp_sites-with_3pJS_render_blocking) AS without_3pJS_render_blocking,
CONCAT(ROUND(with_3pJS_render_blocking*100/total_wp_sites, 3),' %') AS opportunity
FROM (
SELECT
lh._TABLE_SUFFIX AS client,
COUNT(lh.url) AS total_wp_sites,
COUNTIF( getCountRenderBlockingTPJS( JSON_EXTRACT(lh.report, '$.audits.network-requests.details.items'),
CAST( JSON_EXTRACT_SCALAR(lh.report, '$.audits.first-contentful-paint.numericValue') AS FLOAT64 ),
lh.url) > 0 ) AS with_3pJS_render_blocking
FROM
`httparchive.technologies.2022_12_01_*` AS tech
JOIN
`httparchive.lighthouse.2022_12_01_*` AS lh
ON
tech.url = lh.url
WHERE
app = 'WordPress'
AND category = 'CMS'
AND lh._TABLE_SUFFIX = tech._TABLE_SUFFIX
GROUP BY
lh._TABLE_SUFFIX )
Result obtained (it took ~20 mins for the result):
Row | client | total_wp_sites | with_3pJS_render_blocking | without_3pJS_render_blocking | opportunity | |
---|---|---|---|---|---|---|
1 | desktop | 4340020 | 1188953 | 3151067 | 27.395 % | |
2 | mobile | 5662739 | 1563492 | 4099247 | 27.61 % |
For the query use below references: