GoogleChromeLabs / wpp-research

Apache License 2.0
39 stars 5 forks source link

Query to access third-party render-blocking script #30

Open mukeshpanchal27 opened 1 year ago

mukeshpanchal27 commented 1 year ago

For the query use below references:

mukeshpanchal27 commented 1 year ago

Initial draft for the query.

SELECT
  percentile,
  client,
  host,
  APPROX_QUANTILES(requests, 1000)[
OFFSET
  (percentile * 10)] AS num_requests
FROM (
  SELECT
    client,
    page,
  IF
    (NET.HOST(requests.url) IN (
      SELECT
        domain
      FROM
        `httparchive.almanac.third_parties`
      WHERE
        date = '2022-06-01'
        AND category != 'hosting' ), 'third party', 'first party') AS host,
    COUNT(0) AS requests
  FROM
    `httparchive.almanac.requests` AS requests
  JOIN
    `httparchive.pages.2022_06_01_*` AS pages
  ON
    pages.url = requests.page
  WHERE
    date = '2022-06-01'
    AND type = 'script'
    AND JSON_EXTRACT(pages.payload, '$._detected_apps.WordPress') IS NOT NULL
    AND CAST(JSON_EXTRACT( pages.payload, '$._renderBlockingJS') AS INT64) > 0
  GROUP BY
    client,
    page,
    host),
  UNNEST([10, 25, 50, 75, 90, 100]) AS percentile
GROUP BY
  percentile,
  client,
  host
ORDER BY
  client,
  percentile,
  host

cc. @felixarntz

felixarntz commented 1 year ago

@mukeshpanchal27 It looks like this query is based on https://almanac.httparchive.org/en/2022/javascript#requests? Please provide such sources for reference :)

I'm not familiar with the nature of the CMS Almanac-specific tables httparchive.almanac.*, so I'm not sure we can rely on those here. Also, please review the dates, as parts of your query are for October 2022, while others are for June 2022, so that would definitely give us incorrect data.

Last but not least, I see you formatted the query via the BigQuery formatter, which generally is good, however it sometimes formats some parts in a weird way, so in such cases we should manually fix this. For example, this is unnecessarily hard to read:

  APPROX_QUANTILES(requests, 1000)[
OFFSET
  (percentile * 10)] AS num_requests

Would be simpler as:

  APPROX_QUANTILES(requests, 1000)[OFFSET(percentile * 10)] AS num_requests

Feel free to make those changes and open a draft pull request. Better to review in there, not in the issue.

mukeshpanchal27 commented 1 year ago

Thanks @felixarntz, for the comment.

The httparchive.almanac.third_parties has data till June 2022, so I have updated my query to use June 2022 data for httparchive.pages.2022_06_01_*

The query format was also corrected.

kt-12 commented 1 year ago

@felixarntz @mukeshpanchal27 there is a slight issue with this approach-

The goal here is to study third-party scripts that are render-blocking. Issue with initial query is:

Eg. Take a site where it has 2 third-party scripts in the footer. The theme's own JS file (call it theme.js) is render-blocking. Then AND CAST(JSON_EXTRACT( pages.payload, '$._renderBlockingJS') AS INT64) > 0 would return true, sub query will return 2 third party, 1 first party. Whereas, the expected output here was 1 first party, 0 third party as none of the 3p were render-blocking.

We can get both the information from the lighthouse report but traversing it is a bit expensive.

I have come up with a draft query below:

CREATE TEMP FUNCTION
  getCountRenderBlockingTPJS(lighthouseItems STRING,
    fcp FLOAT64,
    pageUrl STRING)
  RETURNS INT64
  LANGUAGE js AS '''
try {
const networkItems = JSON.parse(lighthouseItems);
let countRenderBlockingTPJS = 0
for(const item of networkItems ) {
  // network-items are arranges in order of start time.
  // render-blocking resources start before fcp
  // network can be very large, faster we break faster is the query
  if(item.startTime > fcp) break;
  if( item.resourceType == "Script" &&
      !item.url.startsWith(pageUrl) && // to determine third-party
      (item.priority == "VeryHigh" || item.priority == "High") //render-blocking
    )
  {
    countRenderBlockingTPJS++;
  }
}
return countRenderBlockingTPJS
} catch (e) {
  return -1;
}
''';
SELECT
  client,
  total_wp_sites,
  with_3pJS_render_blocking,
  (total_wp_sites-with_3pJS_render_blocking) AS without_3pJS_render_blocking,
  CONCAT(ROUND(with_3pJS_render_blocking*100/total_wp_sites, 3),' %') AS opportunity
FROM (
  SELECT
    lh._TABLE_SUFFIX AS client,
    COUNT(lh.url) AS total_wp_sites,
    COUNTIF( getCountRenderBlockingTPJS( JSON_EXTRACT(lh.report, '$.audits.network-requests.details.items'),
        CAST( JSON_EXTRACT_SCALAR(lh.report, '$.audits.first-contentful-paint.numericValue') AS FLOAT64 ),
        lh.url) > 0 ) AS with_3pJS_render_blocking
  FROM
    `httparchive.technologies.2022_12_01_*` AS tech
  JOIN
    `httparchive.lighthouse.2022_12_01_*` AS lh
  ON
    tech.url = lh.url
  WHERE
    app = 'WordPress'
    AND category = 'CMS'
    AND lh._TABLE_SUFFIX = tech._TABLE_SUFFIX
  GROUP BY
    lh._TABLE_SUFFIX )

Result obtained (it took ~20 mins for the result):

Row client total_wp_sites with_3pJS_render_blocking without_3pJS_render_blocking opportunity
1 desktop 4340020 1188953 3151067 27.395 %  
2 mobile 5662739 1563492 4099247 27.61 %