HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP

Investigate missing Top 1k home pages #222

Open rviscomi opened 10 months ago

rviscomi commented 10 months ago

For some reason HA has no data for ~90 of the top 1k sites in CrUX:

https://allegro.pl/
https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://auth.uber.com/
https://betproexch.com/
https://blaze-1.com/
https://bollyflix.tax/
https://brainly.com.br/
https://brainly.in/
https://brainly.lat/
https://chance.enjoy.point.auone.jp/
https://cookpad.com/
https://detail.chiebukuro.yahoo.co.jp/
https://e-okul.meb.gov.tr/
https://filmyfly.club/
https://game.hiroba.dpoint.docomo.ne.jp/
https://gamewith.jp/
https://gdz.ru/
https://hdhub4u.markets/
https://hentailib.me/
https://holoo.fun/
https://ifilo.net/
https://indianhardtube.com/
https://login.caixa.gov.br/
https://m.autoplius.lt/
https://m.fmkorea.com/
https://m.happymh.com/
https://m.pgf-asw0zz.com/
https://m.porno365.pics/
https://m.skelbiu.lt/
https://mangalib.me/
https://mangalivre.net/
https://mnregaweb4.nic.in/
https://myaadhaar.uidai.gov.in/
https://myreadingmanga.info/
https://namu.wiki/
https://nhattruyenplus.com/
https://nhentai.net/
https://onlar.az/
https://page.auctions.yahoo.co.jp/
https://passbook.epfindia.gov.in/
https://pixbet.com/
https://pmkisan.gov.in/
https://quizlet.com/
https://schools.emaktab.uz/
https://schools.madrasati.sa/
https://scratch.mit.edu/
https://supjav.com/
https://tathya.uidai.gov.in/
https://uchi.ru/
https://v.daum.net/
https://vl2.xvideos98.pro/
https://vlxx.moe/
https://www.avto.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.cardmarket.com/
https://www.chegg.com/
https://www.cityheaven.net/
https://www.deviantart.com/
https://www.dns-shop.ru/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.hotstar.com/
https://www.idealista.com/
https://www.idealista.it/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.makemytrip.com/
https://www.mediaexpert.pl/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.nettruyenus.com/
https://www.ninisite.com/
https://www.nitrotype.com/
https://www.otvfoco.com.br/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.shahrekhabar.com/
https://www.si.com/
https://www.studocu.com/
https://www.thenetnaija.net/
https://www.varzesh3.com/
https://www.wannonce.com/
https://www.wayfair.com/
https://www.winzogames.com/
https://www.zillow.com/
https://znanija.com/
WITH ha AS (
  SELECT
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01' AND
    rank = 1000 AND
    is_root_page
),

crux AS (
  SELECT
    DISTINCT CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = '2023-09-01' AND
    rank = 1000
)

SELECT
  page
FROM
  crux
LEFT OUTER JOIN
  ha
USING
  (page)
WHERE
  ha.page IS NULL
ORDER BY
  page

This has been pretty consistent:

Row  date        top_1k
1    2023-01-01  918
2    2023-02-01  922
3    2023-03-01  910
4    2023-04-01  924
5    2023-05-01  916
6    2023-06-01  913
7    2023-07-01  908
8    2023-08-01  917
9    2023-09-01  910
10   2023-10-01  908
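For reference, here is a sketch of an aggregate version of the query above that should reproduce these counts, assuming (as in that query) that the crawl for a given month is matched against the CrUX rank list from the previous month:

-- Sketch: count how many of each month's CrUX top 1k origins were crawled.
WITH crux AS (
  SELECT DISTINCT
    -- Assumes the crawl for month M uses the CrUX rank list from month M-1.
    DATE_ADD(date, INTERVAL 1 MONTH) AS crawl_date,
    CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date BETWEEN '2022-12-01' AND '2023-09-01' AND
    rank = 1000
),

ha AS (
  SELECT
    date,
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date BETWEEN '2023-01-01' AND '2023-10-01' AND
    rank = 1000 AND
    is_root_page
)

SELECT
  crux.crawl_date AS date,
  -- DISTINCT guards against double counting desktop and mobile rows.
  COUNT(DISTINCT ha.page) AS top_1k
FROM
  crux
LEFT OUTER JOIN
  ha
ON
  ha.date = crux.crawl_date AND
  ha.page = crux.page
GROUP BY
  crawl_date
ORDER BY
  crawl_date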

And here are the top 1k home pages that have consistently been missing all year (202301–202309):

https://aquamanga.com/
https://auctions.yahoo.co.jp/
https://betproexch.com/
https://brainly.in/
https://chance.enjoy.point.auone.jp/
https://detail.chiebukuro.yahoo.co.jp/
https://game.hiroba.dpoint.docomo.ne.jp/
https://login.caixa.gov.br/
https://m.fmkorea.com/
https://m.happymh.com/
https://mangalib.me/
https://mangalivre.net/
https://myreadingmanga.info/
https://namu.wiki/
https://page.auctions.yahoo.co.jp/
https://pmkisan.gov.in/
https://quizlet.com/
https://scratch.mit.edu/
https://v.daum.net/
https://www.bartarinha.ir/
https://www.bestbuy.com/
https://www.betproexch.com/
https://www.deviantart.com/
https://www.fiverr.com/
https://www.fmkorea.com/
https://www.idealista.com/
https://www.justdial.com/
https://www.khabaronline.ir/
https://www.leboncoin.fr/
https://www.leroymerlin.fr/
https://www.milanuncios.com/
https://www.namasha.com/
https://www.ninisite.com/
https://www.ozon.ru/
https://www.realtor.com/
https://www.sahibinden.com/
https://www.wannonce.com/

Are the tests erroring out? Or are these sites blocking us?

tunetheweb commented 10 months ago

Just trying the first one (https://allegro.pl/), it also fails in the public WebPageTest with a 403: https://www.webpagetest.org/result/231113_AiDcFK_98G/1/details/#waterfall_view_step1

When I try it with curl, the response asks for JavaScript to be enabled and depends on a script loaded from https://ct.captcha-delivery.com/c.js

So I'd guess it's just being blocked.

max-ostapenko commented 6 hours ago

I looked into this for the September crawl, and the number of missing pages has increased to 20%.

There are other reasons besides a 403 response, such as redirects.

The debug information in the staging dataset would help us see expected vs. unexpected cases.

@pmeenan do we log the reasons for not collecting crawl data anywhere that we could JOIN against here?
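For illustration, if such a log were exposed as a table, the anti-join from the first query could be extended to break the missing pages down by reason. This is only a sketch: the `httparchive_staging.crawl_failures` table and its `failure_reason` column are purely hypothetical.

WITH ha AS (
  SELECT
    page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01' AND
    rank = 1000 AND
    is_root_page
),

crux AS (
  SELECT
    DISTINCT CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = '2023-09-01' AND
    rank = 1000
),

missing AS (
  SELECT
    crux.page
  FROM
    crux
  LEFT OUTER JOIN
    ha
  USING
    (page)
  WHERE
    ha.page IS NULL
)

SELECT
  -- Hypothetical column, e.g. '403', 'redirect', 'timeout'.
  failures.failure_reason,
  COUNT(*) AS pages
FROM
  missing
LEFT OUTER JOIN
  -- Hypothetical table; no such staging table is confirmed to exist.
  `httparchive_staging.crawl_failures` AS failures
USING
  (page)
GROUP BY
  failure_reason
ORDER BY
  pages DESC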

tunetheweb commented 6 hours ago

Are those redirect destinations also available in CrUX as their own pages?

max-ostapenko commented 6 hours ago

I've found https://www.clever.com/ in CrUX, but not the other one. So yeah, we're losing some pages here (maybe deduplicating in BQ post-crawl could be an alternative).
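As a rough first pass at that post-crawl deduplication idea, here is a sketch that flags missing CrUX origins whose www/non-www counterpart did get crawled, on the assumption that many of the redirects are simple www flips like clever.com above. It uses only the tables from the first query; real redirect data would be more reliable.

WITH crux AS (
  SELECT
    DISTINCT CONCAT(origin, '/') AS page
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = '2023-09-01' AND
    rank = 1000
),

-- No rank filter here: the crawled variant may be ranked differently
-- (or not at all), so this scans the whole month's crawl.
ha AS (
  SELECT
    DISTINCT page
  FROM
    `httparchive.all.pages`
  WHERE
    date = '2023-10-01' AND
    is_root_page
),

missing AS (
  SELECT
    crux.page
  FROM
    crux
  LEFT OUTER JOIN
    ha
  USING
    (page)
  WHERE
    ha.page IS NULL
)

SELECT
  missing.page AS crux_origin,
  ha.page AS crawled_variant
FROM
  missing
JOIN
  ha
ON
  -- Normalize both sides by stripping a leading www.
  REPLACE(missing.page, 'https://www.', 'https://') =
  REPLACE(ha.page, 'https://www.', 'https://')
ORDER BY
  crux_origin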

And could we also run the crawl in a headful browser? I believe it would fix a big part of the blocked pages.

tunetheweb commented 6 hours ago

Well, if it's a popular enough page then I would expect it to be in CrUX. It's weird that the pre-redirect one is in CrUX at all, but maybe they just moved to www this month? Or it's used for some other non-public reason (e.g. clever.com/intranet).

We do have WPTS in our user agent header, so we're easy to block for people that don't want crawlers/bots. We could remove that, but we'd rather be a good net citizen and be honest about this.

Another issue is that we only crawl from US data centres, which can affect things. For example, www.bbc.co.uk redirects to www.bbc.com for US visitors (and the latter is in CrUX separately anyway).
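As a quick sanity check, a sketch against the same CrUX table as the first query can confirm that both BBC origins are ranked separately:

SELECT
  origin,
  -- MIN collapses any per-device rows down to one rank per origin.
  MIN(rank) AS best_rank
FROM
  `chrome-ux-report.materialized.metrics_summary`
WHERE
  date = '2023-09-01' AND
  origin IN ('https://www.bbc.co.uk', 'https://www.bbc.com')
GROUP BY
  origin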

So I'm not sure moving to a headed browser would fix most of the things that are blocking us.

max-ostapenko commented 5 hours ago

You're right, the user agent is a more obvious signal than headless detection.

I'd still like to get a page-level report of crawl 'failures', so that we can have an overview of the reasons for the discrepancies instead of checking them one by one manually.