HTTPArchive / cwv-tech-report

Core Web Vitals Technology Report
https://cwvtech.report

Investigate missing adoption of Hostinger #24

Closed rviscomi closed 1 year ago

rviscomi commented 2 years ago

Hostinger exists as a supported technology in Wappalyzer, but we're not detecting any pages that use it.

Looking at the Wappalyzer source code, this technology seems to be using an unusual detection method (DNS):

https://github.com/wappalyzer/wappalyzer/blob/c4704aa76175a98d4212bcaa126ceb1473e51e8e/src/technologies/h.json#L888-L902

@pmeenan is that something that is supported by the WPT agent's driver code?

cc @ThierryA

pmeenan commented 2 years ago

No, the DNS code path isn't wired up in the agent (and doing it would probably be a fairly big change to the order the agent does things).

WPT does do a DNS pass for the base page though and logs the authoritative DNS for the origin:

base_page_dns_server: "any1.hostinger.com",

As well as a reverse-IP lookup on the origin IP (and any CNAME that the origin uses):

base_page_ip_ptr: "",
base_page_cname: "",

If you're looking for hosting information, that's probably the more reliable way to do it.

I can open a WPT agent issue to do the DNS work before running Wappalyzer, but I'm not sure it will fly: the two currently run in parallel, so sequencing them would slow down tests.

rviscomi commented 2 years ago

Thanks for looking into it. I suppose it can't hurt to open the issue on the WPT side to at least track the limitation and explore alternatives.

It's gross, but I suppose we could backstop some of these missing detections on the HA side using that host metadata. For example, in the Dataflow pipeline we could test each page against Wappalyzer's DNS rules and emulate the Wappalyzer detections in the HAR.
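A minimal sketch of what that backstop could look like, assuming the WPT host metadata fields quoted above are available per page. The rule table and its `{technology: {metadata_field: regex}}` shape are hypothetical; Wappalyzer's real DNS rules are keyed by record type and would need to be translated.

```python
import re

# Hypothetical rules keyed by WPT metadata field (an assumption, not
# Wappalyzer's actual rule format).
HOST_RULES = {
    "Hostinger": {"base_page_dns_server": r"(^|\.)hostinger\.com\.?$"},
}


def backstop_detections(page, rules=HOST_RULES):
    """Return technologies whose rules match the page's host metadata."""
    detected = []
    for tech, field_patterns in rules.items():
        for field, pattern in field_patterns.items():
            value = page.get(field) or ""
            if value and re.search(pattern, value, re.IGNORECASE):
                detected.append(tech)
                break  # one matching field is enough for this technology
    return detected


page = {
    "base_page_dns_server": "any1.hostinger.com",
    "base_page_ip_ptr": "",
    "base_page_cname": "",
}
print(backstop_detections(page))  # ['Hostinger']
```

Anchoring the pattern on a leading dot or start of string keeps a lookalike such as `nothostinger.com` from matching.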

pmeenan commented 2 years ago

I have most of the DNS logic implemented and hooked up, but I'm still not getting detections from Wappalyzer. I filed an issue to hopefully figure out if I'm holding it wrong.

pmeenan commented 2 years ago

This should be fixed with the next crawl; I just merged the change. Here is a sample test. The site uses Hostinger, Google Workspace and Amazon SES, which are detected through DNS SOA, MX and TXT records.

"_detected": {
  "Ecommerce": "Cart Functionality",
  "Programming languages": "PHP,Java",
  "UI frameworks": "Bootstrap 5",
  "PaaS": "Amazon Web Services",
  "JavaScript frameworks": "Vue.js 6995",
  "Analytics": "Pinterest Conversion Tag,Microsoft Clarity 0.6.36,Google Analytics,Google Ads Conversion Tracking,Facebook Pixel 2.9.66,Cloudflare Browser Insights",
  "RUM": "New Relic,Cloudflare Browser Insights",
  "JavaScript libraries": "core-js 3.6.5",
  "Reviews": "Trustpilot",
  "Advertising": "Microsoft Advertising",
  "Hosting": "Hostinger",
  "Webmail": "Google Workspace",
  "Email": "Google Workspace,Amazon SES",
  "Tag managers": "Google Tag Manager",
  "A\/B Testing": "Google Optimize",
  "CDN": "Google Hosted Libraries,Cloudflare",
  "Font scripts": "Google Font API"
},
"_detected_apps": {
  "Cart Functionality": "",
  "PHP": "",
  "Java": "",
  "Bootstrap": "5",
  "Amazon Web Services": "",
  "Vue.js": "6995",
  "Pinterest Conversion Tag": "",
  "New Relic": "",
  "core-js": "3.6.5",
  "Trustpilot": "",
  "Microsoft Clarity": "0.6.36",
  "Microsoft Advertising": "",
  "Hostinger": "",
  "Google Workspace": "",
  "Google Tag Manager": "",
  "Google Optimize": "",
  "Google Hosted Libraries": "",
  "Google Font API": "",
  "Google Analytics": "",
  "Google Ads Conversion Tracking": "",
  "Facebook Pixel": "2.9.66",
  "Cloudflare Browser Insights": "",
  "Cloudflare": "",
  "Amazon SES": ""
},

I also added the raw DNS records for the origin to the HAR:

"_origin_dns": {
    "cname": [
        "www.hostinger.com.cdn.cloudflare.net."
    ],
    "ns": [
        "any2.hostinger.com.",
        "any1.hostinger.com."
    ],
    "mx": [
        "1 aspmx.l.google.com.",
        "10 aspmx3.googlemail.com.",
        "10 aspmx2.googlemail.com.",
        "5 alt2.aspmx.l.google.com.",
        "5 alt1.aspmx.l.google.com."
    ],
    "txt": [
        "\"v=spf1 ip4:31.220.23.4 include:_spf.google.com include:amazonses.com include:_spf.hostedemail.com include:_spf.psm.knowbe4.com -all\"",
        "\"apple-domain-verification=IyFbOUpTx9DUOFwL\"",
        "\"mailru-verification: a8a9886e0072b036\"",
        "\"google-site-verification=4EfGmYRIEIPWA_ACJsA5zFGUzzY1pa8Du2tiHb8EKuI\"",
        "\"google-site-verification=MOjKs17dYrFXyEPndU4bK505my3D0dyC63-c5mvaNGU\"",
        "\"nordpass-domain-verification=6b627232b00e4e9ea70693c7994f2d50\""
    ],
    "soa": [
        "any1.hostinger.com. dns.hostinger.com. 2021102522 10800 3600 604800 3600"
    ]
},
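As a rough illustration of how the raw `_origin_dns` records could be consumed downstream, the sketch below parses the `mx` entries into sorted (priority, host) pairs and pulls the `include:` domains out of the SPF `txt` record. The helper names are hypothetical and this is not how Wappalyzer or the WPT agent processes the records.

```python
def parse_mx(mx_records):
    """Return (priority, host) tuples sorted by ascending priority."""
    parsed = []
    for record in mx_records:
        priority, host = record.split(maxsplit=1)
        parsed.append((int(priority), host.rstrip(".").lower()))
    return sorted(parsed)


def spf_includes(txt_records):
    """Return the domains named by include: mechanisms in SPF records."""
    domains = []
    for record in txt_records:
        text = record.strip('"')  # TXT values are stored with literal quotes
        if not text.lower().startswith("v=spf1"):
            continue
        for term in text.split():
            if term.lower().startswith("include:"):
                domains.append(term[len("include:"):])
    return domains


# Values copied from the sample above.
origin_dns = {
    "mx": [
        "1 aspmx.l.google.com.",
        "10 aspmx3.googlemail.com.",
        "5 alt2.aspmx.l.google.com.",
    ],
    "txt": [
        '"v=spf1 ip4:31.220.23.4 include:_spf.google.com include:amazonses.com '
        'include:_spf.hostedemail.com include:_spf.psm.knowbe4.com -all"',
        '"apple-domain-verification=IyFbOUpTx9DUOFwL"',
    ],
}

print(parse_mx(origin_dns["mx"])[0])    # (1, 'aspmx.l.google.com')
print(spf_includes(origin_dns["txt"]))
```

The lowest-priority MX host and the SPF includes are the kind of values that point at Google Workspace and Amazon SES in this sample.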

I'll leave this open until we can verify after the crawl, but DNS-based detections should be working now.

rviscomi commented 2 years ago

Great, thanks!

ton31337 commented 1 year ago

I work at Hostinger. A quick comment: you can tell whether a site is hosted on Hostinger from its HTTP response headers:

platform: hostinger or server: hcdn.

Related: https://github.com/wappalyzer/wappalyzer/pull/7186
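A minimal sketch of that header check, assuming the headers arrive as a plain name-to-value mapping and that the values match exactly after lowercasing. Real responses may carry extra tokens in `server`, so treat the exact-match comparison as an assumption. The function name is hypothetical.

```python
def is_hostinger(headers):
    """Return True if either header signal (platform or server) is present."""
    # Header names are case-insensitive, so normalize before comparing.
    normalized = {
        name.lower(): value.strip().lower() for name, value in headers.items()
    }
    return (
        normalized.get("platform") == "hostinger"
        or normalized.get("server") == "hcdn"
    )


print(is_hostinger({"Server": "hcdn"}))         # True
print(is_hostinger({"Platform": "hostinger"}))  # True
print(is_hostinger({"Server": "nginx"}))        # False
```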

rviscomi commented 1 year ago

Thanks @ton31337, once those changes land in Wappalyzer we should be able to automatically pick them up in our reporting.

To close out this thread, it looks like @pmeenan's change worked and we started seeing Hostinger data in the CWV Tech Report in August.

ton31337 commented 1 year ago

@rviscomi is it possible to extract the actual website addresses (URLs) for a specific technology? We would like to identify the slowest websites and do some performance analysis and improvements.

rviscomi commented 1 year ago

Yeah, it's possible using BigQuery. For example, here are the top 10 Hostinger sites with the slowest p75 TTFB:

DECLARE _YYYYMMDD DATE DEFAULT '2023-02-01';

WITH pages AS (
  SELECT DISTINCT
    root_page
  FROM
    `httparchive.all.pages`,
    UNNEST(technologies) AS t
  WHERE
    date = _YYYYMMDD AND
    t.technology = 'Hostinger'
),

crux AS (
  SELECT
    CONCAT(origin, '/') AS root_page,
    p75_ttfb
  FROM
    `chrome-ux-report.materialized.metrics_summary`
  WHERE
    date = _YYYYMMDD
)

SELECT
  root_page,
  p75_ttfb
FROM
  pages
JOIN
  crux
USING
  (root_page)
ORDER BY
  p75_ttfb DESC
LIMIT
  10

(12 GB processed)

Results:

root_page                               p75_ttfb (ms)
https://medpress.az/                    42300
https://aljens.info/                    29100
https://boletimdopaddock.com.br/        22400
https://pvst.com.br/                    20900
https://www.travelnthrill.com/          20300
https://mejora2.online/                 19200
https://www.delsolinmobiliaria.com.ar/  18600
https://nimt.in/                        17900
https://duplaimagemgastro.com.br/       17600
https://surveyandoffers.com/            17200
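The query joins the two datasets on `root_page`, which works because CrUX stores an origin without a trailing slash while the HTTP Archive page URL has one, hence the `CONCAT(origin, '/')`. A small in-memory sketch of the same join logic, using made-up `example` hosts and numbers rather than real data:

```python
def slowest_sites(pages, crux_rows, limit=10):
    """Join page URLs to CrUX rows on origin + '/' and rank by p75 TTFB."""
    # Mirror CONCAT(origin, '/') so the keys line up with root_page.
    ttfb_by_root = {row["origin"] + "/": row["p75_ttfb"] for row in crux_rows}
    # set() mirrors SELECT DISTINCT; the membership test mirrors the inner JOIN.
    joined = [(p, ttfb_by_root[p]) for p in set(pages) if p in ttfb_by_root]
    return sorted(joined, key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical inputs for illustration only.
pages = ["https://a.example/", "https://b.example/", "https://c.example/"]
crux_rows = [
    {"origin": "https://a.example", "p75_ttfb": 900},
    {"origin": "https://b.example", "p75_ttfb": 2400},
    {"origin": "https://d.example", "p75_ttfb": 5000},
]

print(slowest_sites(pages, crux_rows))
# [('https://b.example/', 2400), ('https://a.example/', 900)]
```

Pages without CrUX data (and CrUX origins without a matching page) drop out, just as they do in the inner join.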

I'd be really curious to hear if this leads you to identifying any actionable issues. Keep me posted!

ton31337 commented 1 year ago

Thanks @rviscomi!