HTTPArchive / wptagent

Cross-platform WebPageTest agent

npm is not defined in wappalyzer but is present in technologies column under fingerprinting category #25

Closed max-ostapenko closed 1 week ago

max-ostapenko commented 1 week ago

Reproduce

  1. Find pages with a query:
SELECT
  page,
  technologies
FROM httparchive.crawl_staging.pages,
UNNEST (technologies) AS technology
WHERE
  date = '2024-10-01' AND
  technology.technology = 'npm'
  2. Run WPT. Example: https://webpagetest.httparchive.org/result/241111_4M_1/1/technologies/

The technology is missing from the WPT results but present in BQ.

@pmeenan do you have an idea about the cause?

Expected

Only technologies that are defined in Wappalyzer should be present in BQ, and nothing else.

pmeenan commented 1 week ago

Here is one of the tests from the crawl: https://webpagetest.httparchive.org/result/241008_Mx1WF_EO3TC/1/technologies/

Looking at the raw test result, I don't see npm as a separate app anywhere in the data:

detected: {
  Page builders: "Webflow",
  CMS: "Webflow",
  Maps: "Leaflet 0.7.3",
  CDN: "Cloudflare,Netlify,jsDelivr,Google Hosted Libraries,cdnjs",
  Font scripts: "Typekit 1.21.0,Google Font API",
  Comment systems: "Livefyre 0.7.3",
  JavaScript libraries: "LazySizes,core-js 3.19.0,libphonenumber,jQuery 3.5.1,ClientJS 0.1.11,npm",
  Performance: "LazySizes",
  PaaS: "Netlify",
  Security: "HSTS,Cloudflare Bot Management",
  Analytics: "Google Analytics",
  Browser fingerprinting: "ClientJS 0.1.11,npm",
  Miscellaneous: "Open Graph"
},
detected_apps: {
  Webflow: "",
  Leaflet: "0.7.3",
  Cloudflare: "",
  Typekit: "1.21.0",
  Livefyre: "0.7.3",
  LazySizes: "",
  core-js: "3.19.0",
  Netlify: "",
  libphonenumber: "",
  jsDelivr: "",
  jQuery: "3.5.1",
  HSTS: "",
  Google Hosted Libraries: "",
  Google Font API: "",
  Google Analytics: "",
  cdnjs: "",
  Cloudflare Bot Management: "",
  ClientJS: "0.1.11,npm",
  Open Graph: ""
},

It looks like whatever extracts the version number for ClientJS is pulling in a ",npm" along with the version number, which is probably confusing whatever does the parsing downstream.
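To illustrate how a stray comma inside a version string could surface as an extra technology, here is a minimal sketch. It assumes the downstream consumer splits each category string on commas and treats a trailing token that starts with a digit as the version. The function name `split_detected` is hypothetical and this is not the actual HTTP Archive pipeline code.

```python
def split_detected(value):
    """Split a comma-separated category string into {app name: version}.

    Hypothetical parser for illustration only: the last space-separated
    token is treated as a version if it starts with a digit.
    """
    apps = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, last = entry.rpartition(" ")
        if sep and last[:1].isdigit():
            apps[name] = last
        else:
            # No version token, so the whole entry is the app name.
            apps[entry] = ""
    return apps


# The "Browser fingerprinting" value from the raw test result above.
print(split_detected("ClientJS 0.1.11,npm"))
```

Under these assumptions the comma embedded in the ClientJS version makes "npm" come out as its own app with an empty version, which would match what shows up in the technologies column.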

From bigquery:

SELECT * FROM `httparchive.crawl_staging.pages` WHERE date = "2024-10-01" AND wptid = "241008_Mx1WF_EO3TC"

The technologies records are extracted here, directly from the HAR.

I'll take a quick look at the agent code to see if I can just remove any commas from the resulting wappalyzer detections to make sure it doesn't throw anything off.

pmeenan commented 1 week ago

ok, just pushed a "fix" to strip commas out of all of the application names and version strings as they come out of wappalyzer so there won't be any more parsing errors. Doesn't fix whatever is going wrong with the ClientJS detection but at least it prevents unintended apps.

"_detected": {
  "Page builders": "Webflow",
  "CMS": "Webflow",
  "Maps": "Leaflet 0.7.3",
  "CDN": "Cloudflare,Netlify,jsDelivr,Google Hosted Libraries,cdnjs",
  "Font scripts": "Typekit 1.21.0,Google Font API",
  "Comment systems": "Livefyre 0.7.3",
  "JavaScript libraries": "LazySizes,core-js 3.19.0,libphonenumber,jQuery 3.5.1,ClientJS 0.1.11npm",
  "Performance": "LazySizes",
  "PaaS": "Netlify",
  "Security": "HSTS,Cloudflare Bot Management",
  "Analytics": "Google Analytics",
  "Browser fingerprinting": "ClientJS 0.1.11npm",
  "Miscellaneous": "Open Graph"
},
"_detected_apps": {
  "Webflow": "",
  "Leaflet": "0.7.3",
  "Cloudflare": "",
  "Typekit": "1.21.0",
  "Livefyre": "0.7.3",
  "LazySizes": "",
  "core-js": "3.19.0",
  "Netlify": "",
  "libphonenumber": "",
  "jsDelivr": "",
  "jQuery": "3.5.1",
  "HSTS": "",
  "Google Hosted Libraries": "",
  "Google Font API": "",
  "Google Analytics": "",
  "cdnjs": "",
  "Cloudflare Bot Management": "",
  "ClientJS": "0.1.11npm",
  "Open Graph": ""
},

Fix will be in the November crawl which should kick off tomorrow.
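The comma stripping described in this comment can be sketched roughly as follows. This is an illustrative approximation, not the code that was pushed to the agent: the function name `strip_commas` and the dict-in, dict-out shape are assumptions.

```python
def strip_commas(detected_apps):
    """Return a copy with commas removed from every app name and version.

    Hypothetical helper: the real agent change may be structured differently.
    """
    return {
        name.replace(",", ""): version.replace(",", "")
        for name, version in detected_apps.items()
    }


raw = {"ClientJS": "0.1.11,npm", "jQuery": "3.5.1"}
print(strip_commas(raw))
```

As the comment notes, this only keeps the bad version string from being split into an unintended app. The ClientJS version still carries the "npm" suffix until the detection rule itself is fixed.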

max-ostapenko commented 1 week ago

Thanks. There were 3 more instances related to this issue, will fix the rules.