Closed (max-ostapenko closed 1 week ago)
Here is one of the tests from the crawl: https://webpagetest.httparchive.org/result/241008_Mx1WF_EO3TC/1/technologies/
Looking at the raw test result, I don't see npm as a separate technology anywhere in the data:
detected: {
Page builders: "Webflow",
CMS: "Webflow",
Maps: "Leaflet 0.7.3",
CDN: "Cloudflare,Netlify,jsDelivr,Google Hosted Libraries,cdnjs",
Font scripts: "Typekit 1.21.0,Google Font API",
Comment systems: "Livefyre 0.7.3",
JavaScript libraries: "LazySizes,core-js 3.19.0,libphonenumber,jQuery 3.5.1,ClientJS 0.1.11,npm",
Performance: "LazySizes",
PaaS: "Netlify",
Security: "HSTS,Cloudflare Bot Management",
Analytics: "Google Analytics",
Browser fingerprinting: "ClientJS 0.1.11,npm",
Miscellaneous: "Open Graph"
},
detected_apps: {
Webflow: "",
Leaflet: "0.7.3",
Cloudflare: "",
Typekit: "1.21.0",
Livefyre: "0.7.3",
LazySizes: "",
core-js: "3.19.0",
Netlify: "",
libphonenumber: "",
jsDelivr: "",
jQuery: "3.5.1",
HSTS: "",
Google Hosted Libraries: "",
Google Font API: "",
Google Analytics: "",
cdnjs: "",
Cloudflare Bot Management: "",
ClientJS: "0.1.11,npm",
Open Graph: ""
},
It looks like whatever extracts the version number for ClientJS is pulling in a trailing ",npm" along with the version number, which is probably confusing whatever parses the comma-separated lists downstream.
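To illustrate the suspected failure mode, here is a minimal sketch of a comma-splitting parser. The `parse_category` function is hypothetical and is not the actual pipeline code; it only assumes that each category value is a comma-separated list of "name version" entries, as in the raw result above.

```python
# Hypothetical downstream parser (an assumption, not the real pipeline code):
# split a category string on commas, then treat the last space-separated
# token of each entry as a version if it starts with a digit.
def parse_category(value):
    apps = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, _, version = entry.rpartition(" ")
        if name and version[:1].isdigit():
            apps.append((name, version))
        else:
            # No version token, so the whole entry is the app name.
            apps.append((entry, ""))
    return apps


# The stray ",npm" in the ClientJS version becomes its own "app".
print(parse_category("ClientJS 0.1.11,npm"))
# [('ClientJS', '0.1.11'), ('npm', '')]
```

Under that assumption, the comma inside the version string is indistinguishable from the list separator, which would explain npm showing up as a technology in BQ.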
From BigQuery:
SELECT * FROM `httparchive.crawl_staging.pages` WHERE date = "2024-10-01" AND wptid = "241008_Mx1WF_EO3TC"
The technologies records are extracted here, directly from the HAR.
I'll take a quick look at the agent code to see if I can just remove any commas from the resulting wappalyzer detections to make sure it doesn't throw anything off.
OK, just pushed a "fix" to strip commas out of all of the application names and version strings as they come out of Wappalyzer, so there won't be any more parsing errors. It doesn't fix whatever is going wrong with the ClientJS detection, but at least it prevents unintended apps from being reported.
"_detected": {
"Page builders": "Webflow",
"CMS": "Webflow",
"Maps": "Leaflet 0.7.3",
"CDN": "Cloudflare,Netlify,jsDelivr,Google Hosted Libraries,cdnjs",
"Font scripts": "Typekit 1.21.0,Google Font API",
"Comment systems": "Livefyre 0.7.3",
"JavaScript libraries": "LazySizes,core-js 3.19.0,libphonenumber,jQuery 3.5.1,ClientJS 0.1.11npm",
"Performance": "LazySizes",
"PaaS": "Netlify",
"Security": "HSTS,Cloudflare Bot Management",
"Analytics": "Google Analytics",
"Browser fingerprinting": "ClientJS 0.1.11npm",
"Miscellaneous": "Open Graph"
},
"_detected_apps": {
"Webflow": "",
"Leaflet": "0.7.3",
"Cloudflare": "",
"Typekit": "1.21.0",
"Livefyre": "0.7.3",
"LazySizes": "",
"core-js": "3.19.0",
"Netlify": "",
"libphonenumber": "",
"jsDelivr": "",
"jQuery": "3.5.1",
"HSTS": "",
"Google Hosted Libraries": "",
"Google Font API": "",
"Google Analytics": "",
"cdnjs": "",
"Cloudflare Bot Management": "",
"ClientJS": "0.1.11npm",
"Open Graph": ""
},
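The comma-stripping approach described above can be sketched roughly as follows. This is not the actual agent code: `strip_commas`, `build_detected`, and the `(category, app, version)` tuple shape are all assumptions made for illustration.

```python
# A minimal sketch, assuming detections arrive as (category, app, version)
# tuples. The function names and input shape are hypothetical.
def strip_commas(text):
    return text.replace(",", "")


def build_detected(detections):
    detected = {}       # category -> list of "app version" labels
    detected_apps = {}  # app -> version
    for category, app, version in detections:
        # Remove commas first so they cannot collide with the list separator.
        app = strip_commas(app)
        version = strip_commas(version)
        detected_apps[app] = version
        label = f"{app} {version}".strip()
        detected.setdefault(category, []).append(label)
    joined = {cat: ",".join(labels) for cat, labels in detected.items()}
    return joined, detected_apps


detected, detected_apps = build_detected([
    ("JavaScript libraries", "jQuery", "3.5.1"),
    ("JavaScript libraries", "ClientJS", "0.1.11,npm"),
    ("Browser fingerprinting", "ClientJS", "0.1.11,npm"),
])
print(detected["Browser fingerprinting"])  # ClientJS 0.1.11npm
```

As in the output above, the version still reads "0.1.11npm", so the detection rule itself remains wrong; the sanitizing step only guarantees that a comma never reaches the joined list.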
Fix will be in the November crawl which should kick off tomorrow.
Thanks. There were 3 more instances related to this issue; I will fix the rules.
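For anyone hunting for similar cases, one possible way to surface them is to flag version strings that are not purely dotted numbers. This is only a sketch: the `suspicious_versions` helper is hypothetical, and the strict pattern will also flag legitimate versions with suffixes, so its output is a list for manual review rather than a list of confirmed bugs.

```python
import re

# Assumed heuristic: a "clean" version is digits separated by dots.
NUMERIC_VERSION = re.compile(r"\d+(?:\.\d+)*")


def suspicious_versions(detected_apps):
    """Return apps whose non-empty version is not a plain dotted number."""
    return {
        app: version
        for app, version in detected_apps.items()
        if version and not NUMERIC_VERSION.fullmatch(version)
    }


print(suspicious_versions({
    "jQuery": "3.5.1",
    "ClientJS": "0.1.11,npm",
    "Webflow": "",
}))
# {'ClientJS': '0.1.11,npm'}
```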
Reproduce
Technology is missing in WPT results, but present in BQ.
@pmeenan do you have an idea about the cause?
Expected
Only the technologies that are described in Wappalyzer should be present in BQ, and nothing else.