rockeynebhwani closed 2 years ago
Dunno why this happens but can tell you the outliers are rare enough they can pretty much be ignored:
SELECT DISTINCT
t1.category,
t1.app,
t1.total,
t2.category,
t2.app,
t2.total
FROM
(SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t1,
(SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t2
WHERE
REPLACE(t1.category, ' ', '') = REPLACE(t2.category, ' ', '') AND
REPLACE(t1.app, ' ', '') = REPLACE(t2.app, ' ', '') AND
(t1.category != t2.category OR t1.app != t2.app) AND
t1.total >= t2.total
ORDER BY
t1.category,
t1.app
category | app | total | category_1 | app_1 | total_1 |
---|---|---|---|---|---|
Advertising | Google AdSense | 810,985 | Advertising | GoogleAdSense | 1 |
Analytics | Baidu Analytics (百度统计) | 19,220 | Analytics | BaiduAnalytics (百度统计) | 1 |
Analytics | Google Analytics | 4,618,469 | Analytics | Google Analytics | 1 |
Analytics | Google Analytics | 4,618,469 | Analytics | GoogleAnalytics | 30 |
Analytics | GoogleAnalytics | 30 | Analytics | Google Analytics | 1 |
Analytics | Tencent Analytics (腾讯分析) | 236 | Analytics | TencentAnalytics(腾讯分析) | 1 |
CDN | Netlify | 10,958 | CDN | Netlify | 1 |
CMS | Adobe Experience Manager | 16,732 | CMS | AdobeExperience Manager | 1 |
CMS | TYPO3 CMS | 38,789 | CMS | TYPO3CMS | 1 |
Ecommerce | Cart Functionality | 871,039 | Ecommerce | CartFunctionality | 7 |
Ecommerce | Salesforce Commerce Cloud | 3,611 | Ecommerce | SalesforceCommerceCloud | 2 |
Ecommerce | SAP Commerce Cloud | 2,324 | Ecommerce | SAPCommerceCloud | 1 |
Font scripts | Font Awesome | 2,286,759 | Font scripts | FontAwesome | 3 |
Font scripts | FontAwesome 4.7.0 | 2 | Font scripts | Font Awesome 4.7.0 | 1 |
Font scripts | Google Font API | 3,292,664 | Font scripts | GoogleFontAPI | 9 |
JavaScript frameworks | Gatsby | 7,645 | JavaScript frameworks | Gatsby | 1 |
JavaScript frameworks | React | 327,194 | JavaScript frameworks | React | 1 |
JavaScript graphics | Raphael | 19,947 | JavaScript graphics | Raphael | 1 |
JavaScript libraries | jQuery Migrate | 1,610,580 | JavaScript libraries | jQueryMigrate | 2 |
JavaScript libraries | jQuery UI | 1,453,426 | JavaScript libraries | jQuery UI | 2 |
JavaScript libraries | Modernizr | 1,084,419 | JavaScript libraries | Modernizr | 1 |
Maps | Google Maps | 341,206 | Maps | GoogleMaps | 1 |
Miscellaneous | Google Code Prettify | 16,660 | Miscellaneous | GoogleCodePrettify | 2 |
Miscellaneous | Swiper Slider | 464,738 | Miscellaneous | SwiperSlider | 2 |
Miscellaneous | Twitter Emoji (Twemoji) | 1,634,230 | Miscellaneous | TwitterEmoji(Twemoji) | 1 |
Miscellaneous | webpack | 342,236 | Miscellaneous | webpack | 1 |
Operating systems | Windows Server | 524,703 | Operating systems | WindowsServer | 20 |
PaaS | Netlify | 10,958 | PaaS | Netlify | 1 |
PaaS | WP Engine | 75,925 | PaaS | WPEngine | 1 |
Static site generator | Gatsby | 7,645 | Static site generator | Gatsby | 1 |
Tag managers | Google Tag Manager | 2,590,588 | Tag managers | GoogleTagManager | 13 |
UI frameworks | animate.css | 497,980 | UI frameworks | animate.css | 1 |
UI frameworks | Bootstrap | 1,989,296 | UI frameworks | Bootstrap | 4 |
Video players | MediaElement.js | 250,379 | Video players | MediaElement.js | 1 |
Web frameworks | Microsoft ASP.NET | 460,820 | Web frameworks | MicrosoftASP.NET | 11 |
Web servers | Apache Tomcat | 22,123 | Web servers | ApacheTomcat | 2 |
Widgets | | 1,894,439 | Widgets | | 1 |
Widgets | OWL Carousel | 576,306 | Widgets | OWLCarousel | 4 |
Wikis | MediaWiki | 5,801 | Wikis | MediaWiki | 2 |
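The self-join above pairs names that become identical once spaces are removed. As a rough illustration only, the same check can be sketched in JavaScript over rows that have already been aggregated by category and app. The function name `findSpaceVariants` and the sample rows are hypothetical; the totals are copied from the table above.

```javascript
// Hypothetical helper: group (category, app) rows whose names match
// after removing spaces, and keep only groups with more than one spelling.
function findSpaceVariants(rows) {
  const groups = new Map();
  for (const row of rows) {
    // Normalize category and app separately, as the SQL does with REPLACE().
    const key = JSON.stringify([
      row.category.replace(/ /g, ''),
      row.app.replace(/ /g, ''),
    ]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  return [...groups.values()]
    .filter((variants) => variants.length > 1)
    // Most common spelling first, mirroring t1.total >= t2.total.
    .map((variants) => variants.sort((a, b) => b.total - a.total));
}

// Sample rows, assumed to be pre-aggregated like the subqueries above.
const sample = [
  { category: 'Analytics', app: 'Google Analytics', total: 4618469 },
  { category: 'Analytics', app: 'GoogleAnalytics', total: 30 },
  { category: 'CMS', app: 'TYPO3 CMS', total: 38789 },
];

const variants = findSpaceVariants(sample);
console.log(variants);
```

Grouping on a normalized key avoids the quadratic self-join, at the cost of assuming each (category, app) pair appears only once in the input.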
Thanks @barrypollard. Agree that it's small enough to be ignored. I have ignored it for now.
@pmeenan not urgent, but since you were looking at this code, are there any ideas on this one? Very small numbers, but odd that spaces are stripped very rarely. I remember looking at the time and saw the same in WPT for those URLs (but not on the Wappalyzer website), so I think it was a repeatable WPT issue. Meant to raise an issue but forgot until you jolted my memory!
@bazzadp If you can still repeat it, it would really help to have a repro case. I'm wondering if the Wappalyzer definitions were updated mid-crawl and the spaces were added.
Since then the whole wappalyzer engine was updated and changed so it will be more useful if we can see it in the May 2021 crawl.
https://jelly-pop.com/ still shows it. For example: https://webpagetest.org/jsonResult.php?test=210416_BiDcT0_20469058a349eb72f7a0548367c14bd3&pretty=1 has Google Analytics without a space:
Whereas https://almanac.httparchive.org/en/2020/ has Google Analytics with a space: https://webpagetest.org/jsonResult.php?test=210416_BiDcK5_5cd247abc1ef655159c773f30e457da1&pretty=1
That is bizarre. I wonder if the page itself is overriding some array or other ops because when I take the raw output from wappalyzer and run it through the same code not in the page, it keeps the spaces but if I run it on the console for that page then it strips them out (maybe a code page issue). At least I can reproduce it now though so it should be easier to fix
Ahh, looks like the pages override string.trim() and cause it to remove all of the whitespace. Since the Wappalyzer definitions don't have any trailing whitespace I can just remove the trim operations.
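For anyone curious how a page can cause this, here is a minimal sketch. It simulates a page that replaces `String.prototype.trim` with a version that removes all whitespace, then shows a helper that does not call the prototype method. The override and the `safeTrim` name are assumptions for illustration; the actual fix described above was simply removing the trim calls, and a page could in principle override `replace` too.

```javascript
// Simulate a page script that overrides trim() to strip ALL whitespace.
const originalTrim = String.prototype.trim;
String.prototype.trim = function () {
  return this.replace(/\s/g, '');
};

// Any code running in that page now loses the interior space.
const viaPageTrim = 'Google Analytics'.trim();

// Hypothetical helper that strips only leading and trailing whitespace
// without calling the (possibly overridden) trim method.
function safeTrim(value) {
  return String(value).replace(/^\s+|\s+$/g, '');
}
const viaSafeTrim = safeTrim('  Google Analytics ');

// Restore the built-in so the rest of the script behaves normally.
String.prototype.trim = originalTrim;

console.log(JSON.stringify(viaPageTrim), JSON.stringify(viaSafeTrim));
```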
Should be fixed now (well, over the next hour as the agent update rolls out).
Why, why would anyone do this? You get all sorts when you look at 7.5 million web pages...
Good work nailing it down.
Thanks for tracking that down and fixing @pmeenan!
Can we close this?
Was going to give it a quick check after May crawl and then close it. There was also another issue where the technologies were all messed up as discussed on the HttpArchive slack.
So I say let's leave this open as a reminder to check the technologies results after the May crawl. It's key data for the Web Almanac, so we want to make sure it's definitely sorted before our crawl month.
Confirmed as all fixed in May crawl. Same query above gives 0 results.
Reopening this to track a related issue.
I noticed that "GoDaddy Website Builder" no longer has any websites detected since February:
SELECT
_TABLE_SUFFIX AS suffix,
COUNT(DISTINCT url) AS urls
FROM
`httparchive.technologies.2021_*`
WHERE
app = 'GoDaddy Website Builder'
GROUP BY
suffix
ORDER BY
suffix
suffix | urls |
---|---|
01_01_desktop | 7525 |
01_01_mobile | 11006 |
02_01_desktop | 327 |
Interestingly, ismyhostfastyet.com is still able to detect GWB sites because it uses a signal from the HTTP headers and it's detecting 7k desktop pages as of May.
So I pulled out a URL from that detection and used the Wappalyzer extension for Chrome to get a true positive detection on https://finsbarandgrill.com/
However, in HA BQ and a plain WPT, we're only detecting Google Analytics:
SELECT
*
FROM
`httparchive.technologies.2021_05_01_desktop`
WHERE
url = 'https://finsbarandgrill.com/'
url | category | app | info |
---|---|---|---|
https://finsbarandgrill.com/ | Analytics | Google Analytics | "" |
https://webpagetest.org/jsonResult.php?test=210616_AiDcFQ_67699b36864cd25851abcfd97c53f792&pretty=1
"detected": {
"Analytics": "Google Analytics"
},
"detected_apps": {
"Google Analytics": ""
},
Wappalyzer uses a meta[name=generator] signal to detect GWB and the meta tag exists on the page as we'd expect:
"GoDaddy Website Builder": {
"cats": [
1
],
"cookies": {
"dps_site_id": ""
},
"icon": "godaddy.svg",
"meta": {
"generator": "Go Daddy Website Builder (.+)\\;version:\\1"
},
"pricing": [
"mid"
],
"saas": true,
"website": "https://www.godaddy.com/websites/website-builder"
},
// "Starfield Technologies; Go Daddy Website Builder 8.0.0000"
document.querySelector('meta[name=generator]').getAttribute('content')
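To make the expected match concrete, here is a simplified sketch of how that `generator` pattern could be applied to the meta tag content. It assumes the part before `\;` is a regular expression and that `version:\1` refers to its first capture group; `parsePattern` and `detectVersion` are hypothetical names, and this is not the Wappalyzer engine's own code.

```javascript
// Split a definition pattern into its regex and an optional version template.
function parsePattern(raw) {
  const [source, ...tags] = raw.split('\\;');
  const parsed = { regex: new RegExp(source, 'i'), version: '' };
  for (const tag of tags) {
    const idx = tag.indexOf(':');
    if (idx > 0 && tag.slice(0, idx) === 'version') {
      parsed.version = tag.slice(idx + 1);
    }
  }
  return parsed;
}

// Return the resolved version string, or null when the pattern does not match.
function detectVersion(raw, content) {
  const { regex, version } = parsePattern(raw);
  const match = regex.exec(content);
  if (!match) return null;
  // Replace \1, \2, ... with the corresponding capture groups.
  return version.replace(/\\(\d)/g, (_, n) => match[Number(n)] || '');
}

// Pattern as it appears in the definition (after JSON decoding).
const generatorPattern = 'Go Daddy Website Builder (.+)\\;version:\\1';
// Content of the meta[name=generator] tag quoted above.
const metaContent = 'Starfield Technologies; Go Daddy Website Builder 8.0.0000';

const detectedVersion = detectVersion(generatorPattern, metaContent);
console.log(detectedVersion);
```

Under those assumptions the tag content yields the version `8.0.0000`, so a missing detection points at the meta tags not reaching the matcher rather than at the pattern itself.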
@pmeenan this leads me to believe that there may be an integration bug with Wappalyzer in WPT. Would you be able to look into this?
The Wappalyzer checks weren't including the meta tags. Agent has been updated and will be rolling out over the next hour.
Tested it in dev here and it correctly caught Go Daddy.
"detected": {
"CMS": "GoDaddy Website Builder 8.0.0000",
"Analytics": "Google Analytics"
},
"detected_apps": {
"GoDaddy Website Builder": "8.0.0000",
"Google Analytics": ""
},
Thanks @pmeenan! I also noticed that the Wappalyzer extension detected React on that page, but it's not included in the new test results. Could something else be missing?
Possibly the serialized DOM. It serializes the HTML but not the DOM. Taking a look now.
Reached out to the Wappalyzer team to see how best to handle DOM-based detections. They are starting to migrate to it but the current engine doesn't directly support it and they have the extension doing the detections manually. Hoping there is a better way but should have something figured out soon.
Whew. That was somewhat more painful than I expected. Had to rewrite the JS variable detection part which changed pretty significantly when the engine changed a few months back (also added the support for the DOM detections).
Here is an updated test. Change is rolling out to prod (and HA) over the next hour.
"_detected": {
"CMS": "GoDaddy Website Builder 8.0.0000",
"JavaScript libraries": "React 16.13.1,Lodash 4.17.5",
"Analytics": "Google Analytics"
},
"_detected_apps": {
"GoDaddy Website Builder": "8.0.0000",
"React": "16.13.1",
"Lodash": "4.17.5",
"Google Analytics": ""
}
That's great, thanks for fixing!
Something to keep an eye on to verify that the fix is working. Here's a query to measure the change in origins from January to May for all technologies, using the table for the CWV Technology Report dashboard:
SELECT
app,
SAFE_DIVIDE(may.origins - jan.origins, jan.origins) AS pct_change,
may.origins - jan.origins AS num_change,
jan.origins AS jan_origins,
may.origins AS may_origins
FROM (
SELECT
date,
app,
origins
FROM
`httparchive.core_web_vitals.technologies`
WHERE
date = '2021-01-01' AND
client = 'mobile' AND
origins >= 1000) AS jan
JOIN (
SELECT
date,
app,
origins
FROM
`httparchive.core_web_vitals.technologies`
WHERE
date = '2021-05-01' AND
client = 'mobile') AS may
USING (app)
ORDER BY
pct_change ASC
Top 20 results:
app | pct_change | num_change | jan_origins | may_origins |
---|---|---|---|---|
Incapsula | -100% | -19,360 | 19,364 | 4 |
Google Code Prettify | -100% | -13,713 | 13,716 | 3 |
CKEditor | -100% | -20,169 | 20,174 | 5 |
Hugo | -100% | -2,732 | 2,733 | 1 |
Pardot | -100% | -13,050 | 13,056 | 6 |
Angular | -100% | -28,733 | 28,760 | 27 |
AlloyUI | -100% | -5,966 | 5,973 | 7 |
INFOnline | -100% | -2,555 | 2,559 | 4 |
Intercom | -100% | -16,677 | 16,706 | 29 |
MobX | -100% | -9,661 | 9,694 | 33 |
VideoJS | -100% | -51,541 | 51,730 | 189 |
SilverStripe | -100% | -2,843 | 2,856 | 13 |
Webtrends | -99% | -1,590 | 1,599 | 9 |
Kampyle | -99% | -1,352 | 1,360 | 8 |
Neto | -99% | -1,097 | 1,106 | 9 |
Twitter Emoji (Twemoji) | -99% | -1,245,734 | 1,260,312 | 14,578 |
Polymer | -95% | -1,019 | 1,069 | 50 |
Open Web Analytics | -95% | -5,521 | 5,811 | 290 |
Disqus | -95% | -26,703 | 28,121 | 1,418 |
Dojo | -95% | -37,429 | 39,498 | 2,069 |
The bug appears to be more widespread than I'd initially thought and some technologies like Angular were almost entirely wiped out.
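As a quick sanity check on how the table's columns are derived, here is a small sketch of the arithmetic. The helper names are hypothetical; `safeDivide` mimics BigQuery's `SAFE_DIVIDE` by returning `null` instead of failing on a zero denominator. The Angular figures come from the table above.

```javascript
// Mimic SAFE_DIVIDE: return null rather than dividing by zero.
function safeDivide(numerator, denominator) {
  return denominator === 0 ? null : numerator / denominator;
}

// Compute the absolute and relative change in origins between two crawls.
function originChange(janOrigins, mayOrigins) {
  const numChange = mayOrigins - janOrigins;
  return { numChange, pctChange: safeDivide(numChange, janOrigins) };
}

// Angular: 28,760 origins in January, 27 in May.
const angular = originChange(28760, 27);
console.log(angular.numChange, Math.round(angular.pctChange * 100) + '%');
```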
The June dataset should have the fix partially applied, so we should start to see these rebounding. Eventually everything should be fully counted in the July crawl.
@pmeenan WDYT about adding some kind of automated testing to ensure that the Wappalyzer integration is working? Not sure if that'd be implemented on the WPT or HTTP Archive side, or if it's something easy to build as a standalone WPT API app.
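One possible shape for such a check, sketched under assumptions: take the `detected_apps` object from a WPT JSON result for a known page and compare it against a hand-maintained list of technologies that page is known to use. The `missingDetections` name and the expected list are hypothetical; the two sample objects are copied from the results quoted earlier in this thread.

```javascript
// Return the expected technologies that are absent from a detected_apps object.
function missingDetections(detectedApps, expectedApps) {
  return expectedApps.filter(
    (app) => !Object.prototype.hasOwnProperty.call(detectedApps, app)
  );
}

// detected_apps for https://finsbarandgrill.com/ before the fixes.
const beforeFix = { 'Google Analytics': '' };

// detected_apps for the same page after the meta and DOM fixes.
const afterFix = {
  'GoDaddy Website Builder': '8.0.0000',
  React: '16.13.1',
  Lodash: '4.17.5',
  'Google Analytics': '',
};

// Hypothetical fixture: technologies we expect on this page.
const expected = ['GoDaddy Website Builder', 'React', 'Google Analytics'];

const missingBefore = missingDetections(beforeFix, expected);
const missingAfter = missingDetections(afterFix, expected);
console.log(missingBefore, missingAfter);
```

A check like this only catches regressions for pages in the fixture list, and live sites change over time, so the expected list would need occasional upkeep.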
Seeing detections for the 20 most affected technologies in the previous comment starting to recover in the June dataset. For example, here's a screenshot from the CWV Technology Report:
(Twemoji omitted because it's very popular and throws off the y-axis)
Given that the fix was applied late in the June crawl, detections haven't fully recovered, so I'll leave this issue open and continue to monitor this when the July crawl is available.
@rviscomi - Can we close this now?
Yes we can close this now. Tracking improvements to technology detections in https://github.com/HTTPArchive/wappalyzer/issues/70 instead.
Sometimes I am seeing entries which are not in Wappalyzer apps.json file - https://github.com/WPO-Foundation/Wappalyzer/blob/master/src/apps.json
For example, under eCommerce category, we have duplicate entries (With and without spaces).
You can see this in output of query
SELECT DISTINCT app FROM
`httparchive.technologies.2020_10_01_mobile` WHERE category = 'Ecommerce' ORDER BY app
Not sure why this is happening. This is resulting in slight over counting in some queries (For example - Total number of eCommerce platforms analyzed)
We should check why this is happening. Impact of this on 2020 chapter is minimal so I am not spending time to get to bottom of this for now and just raising an issue so that we can look into this later.
Also, if you look at the site https://jelly-pop.com/ in the technologies table, it shows the app as 'SalesforceCommerceCloud', but if you check the site with the Wappalyzer Chrome extension, this technology is not shown. Not sure why this is appearing in the technologies table.
Also, noticed this under 'Analytics' category and saw entries like -