HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Wappalyzer Technologies table has unexpected entries #1843

Closed rockeynebhwani closed 2 years ago

rockeynebhwani commented 3 years ago

Sometimes I am seeing entries which are not in Wappalyzer apps.json file - https://github.com/WPO-Foundation/Wappalyzer/blob/master/src/apps.json

For example, under eCommerce category, we have duplicate entries (With and without spaces).

You can see this in output of query

SELECT DISTINCT app FROM `httparchive.technologies.2020_10_01_mobile` WHERE category = 'Ecommerce' ORDER BY app

Not sure why this is happening. It results in slight overcounting in some queries (for example, the total number of eCommerce platforms analyzed).

We should check why this is happening. The impact on the 2020 chapter is minimal, so I am not spending time getting to the bottom of it for now; I am just raising an issue so that we can look into it later.

Also, for the site https://jelly-pop.com/, the technologies table shows the app as 'SalesforceCommerceCloud', but the Wappalyzer Chrome extension does not show this technology. Not sure why it appears in the technologies table.

Also, noticed this under 'Analytics' category and saw entries like -

tunetheweb commented 3 years ago

Dunno why this happens but can tell you the outliers are rare enough they can pretty much be ignored:

SELECT DISTINCT
  t1.category,
  t1.app,
  t1.total,
  t2.category,
  t2.app,
  t2.total
FROM
   (SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t1,
   (SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t2
WHERE
  REPLACE(t1.category, ' ', '') = REPLACE(t2.category, ' ', '') AND
  REPLACE(t1.app, ' ', '') = REPLACE(t2.app, ' ', '') AND
  (t1.category != t2.category OR t1.app != t2.app) AND
  t1.total >= t2.total
ORDER BY
  t1.category,
  t1.app

| category | app | total | category_1 | app_1 | total_1 |
| --- | --- | --- | --- | --- | --- |
| Advertising | Google AdSense | 810,985 | Advertising | GoogleAdSense | 1 |
| Analytics | Baidu Analytics (百度统计) | 19,220 | Analytics | BaiduAnalytics (百度统计) | 1 |
| Analytics | Google Analytics | 4,618,469 | Analytics | Google Analytics | 1 |
| Analytics | Google Analytics | 4,618,469 | Analytics | GoogleAnalytics | 30 |
| Analytics | GoogleAnalytics | 30 | Analytics | Google Analytics | 1 |
| Analytics | Tencent Analytics (腾讯分析) | 236 | Analytics | TencentAnalytics(腾讯分析) | 1 |
| CDN | Netlify | 10,958 | CDN | Netlify | 1 |
| CMS | Adobe Experience Manager | 16,732 | CMS | AdobeExperience Manager | 1 |
| CMS | TYPO3 CMS | 38,789 | CMS | TYPO3CMS | 1 |
| Ecommerce | Cart Functionality | 871,039 | Ecommerce | CartFunctionality | 7 |
| Ecommerce | Salesforce Commerce Cloud | 3,611 | Ecommerce | SalesforceCommerceCloud | 2 |
| Ecommerce | SAP Commerce Cloud | 2,324 | Ecommerce | SAPCommerceCloud | 1 |
| Font scripts | Font Awesome | 2,286,759 | Font scripts | FontAwesome | 3 |
| Font scripts | FontAwesome 4.7.0 | 2 | Font scripts | Font Awesome  4.7.0 | 1 |
| Font scripts | Google Font API | 3,292,664 | Font scripts | GoogleFontAPI | 9 |
| JavaScript frameworks | Gatsby | 7,645 | JavaScript frameworks | Gatsby | 1 |
| JavaScript frameworks | React | 327,194 | JavaScript frameworks | React | 1 |
| JavaScript graphics | Raphael | 19,947 | JavaScript graphics | Raphael | 1 |
| JavaScript libraries | jQuery Migrate | 1,610,580 | JavaScript libraries | jQueryMigrate | 2 |
| JavaScript libraries | jQuery UI | 1,453,426 | JavaScript libraries | jQuery UI | 2 |
| JavaScript libraries | Modernizr | 1,084,419 | JavaScript libraries | Modernizr | 1 |
| Maps | Google Maps | 341,206 | Maps | GoogleMaps | 1 |
| Miscellaneous | Google Code Prettify | 16,660 | Miscellaneous | GoogleCodePrettify | 2 |
| Miscellaneous | Swiper Slider | 464,738 | Miscellaneous | SwiperSlider | 2 |
| Miscellaneous | Twitter Emoji (Twemoji) | 1,634,230 | Miscellaneous | TwitterEmoji(Twemoji) | 1 |
| Miscellaneous | webpack | 342,236 | Miscellaneous | webpack | 1 |
| Operating systems | Windows Server | 524,703 | Operating systems | WindowsServer | 20 |
| PaaS | Netlify | 10,958 | PaaS | Netlify | 1 |
| PaaS | WP Engine | 75,925 | PaaS | WPEngine | 1 |
| Static site generator | Gatsby | 7,645 | Static site generator | Gatsby | 1 |
| Tag managers | Google Tag Manager | 2,590,588 | Tag managers | GoogleTagManager | 13 |
| UI frameworks | animate.css | 497,980 | UI frameworks | animate.css | 1 |
| UI frameworks | Bootstrap | 1,989,296 | UI frameworks | Bootstrap | 4 |
| Video players | MediaElement.js | 250,379 | Video players | MediaElement.js | 1 |
| Web frameworks | Microsoft ASP.NET | 460,820 | Web frameworks | MicrosoftASP.NET | 11 |
| Web servers | Apache Tomcat | 22,123 | Web servers | ApacheTomcat | 2 |
| Widgets | Facebook | 1,894,439 | Widgets | Facebook | 1 |
| Widgets | OWL Carousel | 576,306 | Widgets | OWLCarousel | 4 |
| Wikis | MediaWiki | 5,801 | Wikis | MediaWiki | 2 |

rockeynebhwani commented 3 years ago

Thanks @barrypollard. Agreed that it's small enough to be ignored. I have ignored it for now.

tunetheweb commented 3 years ago

@pmeenan not urgent, but since you were looking at this code, any ideas on this one? Very small numbers, but odd that spaces are stripped very rarely. I remember looking at the time and seeing the same in WPT for those URLs (but not on the Wappalyzer website), so I think it is a WPT issue and was repeatable. Meant to raise an issue but forgot until you jolted my memory!

pmeenan commented 3 years ago

@bazzadp If you can still repeat it, it would really help to have a repro case. I'm wondering if the Wappalyzer definitions were updated mid-crawl and the spaces were added.

Since then the whole Wappalyzer engine has been updated and changed, so it would be more useful if we can see it in the May 2021 crawl.

tunetheweb commented 3 years ago

https://jelly-pop.com/ still shows it. For example: https://webpagetest.org/jsonResult.php?test=210416_BiDcT0_20469058a349eb72f7a0548367c14bd3&pretty=1 has Google Analytics without a space:

image

Whereas https://almanac.httparchive.org/en/2020/ has Google Analytics with a space: https://webpagetest.org/jsonResult.php?test=210416_BiDcK5_5cd247abc1ef655159c773f30e457da1&pretty=1

image

pmeenan commented 3 years ago

That is bizarre. I wonder if the page itself is overriding some array or other operations: when I take the raw output from Wappalyzer and run it through the same code outside the page, it keeps the spaces, but if I run it in the console for that page, it strips them out (maybe a code page issue). At least I can reproduce it now, so it should be easier to fix.

pmeenan commented 3 years ago

Ahh, looks like the pages override String.prototype.trim() and cause it to remove all of the whitespace. Since the Wappalyzer definitions don't have any trailing whitespace, I can just remove the trim operations.

Should be fixed now (well, over the next hour as the agent update rolls out).

Screen Shot 2021-04-16 at 5 47 25 PM
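
To illustrate the failure mode described in this comment, here is a minimal JavaScript sketch. It is not the WebPageTest agent code: the override body and the helper names `formatAppName` and `withOverriddenTrim` are assumptions made for illustration. It shows how a page-level replacement of `String.prototype.trim` that strips all whitespace turns "Google Analytics" into "GoogleAnalytics", and why skipping the trim call sidesteps the problem.

```javascript
// Keep a reference to the native implementation so it can be restored.
const nativeTrim = String.prototype.trim;

// Hypothetical stand-in for integration code that tidies an app name.
function formatAppName(name, useTrim) {
  return useTrim ? name.trim() : name;
}

// Run fn while a page-style override of trim() is installed.
// Assumed behaviour of the override: remove ALL whitespace,
// not just leading and trailing whitespace.
function withOverriddenTrim(fn) {
  String.prototype.trim = function () {
    return this.replace(/\s+/g, '');
  };
  try {
    return fn();
  } finally {
    // Always restore the native method, even if fn throws.
    String.prototype.trim = nativeTrim;
  }
}

const withTrim = withOverriddenTrim(() => formatAppName('Google Analytics', true));
const withoutTrim = withOverriddenTrim(() => formatAppName('Google Analytics', false));

console.log(withTrim);    // GoogleAnalytics
console.log(withoutTrim); // Google Analytics
```

Code that runs inside an arbitrary page cannot fully trust built-in prototypes, so avoiding the call where it is not needed is the simplest defence.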

tunetheweb commented 3 years ago

Why, why would anyone do this? You get all sorts when you look at 7.5 million web pages...

Good work nailing it down.

rviscomi commented 3 years ago

Thanks for tracking that down and fixing @pmeenan!

Can we close this?

tunetheweb commented 3 years ago

I was going to give it a quick check after the May crawl and then close it. There was also another issue where the technologies were all messed up, as discussed on the HTTP Archive Slack.

So I say let's leave this open as a reminder to check the technologies results after the May crawl. It's key data for the Web Almanac, so we want to make sure it's definitely sorted before our crawl month.

tunetheweb commented 3 years ago

Confirmed as all fixed in May crawl. Same query above gives 0 results.
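
For anyone who wants to run the same whitespace-insensitive check outside BigQuery, the idea behind the query can be sketched in JavaScript. This is an illustration only: `findWhitespaceVariants` is a hypothetical helper, it assumes the input is already aggregated to one row per (category, app) with a `total`, and it returns groups of spellings rather than the pairwise rows the SQL self-join produces. The sample rows are taken from the October 2020 results quoted earlier in this thread.

```javascript
// Group rows whose category and app are equal once regular spaces are
// removed, mirroring REPLACE(x, ' ', '') in the SQL query.
function findWhitespaceVariants(rows) {
  const groups = new Map();
  for (const row of rows) {
    const key = JSON.stringify([
      row.category.replace(/ /g, ''),
      row.app.replace(/ /g, ''),
    ]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  }
  // Keep only names with more than one spelling, most common spelling first.
  return [...groups.values()]
    .filter((group) => group.length > 1)
    .map((group) => [...group].sort((a, b) => b.total - a.total));
}

const sampleRows = [
  { category: 'Analytics', app: 'Google Analytics', total: 4618469 },
  { category: 'Analytics', app: 'GoogleAnalytics', total: 30 },
  { category: 'Ecommerce', app: 'Salesforce Commerce Cloud', total: 3611 },
  { category: 'Ecommerce', app: 'SalesforceCommerceCloud', total: 2 },
  { category: 'Maps', app: 'Google Maps', total: 341206 },
];

const variants = findWhitespaceVariants(sampleRows);
console.log(variants.map((group) => group.map((row) => row.app)));
```

An empty result corresponds to the "0 results" outcome reported for the May crawl.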

rviscomi commented 3 years ago

Reopening this to track a related issue.

I noticed that "GoDaddy Website Builder" no longer has any websites detected since February:

SELECT
  _TABLE_SUFFIX AS suffix,
  COUNT(DISTINCT url) AS urls
FROM
  `httparchive.technologies.2021_*`
WHERE
  app = 'GoDaddy Website Builder'
GROUP BY
  suffix
ORDER BY
  suffix

| suffix | urls |
| --- | --- |
| 01_01_desktop | 7525 |
| 01_01_mobile | 11006 |
| 02_01_desktop | 327 |

Interestingly, ismyhostfastyet.com is still able to detect GWB sites because it uses a signal from the HTTP headers and it's detecting 7k desktop pages as of May.

So I pulled out a URL from that detection and used the Wappalyzer extension for Chrome to get a true positive detection on https://finsbarandgrill.com/

image

However, in HA BQ and a plain WPT, we're only detecting Google Analytics:

SELECT
  *
FROM
  `httparchive.technologies.2021_05_01_desktop`
WHERE
  url = 'https://finsbarandgrill.com/'

| url | category | app | info |
| --- | --- | --- | --- |
| https://finsbarandgrill.com/ | Analytics | Google Analytics | "" |

https://webpagetest.org/jsonResult.php?test=210616_AiDcFQ_67699b36864cd25851abcfd97c53f792&pretty=1

                "detected": {
                    "Analytics": "Google Analytics"
                },
                "detected_apps": {
                    "Google Analytics": ""
                },

Wappalyzer uses a meta[name=generator] signal to detect GWB and the meta tag exists on the page as we'd expect:

https://github.com/AliasIO/wappalyzer/blob/6625a034b17965e9e30234f8a27b4f7f03e64e50/src/technologies.json#L7918-L7934

    "GoDaddy Website Builder": {
      "cats": [
        1
      ],
      "cookies": {
        "dps_site_id": ""
      },
      "icon": "godaddy.svg",
      "meta": {
        "generator": "Go Daddy Website Builder (.+)\\;version:\\1"
      },
      "pricing": [
        "mid"
      ],
      "saas": true,
      "website": "https://www.godaddy.com/websites/website-builder"
    },
// "Starfield Technologies; Go Daddy Website Builder 8.0.0000"
document.querySelector('meta[name=generator]').getAttribute('content')
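
The `generator` pattern in the definition above packs a regular expression and a version template into one string, separated by `\;`. The following is a simplified sketch of how such a pattern could be applied to the generator value. It is not Wappalyzer's actual implementation, and `parsePattern` and `detectVersion` are hypothetical names. It assumes case-insensitive matching and handles only the `version` tag with `\1`-style back-references.

```javascript
// Split a pattern string into its regex and its optional version template.
function parsePattern(pattern) {
  const [source, ...tags] = pattern.split('\\;');
  const parsed = { regex: new RegExp(source, 'i'), version: '' };
  for (const tag of tags) {
    const separator = tag.indexOf(':');
    if (separator > 0 && tag.slice(0, separator) === 'version') {
      parsed.version = tag.slice(separator + 1);
    }
  }
  return parsed;
}

// Return the resolved version string, or null when the value does not match.
function detectVersion(pattern, value) {
  const { regex, version } = parsePattern(pattern);
  const match = regex.exec(value);
  if (!match) return null;
  // Replace \1, \2, ... in the template with the captured groups.
  return version.replace(/\\(\d)/g, (_, index) => match[Number(index)] || '');
}

// Pattern and generator content as quoted in this comment.
const generatorPattern = 'Go Daddy Website Builder (.+)\\;version:\\1';
const generatorContent = 'Starfield Technologies; Go Daddy Website Builder 8.0.0000';

console.log(detectVersion(generatorPattern, generatorContent)); // 8.0.0000
```

Since the pattern matches the tag content on this page, a missing detection points at the meta tags not reaching the matching step rather than at the definition itself.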

@pmeenan this leads me to believe that there may be an integration bug with Wappalyzer in WPT. Would you be able to look into this?

pmeenan commented 3 years ago

The Wappalyzer checks weren't including the meta tags. Agent has been updated and will be rolling out over the next hour.

Tested it in dev here and it correctly caught Go Daddy.

"detected": {
    "CMS": "GoDaddy Website Builder 8.0.0000",
    "Analytics": "Google Analytics"
},
"detected_apps": {
    "GoDaddy Website Builder": "8.0.0000",
    "Google Analytics": ""
},

rviscomi commented 3 years ago

Thanks @pmeenan! I also noticed that the Wappalyzer extension detected React on that page, but it's not included in the new test results. Could something else be missing?

pmeenan commented 3 years ago

Possibly the serialized DOM. It serializes the HTML but not the DOM. Taking a look now.

pmeenan commented 3 years ago

Reached out to the Wappalyzer team to see how best to handle DOM-based detections. They are starting to migrate to it but the current engine doesn't directly support it and they have the extension doing the detections manually. Hoping there is a better way but should have something figured out soon.

pmeenan commented 3 years ago

Whew. That was somewhat more painful than I expected. Had to rewrite the JS variable detection part which changed pretty significantly when the engine changed a few months back (also added the support for the DOM detections).

Here is an updated test. Change is rolling out to prod (and HA) over the next hour.

"_detected": {
    "CMS": "GoDaddy Website Builder 8.0.0000",
    "JavaScript libraries": "React 16.13.1,Lodash 4.17.5",
    "Analytics": "Google Analytics"
},
"_detected_apps": {
    "GoDaddy Website Builder": "8.0.0000",
    "React": "16.13.1",
    "Lodash": "4.17.5",
    "Google Analytics": ""
}

rviscomi commented 3 years ago

That's great, thanks for fixing!

rviscomi commented 3 years ago

Something to keep an eye on to verify that the fix is working. Here's a query to measure the change in origins from January to May for all technologies, using the table for the CWV Technology Report dashboard:

SELECT
  app,
  SAFE_DIVIDE(may.origins - jan.origins, jan.origins) AS pct_change,
  may.origins - jan.origins AS num_change,
  jan.origins AS jan_origins,
  may.origins AS may_origins
FROM (
  SELECT
    date,
    app,
    origins
  FROM
    `httparchive.core_web_vitals.technologies`
  WHERE
    date = '2021-01-01' AND
    client = 'mobile' AND
    origins >= 1000) AS jan
JOIN (
  SELECT
    date,
    app,
    origins
  FROM
    `httparchive.core_web_vitals.technologies`
  WHERE
    date = '2021-05-01' AND
    client = 'mobile') AS may
USING (app)
ORDER BY
  pct_change ASC

Top 20 results:

| app | pct_change | num_change | jan_origins | may_origins |
| --- | --- | --- | --- | --- |
| Incapsula | -100% | -19,360 | 19,364 | 4 |
| Google Code Prettify | -100% | -13,713 | 13,716 | 3 |
| CKEditor | -100% | -20,169 | 20,174 | 5 |
| Hugo | -100% | -2,732 | 2,733 | 1 |
| Pardot | -100% | -13,050 | 13,056 | 6 |
| Angular | -100% | -28,733 | 28,760 | 27 |
| AlloyUI | -100% | -5,966 | 5,973 | 7 |
| INFOnline | -100% | -2,555 | 2,559 | 4 |
| Intercom | -100% | -16,677 | 16,706 | 29 |
| MobX | -100% | -9,661 | 9,694 | 33 |
| VideoJS | -100% | -51,541 | 51,730 | 189 |
| SilverStripe | -100% | -2,843 | 2,856 | 13 |
| Webtrends | -99% | -1,590 | 1,599 | 9 |
| Kampyle | -99% | -1,352 | 1,360 | 8 |
| Neto | -99% | -1,097 | 1,106 | 9 |
| Twitter Emoji (Twemoji) | -99% | -1,245,734 | 1,260,312 | 14,578 |
| Polymer | -95% | -1,019 | 1,069 | 50 |
| Open Web Analytics | -95% | -5,521 | 5,811 | 290 |
| Disqus | -95% | -26,703 | 28,121 | 1,418 |
| Dojo | -95% | -37,429 | 39,498 | 2,069 |

The bug appears to be more widespread than I'd initially thought and some technologies like Angular were almost entirely wiped out.
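
The January-to-May comparison computed by the query can also be sketched in JavaScript, for example to post-process exported rows. This is an illustration under assumptions: `compareOrigins` and `safeDivide` are hypothetical helpers, the input is assumed to be one row per app for each month, and the sample numbers are three rows from the results quoted in this comment.

```javascript
// Rough equivalent of BigQuery's SAFE_DIVIDE: null instead of dividing by zero.
function safeDivide(numerator, denominator) {
  return denominator === 0 ? null : numerator / denominator;
}

// Inner-join January and May rows on app, keep apps with enough January
// origins, and sort by relative change, largest drop first.
function compareOrigins(jan, may, minJanOrigins = 1000) {
  const mayByApp = new Map(may.map((row) => [row.app, row.origins]));
  return jan
    .filter((row) => row.origins >= minJanOrigins && mayByApp.has(row.app))
    .map((row) => {
      const mayOrigins = mayByApp.get(row.app);
      return {
        app: row.app,
        pctChange: safeDivide(mayOrigins - row.origins, row.origins),
        numChange: mayOrigins - row.origins,
        janOrigins: row.origins,
        mayOrigins,
      };
    })
    // With the default threshold janOrigins is never 0, so pctChange is numeric.
    .sort((a, b) => a.pctChange - b.pctChange);
}

const janRows = [
  { app: 'Angular', origins: 28760 },
  { app: 'Incapsula', origins: 19364 },
  { app: 'Disqus', origins: 28121 },
];
const mayRows = [
  { app: 'Angular', origins: 27 },
  { app: 'Incapsula', origins: 4 },
  { app: 'Disqus', origins: 1418 },
];

const changes = compareOrigins(janRows, mayRows);
console.log(changes.map((row) => `${row.app}: ${row.numChange}`));
```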

The June dataset should have the fix partially applied, so we should start to see these rebounding. Eventually everything should be fully counted in the July crawl.

@pmeenan WDYT about adding some kind of automated testing to ensure that the Wappalyzer integration is working? Not sure if that'd be implemented on the WPT or HTTP Archive side, or if it's something easy to build as a standalone WPT API app.
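
One possible shape for such a check, sketched in JavaScript: keep a short list of pages with known technologies and compare it against the `detected_apps` object from a test result. This is only a sketch, not an existing WPT or HTTP Archive feature. `missingDetections` is a hypothetical helper, and fetching the result JSON is left out. The sample objects reuse the `detected_apps` values quoted earlier in this thread for finsbarandgrill.com.

```javascript
// Return the expected apps that are absent from a detected_apps object.
function missingDetections(expectedApps, detectedApps) {
  return expectedApps.filter(
    (app) => !Object.prototype.hasOwnProperty.call(detectedApps, app)
  );
}

// detected_apps before the meta tag and DOM detection fixes.
const beforeFix = { 'Google Analytics': '' };

// detected_apps after both fixes.
const afterFix = {
  'GoDaddy Website Builder': '8.0.0000',
  React: '16.13.1',
  Lodash: '4.17.5',
  'Google Analytics': '',
};

// Technologies the Wappalyzer extension reported for the page.
const expectedApps = ['GoDaddy Website Builder', 'React', 'Google Analytics'];

const missingBefore = missingDetections(expectedApps, beforeFix);
const missingAfter = missingDetections(expectedApps, afterFix);
console.log(missingBefore); // GoDaddy Website Builder and React are missing
console.log(missingAfter);  // nothing is missing
```

A non-empty result for a known page would flag a regression in the integration before it affects a full crawl.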

rviscomi commented 3 years ago

Seeing detections for the 20 most affected technologies in the previous comment starting to recover in the June dataset. For example, here's a screenshot from the CWV Technology Report:

image (Twemoji omitted because it's very popular and throws off the y-axis)

Given that the fix was applied late in the June crawl, detections haven't fully recovered, so I'll leave this issue open and continue to monitor this when the July crawl is available.

rockeynebhwani commented 2 years ago

@rviscomi - Can we close this now?

rviscomi commented 2 years ago

Yes we can close this now. Tracking improvements to technology detections in https://github.com/HTTPArchive/data-pipeline/issues/31 instead.