GSA / site-scanning

The central repository for the Site Scanning program
https://digital.gov/site-scanning

Expand the CMS scan's ability to detect #379

Closed gbinal closed 1 year ago

gbinal commented 1 year ago

Following up on the great work of #311, we should see if we can improve the methodology so as to detect more examples of CMSes in action.

To elaborate, we are looking for about 30 different snippets that would suggest different CMSes, but we are only seeing results for 4-5 CMSes.

Based on @benbalter's good work (summary here), I would expect to see more CMSes represented, especially SharePoint, Joomla, Percussion, and maybe Liferay? (source data here)

akuny commented 1 year ago

On June 1, we deployed a new feature that also scans HTTP response headers for evidence of CMS usage. Below are the counts for CMS detection after this feature was introduced: a lot of new Drupal hits, but no additional CMSes detected.

[Image: Screenshot 2023-06-02 at 8 30 50 AM]

akuny commented 1 year ago

The scan engine has been updated to detect Microsoft SharePoint via HTTP response headers; updated totals below.

[Image: Screenshot 2023-06-02 at 1 10 26 PM]

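
For anyone following along, here is a rough sketch of what header-based detection can look like. It is not the scan engine's actual code: the `HEADER_SIGNATURES` table, the header names in it, and the `detect_cms_from_headers` function are all illustrative assumptions, since the real rules are not shown in this thread.

```python
# Hypothetical header signatures, for illustration only.
# Each entry is (header name, substring expected in the value).
# An empty substring means "the header merely has to be present".
HEADER_SIGNATURES = {
    "Drupal": [("x-generator", "drupal"), ("x-drupal-cache", "")],
    "Microsoft SharePoint": [
        ("microsoftsharepointteamservices", ""),
        ("sprequestguid", ""),
    ],
}


def detect_cms_from_headers(headers):
    """Return the first CMS whose signature matches the headers, or None."""
    # HTTP header names are case-insensitive, so normalize before matching.
    normalized = {
        name.lower(): str(value).lower() for name, value in headers.items()
    }
    for cms, signatures in HEADER_SIGNATURES.items():
        for header_name, needle in signatures:
            value = normalized.get(header_name)
            if value is not None and needle in value:
                return cms
    return None
```

Passing in the response headers as a plain dict is enough to try it out, for example `detect_cms_from_headers({"X-Generator": "Drupal 10 (https://www.drupal.org)"})`. A CMS only shows up in the counts if it has an entry in the signature table, which fits the pattern of new Drupal and SharePoint hits without any other CMSes appearing.
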
gbinal commented 1 year ago

Great progress!!!!!

For a next step, let's take this list of CMSes that Ben had previously been finding and look at how his code detects each one. If the method is HTML or header sniffing, let's gut-check whether there's any reason why we might not be seeing them.

akuny commented 1 year ago

Here's some info regarding the CMSs we're not picking up.

We currently don't have any HTML snippets or HTTP response headers to scan for the following CMSs:

We are scanning for the following CMSs using the means specified below. If we have particular examples of sites that are using any one of these three CMSs, then I can try to determine why they aren't being picked up.

gbinal commented 1 year ago

Gotcha. So, for now, let's set aside the first 7 since we're not necessarily looking to further expand the detection methodologies.

For the latter 3, yeah, let's look at Ben's data to see if the ones he's found are ones that we should have, too, or maybe they have changed CMSes.

gbinal commented 1 year ago

Note that if we expand the methodologies in the future, we should include Wagtail CMS since that was requested.

akuny commented 1 year ago

For Joomla, Percussion, and SilverStripe, I think Ben's data detects them by looking for a meta element with name="generator" whose content attribute contains a certain value. In the case of dni.gov:

<meta name="generator" content="Joomla! - Open Source Content Management">

We currently are not looking for these elements, which is why we don't have hits for these three CMSs.
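
To make the meta element check described above concrete, here is a minimal sketch using only Python's standard library. It is an assumption about how such a check could be written, not Ben's code or the scan engine's: `GENERATOR_SIGNATURES`, `GeneratorMetaParser`, and `detect_cms_from_html` are hypothetical names, and the Percussion and SilverStripe substrings are guesses that would need checking against real sites.

```python
from html.parser import HTMLParser

# Hypothetical mapping from a lowercase substring of the generator
# content to a CMS name. Only the Joomla value is confirmed by the
# dni.gov example in this thread; the others are assumptions.
GENERATOR_SIGNATURES = {
    "joomla": "Joomla",
    "percussion": "Percussion",
    "silverstripe": "SilverStripe",
}


class GeneratorMetaParser(HTMLParser):
    """Collect the content of every <meta name="generator"> element."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def detect_cms_from_html(html):
    """Return the CMS named by a generator meta element, or None."""
    parser = GeneratorMetaParser()
    parser.feed(html)
    for content in parser.generators:
        lowered = content.lower()
        for needle, cms in GENERATOR_SIGNATURES.items():
            if needle in lowered:
                return cms
    return None
```

Fed the dni.gov element quoted above, `detect_cms_from_html` returns `"Joomla"`. Matching on a lowercase substring rather than the exact string keeps the check tolerant of version numbers or taglines in the content value, at the cost of a small risk of false positives.
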

gbinal commented 1 year ago

Makes sense. In our conversation, I think you said that we could pretty easily expand our HTML sniffing to cover those too, so that's the next step.

gbinal commented 1 year ago

This is done in local development. We can close it once the deploy is done and confirmed.

akuny commented 1 year ago

These changes are live and the snapshot has been updated.

gbinal commented 1 year ago

Great!

gbinal commented 1 year ago

This is done. Great work.