GSA / data.gov

Main repository for the data.gov service
https://data.gov

[spike] Test static archive of www.data.gov as a temporary solution for migration #3465

Closed mogul closed 3 years ago

mogul commented 3 years ago

User Story

In order to find a way to reduce the pressure of our 90-day deadline to evacuate the FCS environment, the data.gov team wants to make an archiving crawl of the WordPress site to determine whether an automatic static dump is an adequate stopgap to give us time to work on an authored static site.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

We have 90 days to evacuate the FCS environment. This is a potential stopgap measure to buy us time without having to work out content issues, database dumps, styling/layouts, etc.

Security Considerations (required)

We would only be archiving/publishing information that is already publicly accessible to web-scraping robots and search indexes.

Sketch

jbrown-xentity commented 3 years ago

This is a first step to mitigate https://github.com/GSA/datagov-deploy/issues/2873 and avoid any compliance issues.

jbrown-xentity commented 3 years ago

First pass at building the static site is live.

Current bugs:

jbrown-xentity commented 3 years ago

Also, we will of course have to consider a different approach for the contact form and the list of Stack Exchange questions.

jbrown-xentity commented 3 years ago

The current static crawl has built www as a subfolder of the entire site. We will need to analyze this folder for any differences, merge them, and deprecate www.data.gov.

robert-bryson commented 3 years ago

Docs for Federalist proxy: https://github.com/18F/federalist-proxy

robert-bryson commented 3 years ago

From pairing with James, we think that some of the functionality failing (namely dropdowns not dropping down) is being caused by various js/css files not being referenced correctly.

What we think is happening:

robert-bryson commented 3 years ago

One-liner to remove all the versioning info (anything after the @ in the filename):

  $ find . -iname "*@ver=*" | while read -r fname; do echo "$fname --> ${fname%@*}"; mv -- "$fname" "${fname%@*}"; done

Should work on all child dirs as well as pwd.
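As a sketch of the same rename with quoting that survives spaces in filenames (the /tmp/ver-demo directory and file names below are made up for illustration, not from the actual crawl):

```shell
# Demo of the @ver rename in a throwaway directory (path and filenames
# are hypothetical; the real input is the static crawl output).
mkdir -p /tmp/ver-demo
cd /tmp/ver-demo
touch "main.css@ver=5.8.2" "theme style.js@ver=1.0"

# Same idea as the one-liner, with quoting so spaces in names survive.
find . -iname "*@ver=*" | while read -r fname; do
    # ${fname%@*} strips everything from the last @ onward.
    mv -- "$fname" "${fname%@*}"
done

ls
```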

robert-bryson commented 3 years ago


`(href="|src=')(.*?)(@ver=.*?)("|')` replacing with `$1$2$4`: a regex to find/replace all references to versioned filenames. Basically just grouping the parts and leaving out the version part when replacing. This will probably need to be tweaked; thank goodness for source control.
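A sketch of applying that idea with sed (the sample markup line is invented; sed's ERE has no lazy quantifiers, so `[^"'@]*` stands in for `.*?`):

```shell
# Sample versioned reference (content is hypothetical).
line='<link rel="stylesheet" href="wp-content/themes/datagov/style.css@ver=5.8">'

# Drop the @ver=... part while keeping the closing quote, echoing the
# (href="|src=')(...)(@ver=...)("|') -> $1$2$4 grouping described above.
echo "$line" | sed -E "s/(href=\"|src=')([^\"'@]*)(@ver=[^\"']*)(\"|')/\1\2\4/g"
```

Once the pattern checks out it could be applied in bulk with something like `find . -name '*.html' -exec sed -i -E '…' {} +` (GNU sed's in-place flag assumed).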

robert-bryson commented 3 years ago

Looks like image is the next most common missing file from comparing prod vs local.

robert-bryson commented 3 years ago

Future note for myself:

Counts 200s in fedlog:

  grep -B 2 '200' fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l

Best scan thus far:

  wget -e robots=off -U mozilla -w 1 --spider -o fedlog --recursive --page-requisites --html-extension --domains federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov --no-parent --level=inf https://federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov/preview/gsa/datagov-website/bug/src-filenames/
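To illustrate what the 200-counting pipeline pulls out of a wget log (the fedlog excerpt below is fabricated; field 4 is the URL because wget pads the request line with two spaces before it):

```shell
# Fabricated wget -o log excerpt; a real fedlog comes from the spider run above.
cat > /tmp/fedlog <<'EOF'
--2021-06-01 12:34:56--  https://example.app.cloud.gov/preview/page1/
Connecting to example.app.cloud.gov... connected.
HTTP request sent, awaiting response... 200 OK
--2021-06-01 12:34:57--  https://example.app.cloud.gov/preview/missing/
Connecting to example.app.cloud.gov... connected.
HTTP request sent, awaiting response... 404 Not Found
EOF

# -B 2 pulls in the request line two rows above each 200; the lowercase
# "http" grep keeps the URL line and drops "HTTP request sent..." itself.
grep -B 2 '200' /tmp/fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
```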

jbrown-xentity commented 3 years ago

The limits in #2645 were causing issues on the crawl. We bumped WordPress 3x to make sure the crawl can finish; we will need to reset it once the crawl is complete.

Note: this may be enough to confirm that the rate limiting is working, although we did not confirm in real time that the IP address was blocked.

mogul commented 3 years ago

Here's a summary of the things we noted which may need addressing.

Forms

There are 2097 places in the static capture where there are form elements. Stripping out suffixes that start with %, #, and @, it looks like there are 11 unique local form targets. They boil down to these five groups (with their approximate locations in the site indicated):

We generated this list using:

  grep -rHi 'action="' | cut -d : -f 2- | sed 's/^ *//g' | grep -v tribe |grep -v http | sort -u | \
    cut -d '%' -f 1| sort -u | cut -d \# -f 1| sort -u| cut -d @ -f 1 | sort -u > local-form-target-pages.txt
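A sketch of what that pipeline extracts, run against a fabricated page (the form actions below are invented, not the real eleven targets):

```shell
# Fabricated HTML; the real input is the static capture directory.
mkdir -p /tmp/form-demo && cd /tmp/form-demo
cat > page.html <<'EOF'
<form action="/search/#results" method="get">
<form action="/contact/@sidebar" method="post">
<form action="https://external.example.gov/submit" method="post">
EOF

# Same pipeline as above: drop non-local (http) targets, strip the
# %/#/@ suffixes, and deduplicate what remains.
grep -rHi 'action="' . | cut -d : -f 2- | sed 's/^ *//g' | grep -v tribe | grep -v http | sort -u | \
  cut -d '%' -f 1 | sort -u | cut -d \# -f 1 | sort -u | cut -d @ -f 1 | sort -u
```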

Things that are normally dynamic

These will display static content until they are replaced with client-side equivalents.

Form that is just broken

The "Upcoming events" calendar search page: it posts async to /wp/wp-admin/admin-ajax.php, which doesn't do anything...


Observations

robert-bryson commented 3 years ago

Closing. Notes on further crawls available at https://github.com/GSA/datagov-wp-boilerplate/blob/main/crawl.md