GSA / data.gov

Main repository for the data.gov service
https://data.gov

[spike] Test static archive of www.data.gov as a temporary solution for migration #3465

Closed mogul closed 3 years ago

mogul commented 3 years ago

User Story

In order to find a way to reduce the pressure of our 90-day deadline to evacuate the FCS environment, the data.gov team wants to make an archiving crawl of the WordPress site to determine whether an automatic static dump is an adequate stopgap to give us time to work on an authored static site.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

We have 90 days to evacuate the FCS environment. This is a potential stopgap measure to buy us time without having to work out content issues, database dumps, styling/layouts, etc.

Security Considerations (required)

We would only be archiving/publishing information that is already publicly accessible to web-scraping robots and search indexes.

Sketch

jbrown-xentity commented 3 years ago

This is a first step to mitigate https://github.com/GSA/datagov-deploy/issues/2873 and avoid any compliance issues.

jbrown-xentity commented 3 years ago

First pass at building the static site is live.

Current bugs:

jbrown-xentity commented 3 years ago

Also, we will of course have to consider a different approach for the contact form and the list of Stack Exchange questions.

jbrown-xentity commented 3 years ago

The current static crawl has built www as a subfolder of the entire site. We will need to analyze this folder for any differences, merge them, and deprecate www.data.gov.

robert-bryson commented 3 years ago

Docs for Federalist proxy: https://github.com/18F/federalist-proxy

robert-bryson commented 3 years ago

From pairing with James, we think that some of the functionality failing (namely dropdowns not dropping down) is being caused by various js/css files not being referenced correctly.

What we think is happening:

robert-bryson commented 3 years ago

One-liner to remove all the versioning info (anything after the @ in the filename):

  $ find . -iname "*@ver=*" | while read -r fname; do echo "$fname --> ${fname%@*}"; mv -- "$fname" "${fname%@*}"; done

Should work on all child dirs as well as pwd.
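As a sketch of the same rename with quoting that survives spaces in filenames (the /tmp/ver-demo directory and file names below are made up for illustration, not from the actual crawl):

```shell
# Demo of the @ver rename in a throwaway directory (path and filenames
# are hypothetical; the real input is the static crawl output).
mkdir -p /tmp/ver-demo
cd /tmp/ver-demo
touch "main.css@ver=5.8.2" "theme style.js@ver=1.0"

# Same idea as the one-liner, with quoting so spaces in names survive.
find . -iname "*@ver=*" | while read -r fname; do
    # ${fname%@*} strips everything from the last @ onward.
    mv -- "$fname" "${fname%@*}"
done

ls
```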

robert-bryson commented 3 years ago


`(href="|src=')(.*?)(@ver=.*?)("|')` replacing with `$1$2$4`: a regex to find/replace all references to versioned filenames. Basically just grouping the parts and leaving out the version part when replacing. This will probably need to be tweaked; thank goodness for source control.
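A sketch of applying that idea with sed (the sample markup line is invented; sed's ERE has no lazy quantifiers, so `[^"'@]*` stands in for `.*?`):

```shell
# Sample versioned reference (content is hypothetical).
line='<link rel="stylesheet" href="wp-content/themes/datagov/style.css@ver=5.8">'

# Drop the @ver=... part while keeping the closing quote, echoing the
# (href="|src=')(...)(@ver=...)("|') -> $1$2$4 grouping described above.
echo "$line" | sed -E "s/(href=\"|src=')([^\"'@]*)(@ver=[^\"']*)(\"|')/\1\2\4/g"
```

Once the pattern checks out it could be applied in bulk with something like `find . -name '*.html' -exec sed -i -E '…' {} +` (GNU sed's in-place flag assumed).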

robert-bryson commented 3 years ago

Looks like image is the next most common missing file from comparing prod vs local.

robert-bryson commented 3 years ago

Future note for myself:

Counts 200s in fedlog:

  grep -B 2 '200' fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l

Best scan thus far:

  wget -e robots=off -U mozilla -w 1 --spider -o fedlog --recursive --page-requisites --html-extension --domains federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov --no-parent --level=inf https://federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov/preview/gsa/datagov-website/bug/src-filenames/
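To illustrate what the 200-counting pipeline pulls out of a wget log (the fedlog excerpt below is fabricated; field 4 is the URL because wget pads the request line with two spaces before it):

```shell
# Fabricated wget -o log excerpt; a real fedlog comes from the spider run above.
cat > /tmp/fedlog <<'EOF'
--2021-06-01 12:34:56--  https://example.app.cloud.gov/preview/page1/
Connecting to example.app.cloud.gov... connected.
HTTP request sent, awaiting response... 200 OK
--2021-06-01 12:34:57--  https://example.app.cloud.gov/preview/missing/
Connecting to example.app.cloud.gov... connected.
HTTP request sent, awaiting response... 404 Not Found
EOF

# -B 2 pulls in the request line two rows above each 200; the lowercase
# "http" grep keeps the URL line and drops "HTTP request sent..." itself.
grep -B 2 '200' /tmp/fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
```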

jbrown-xentity commented 3 years ago

The limits in #2645 were causing issues on the crawl. We bumped WordPress 3x to make sure the crawl can finish; we will need to reset it once the crawl is complete.

Note: this may be enough to confirm that the rate limiting is working, although we did not confirm in real time that the IP address was blocked.

mogul commented 3 years ago

Here's a summary of the things we noted which may need addressing.

Forms

There are 2097 places in the static capture where there are form elements. Stripping out suffixes that start with %, #, and @, it looks like there are 11 unique local form targets. They boil down to these five groups (with their approximate locations in the site indicated):

We generated this list using:

  grep -rHi 'action="' | cut -d : -f 2- | sed 's/^ *//g' | grep -v tribe |grep -v http | sort -u | \
    cut -d '%' -f 1| sort -u | cut -d \# -f 1| sort -u| cut -d @ -f 1 | sort -u > local-form-target-pages.txt
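A sketch of what that pipeline extracts, run against a fabricated page (the form actions below are invented, not the real eleven targets):

```shell
# Fabricated HTML; the real input is the static capture directory.
mkdir -p /tmp/form-demo && cd /tmp/form-demo
cat > page.html <<'EOF'
<form action="/search/#results" method="get">
<form action="/contact/@sidebar" method="post">
<form action="https://external.example.gov/submit" method="post">
EOF

# Same pipeline as above: drop non-local (http) targets, strip the
# %/#/@ suffixes, and deduplicate what remains.
grep -rHi 'action="' . | cut -d : -f 2- | sed 's/^ *//g' | grep -v tribe | grep -v http | sort -u | \
  cut -d '%' -f 1 | sort -u | cut -d \# -f 1 | sort -u | cut -d @ -f 1 | sort -u
```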

Things that are normally dynamic

These will display static content until they are replaced with client-side equivalents.

Form that is just broken

The "Upcoming events" calendar search page: it posts async to /wp/wp-admin/admin-ajax.php, which doesn't do anything...


Observations

robert-bryson commented 3 years ago

Closing. Notes on further crawls available at https://github.com/GSA/datagov-wp-boilerplate/blob/main/crawl.md