To mitigate https://github.com/GSA/datagov-deploy/issues/2873, as a first step to avoid any compliance issues.
First pass at building the static site is live.
Current bugs:
Also, we will of course have to consider a different approach for the contact form and the list of Stack Exchange questions.
The current static crawl has built www as a subfolder of the entire site. We will need to analyze this folder for any differences, merge them, and deprecate www.data.gov.
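One possible way to spot those differences, sketched with made-up directory names for the two captures:

```bash
# Hypothetical sketch: compare the www capture against the root capture to find files
# that differ or exist on only one side. Both directory names are assumptions.
diff -rq crawl/www.data.gov crawl/data.gov > www-vs-root-diff.txt
```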
Docs for Federalist proxy: https://github.com/18F/federalist-proxy
From pairing with James, we think that some of the failing functionality (namely, dropdowns not dropping down) is caused by various js/css files not being referenced correctly.
What we think is happening:
One-liner to remove all the versioning info (anything after @ in the filename):
$ find . -iname "*@ver=*" | while read -r fname; do echo "$fname --> ${fname%@*}"; mv "$fname" "${fname%@*}"; done
Should work on all child dirs as well as pwd.
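As a sanity check after the rename (assuming it is run from the crawl root), confirm that no versioned filenames remain:

```bash
# Should print 0 once every @ver= suffix has been stripped.
find . -iname "*@ver=*" | wc -l
```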
(href="|src=')(.*?)(@ver=.*?)("|')
replacing with $1$2$4
Regex to find/replace all references to versions in filenames. Basically it just groups the parts and leaves out the version part when replacing. This will probably need to be tweaked; thank goodness for source control.
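To apply that replacement across the crawled files from the command line, a minimal sketch assuming GNU sed; since sed -E has no non-greedy matching, character classes stand in for the .*? groups:

```bash
# Hypothetical sketch: strip @ver=... from href/src references in every saved HTML file.
# [^"']* replaces the non-greedy .*? because sed -E does not support lazy quantifiers.
find . -iname '*.html' -print0 | xargs -0 sed -i -E \
  "s/(href=\"|src=')([^\"']*)(@ver=[^\"']*)([\"'])/\1\2\4/g"
```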
Looks like is the next most common missing file from comparing prod vs local.
Future note for myself:
Counts unique URLs that returned 200 in fedlog.
grep -B 2 '200' fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
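The same pattern works for other status codes; for example, a sketch counting unique URLs that came back 404 (assuming the same fedlog format):

```bash
# Count distinct URLs in the wget log whose status was 404 (mirrors the 200 count above).
grep -B 2 '404' fedlog | grep "http" | cut -d " " -f 4 | sort -u | wc -l
```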
Best scan thus far (a non-spider download variant is sketched below, after the flag notes).
wget -e robots=off -U mozilla -w 1 --spider -ofedlog --recursive --page-requisites --html-extension --domains federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov --no-parent --level=inf https://federalist-a1176d2b-cb31-49a0-ba20-a47542ee2ec5.app.cloud.gov/preview/gsa/datagov-website/bug/src-filenames/
-e robots=off: ignores robots.txt
-U mozilla: sets the user agent to mozilla
-w 1: waits 1 second between requests
--spider: spider mode; checks existence and downloads to temp only
-ofedlog: creates an output log file called fedlog
--recursive: recursively searches through html files for new links
--page-requisites: also fetches the assets (images, css, js) each page needs
--html-extension: saves with an html extension
--domains <domain>: limits the crawl to the given domain
--no-parent: does not search upstream of the parent
--level=inf: recursive search depth (inf = no limit)
<url>: the starting URL for the crawl
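For reference, a sketch of what the same crawl looks like as an actual download rather than a spider pass; the placeholder host and the --adjust-extension/--convert-links flags are assumptions, not the exact command used:

```bash
# Hypothetical non-spider variant: saves the pages to disk and rewrites links so the
# capture can be browsed locally. Replace example.app.cloud.gov with the real preview host.
wget -e robots=off -U mozilla -w 1 -o fedlog --recursive --page-requisites \
  --adjust-extension --convert-links --no-parent --level=inf \
  --domains example.app.cloud.gov https://example.app.cloud.gov/preview/
```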
The limits in #2645 were causing issues with the crawl. We bumped WordPress 3x to make sure the crawl can finish. We will need to reset it once the crawl is complete.
Note: this may be enough to confirm that the rate limiting is working, although we did not confirm in real time that the IP address was blocked.
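If "bumped 3x" refers to instance count, the scale-up and later reset would look roughly like this on cloud.gov; the app name and original instance count are assumptions:

```bash
# Hypothetical sketch: scale the WordPress app up for the crawl, then back down afterwards.
cf scale wordpress -i 3
# ...once the crawl is done:
cf scale wordpress -i 1
```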
Here's a summary of the things we noted which may need addressing.
There are 2097 places in the static capture with form elements. Stripping out suffixes that start with %, #, and @, it looks like there are 11 unique local form targets. They boil down to these five groups (with their approximate locations in the site indicated):
We generated this list using:
grep -rHi 'action="' | cut -d : -f 2- | sed 's/^ *//g' | grep -v tribe | grep -v http | sort -u | \
cut -d '%' -f 1 | sort -u | cut -d \# -f 1 | sort -u | cut -d @ -f 1 | sort -u > local-form-target-pages.txt
These will display static content until they are replaced with client-side equivalents.
The "Upcoming events" calendar page search page. It posts async to /wp/wp-admin/admin-ajax.php, which doesn't do anything...
Closing. Notes on further crawls available at https://github.com/GSA/datagov-wp-boilerplate/blob/main/crawl.md
User Story
In order to find a way to reduce the pressure of our 90-day deadline to evacuate the FCS environment, the data.gov team wants to make an archiving crawl of the WordPress site to determine whether an automatic static dump is an adequate stopgap to give us time to work on an authored static site.
Acceptance Criteria
Background
We have 90 days to evacuate the FCS environment. This is a potential stopgap measure to buy us time without having to work out content issues, database dumps, styling/layouts, etc.
Security Considerations (required)
We would only be archiving/publishing information that is already publicly accessible to web-scraping robots and search indexes.
Sketch