Open jadudm opened 5 days ago
The rules notion does not belong in the DB. It should be in a config file (initially).
Instead of a ruleset (here), we should just compute `next_fetch`. If an override comes through (e.g. a fetch request that needs to ignore the date), we do that. However, the next fetch is based on a schedule, and can be computed. We don't need to do a computation every time.
Simple first.
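A minimal sketch of that idea, assuming a fixed revisit interval per page and a stored `next_fetch` timestamp. The function names, the `ignore_date` flag, and the 30-day interval are hypothetical and only illustrate the "compute once, check cheaply, allow overrides" flow:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_next_fetch(last_fetch: datetime, interval: timedelta) -> datetime:
    """Compute the next scheduled fetch once, at the time a page is fetched."""
    return last_fetch + interval


def should_fetch(next_fetch: Optional[datetime], now: datetime,
                 ignore_date: bool = False) -> bool:
    """Cheap check at crawl time: no rule evaluation, just a comparison."""
    # An override (hypothetical ignore_date flag) or a never-fetched page
    # is always due.
    if ignore_date or next_fetch is None:
        return True
    return now >= next_fetch


last = datetime(2024, 1, 1, tzinfo=timezone.utc)
next_fetch = compute_next_fetch(last, timedelta(days=30))  # assumed interval
now = datetime(2024, 1, 15, tzinfo=timezone.utc)

print(should_fetch(next_fetch, now))                    # False
print(should_fetch(next_fetch, now, ignore_date=True))  # True
```

Storing `next_fetch` alongside the page means the schedule is only computed when a fetch completes, not on every lookup.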
At a glance
In order to only visit things once in a crawl, as a developer, I want a way to know when I last grabbed a page.
Acceptance Criteria
We use DRY behavior-driven development wherever possible.
Shepherd
Background
If we adhere to a politeness time of 2s/page (meaning a domain only gets hit once every two seconds), then the time to fetch an entire site looks like...
At 1M pages, we are edging into a bi-monthly schedule (at best). This will likely lead to a space where, for large domains (e.g. `va.gov`), we may end up with subdomains as a primary target, and the rates for subdomains will vary. (More complex will be per-page timing within a domain. We should assume it will be needed...)
At 1M pages, we might be less polite, if we have to sustain a monthly cadence... 1s/page takes us down to ~15 days for a full crawl. Note that these numbers all expand somewhat when PDFs are factored in; although it is a single 'fetch' operation, a single PDF might be 2000+ pages.
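As a rough check on these figures, the sketch below computes only the politeness delay (pages times seconds per page). The site sizes other than 1M pages are assumed for illustration, and the result ignores fetch latency, processing time, and PDFs, so real crawls will take longer:

```python
SECONDS_PER_DAY = 86_400


def crawl_days(pages: int, seconds_per_page: float) -> float:
    """Lower bound on full-crawl time for one domain, in days."""
    return pages * seconds_per_page / SECONDS_PER_DAY


# Assumed site sizes; only the 1M-page case is discussed in the text.
for pages in (1_000, 10_000, 100_000, 1_000_000):
    for delay in (2.0, 1.0):
        days = crawl_days(pages, delay)
        print(f"{pages:>9,} pages at {delay:.0f}s/page: {days:6.2f} days")
```

For 1M pages this gives about 23.1 days at 2s/page and about 11.6 days at 1s/page. The ~15 days quoted above for 1s/page presumably allows for overhead beyond the bare delay.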
An `entree` service keeps track of who we've visited, and potentially who we are yet to visit, and when.
The metadata DB needs to hold everything we need to check whether a visit should happen. It probably contains...
The `hash(scheme, host, port)` is the primary key; this gives us uniqueness per page. The question is: how do we know if we should revisit? This involves the `rules` table. This is keyed on host, meaning the `rules` table will look like ... `null`)
The rule column contains a JSON object that defines a rule we will evaluate.
The input to the rule is the row from the `entree` table. It is always a single rule, but a small language defines what is possible: ...
This allows expression of things like ...
This way, a single rule can be written for a domain, and we can break out paths as needed, all the way down to individual pages.
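To illustrate the "one rule per domain, broken out by path" idea, here is a hypothetical sketch. It does not reproduce the small rule language described above; it is a simpler stand-in that uses a plain dictionary and longest-prefix matching, and the hosts, paths, and intervals are invented for the example:

```python
from typing import Dict, Optional

# Hypothetical rules: host -> {path prefix -> revisit interval in days}.
RULES: Dict[str, Dict[str, int]] = {
    "va.gov": {
        "/": 30,                 # default for the whole domain
        "/news/": 7,             # a path broken out with a shorter interval
        "/news/today.html": 1,   # down to an individual page
    },
}


def revisit_days(host: str, path: str,
                 rules: Dict[str, Dict[str, int]] = RULES) -> Optional[int]:
    """Return the interval of the most specific matching prefix, if any."""
    host_rules = rules.get(host)
    if not host_rules:
        return None
    matches = [prefix for prefix in host_rules if path.startswith(prefix)]
    if not matches:
        return None
    # The longest matching prefix is the most specific rule.
    return host_rules[max(matches, key=len)]


print(revisit_days("va.gov", "/about"))              # 30
print(revisit_days("va.gov", "/news/archive.html"))  # 7
print(revisit_days("va.gov", "/news/today.html"))    # 1
print(revisit_days("example.gov", "/"))              # None
```

A structure like this could live in a config file and be validated before deployment; whether it stays this simple depends on how expressive the rules need to be.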
The rules will be expressed in Jsonnet, allowing us to check them statically.
Security Considerations
Required per CM-4.
Process checklist
- [ ] Has a clear story statement
- [ ] Can reasonably be done in a few days (otherwise, split this up!)
- [ ] Shepherds have been identified
- [ ] UX youexes all the things
- [ ] Design designs all the things
- [ ] Engineering engineers all the things
- [ ] Meets acceptance criteria
- [ ] Meets [QASP conditions](https://derisking-guide.18f.gov/qasp/)
- [ ] Presented in a review
- [ ] Includes screenshots or references to artifacts
- [ ] Tagged with the sprint where it was finished
- [ ] Archived

### If there's UI...
- [ ] Screen reader - Listen to the experience with a screen reader extension, ensure the information presented in order
- [ ] Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
- [ ] Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.