GSA-TTS / jemison

An exploration of the space of search

:calendar: Implement an `entree` service #9

Open jadudm opened 5 days ago

jadudm commented 5 days ago

At a glance

In order to visit things only once in a crawl, as a developer, I want a way to know when I last grabbed a page.

Acceptance Criteria

We use DRY behavior-driven development wherever possible.

### then...
- [ ] https://github.com/GSA-TTS/jemison/issues/12

Shepherd

Background

If we adhere to a politeness time of 2s/page (meaning, a domain only gets hit once every two seconds), then timing to fetch an entire site looks like...

| Pages | Time | Max freq |
| --- | --- | --- |
| 1 | 2s | hourly |
| 10 | 20s | hourly |
| 100 | 4m | hourly |
| 1000 | 30m | daily |
| 10,000 | 6h | daily |
| 100,000 | 2.5d | weekly |
| 1,000,000 | 25d | monthly* |

At 1M pages, we are edging into a crawl every two months (at best). For large domains (e.g. va.gov), this will likely push us toward treating subdomains as the primary target, and the rates for subdomains will vary. (Per-page timing within a domain will be more complex. We should assume it will be needed...)

At 1M pages, we might be less polite if we have to sustain a monthly cadence: 1s/page takes us down to ~12 days for a full crawl. Note that these numbers all expand somewhat when PDFs are factored in; although it is a single 'fetch' operation, a single PDF might be 2000+ pages.
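
The arithmetic behind the table can be sketched as follows. This is a minimal illustration, assuming strictly sequential fetches at one page per politeness interval; the function name `full_crawl_time` is hypothetical and not part of jemison. The table's figures are rounded, so the exact values come out slightly lower (about 23 days for 1M pages at 2s/page).

```python
from datetime import timedelta


def full_crawl_time(pages: int, politeness_seconds: float = 2.0) -> timedelta:
    """Time to fetch every page once, at one fetch per politeness interval."""
    return timedelta(seconds=pages * politeness_seconds)


# Print the exact durations for the page counts used in the table.
for pages in (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{pages:>9,} pages: {full_crawl_time(pages)}")
```

Passing `politeness_seconds=1.0` gives the less polite 1s/page scenario.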

An entree service keeps track of who we've visited, and potentially who we are yet to visit, and when.

sequenceDiagram
  participant W as Walk
  participant Q as Queue
  participant I as Entree
  participant S as DB
  W ->> Q: Enqueue page crawl
  I ->>+ Q: Check
  Q -->>- I: Job
  I ->>+ S: GET
  S -->>- I: Obj
  alt ready?
    I ->> Q: Fetch
    note left of I: update metadata
  end
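
The flow in the diagram could look roughly like the sketch below. Everything here is an assumption made for illustration: the `PageRecord` fields, the `handle_job` name, and the use of a plain `dict` and `list` as stand-ins for the DB and the queue. It also records `last_fetch` at enqueue time, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical metadata row; the real schema is not specified here."""
    last_fetch: Optional[datetime] = None
    min_interval: timedelta = timedelta(hours=1)


def is_ready(record: Optional[PageRecord], now: datetime) -> bool:
    """A page is ready if it was never fetched or its interval has elapsed."""
    if record is None or record.last_fetch is None:
        return True
    return now - record.last_fetch >= record.min_interval


def handle_job(url: str, db: dict, fetch_queue: list, now: datetime) -> bool:
    """Process one crawl job; return True if a fetch was enqueued."""
    record = db.get(url)        # I ->> S: GET, S -->> I: Obj
    if not is_ready(record, now):
        return False
    fetch_queue.append(url)     # I ->> Q: Fetch
    if record is None:
        record = PageRecord()
        db[url] = record
    record.last_fetch = now     # update metadata
    return True
```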

The metadata DB needs to hold everything we need to check whether a visit should happen. It probably contains...

The hash(scheme, host, port) is the primary key; this gives us uniqueness per page. The question is: how do we know if we should revisit? This involves the rules table, which is keyed on host, meaning the rules table will look like

The rule column contains a JSON object that defines a rule we will evaluate.

{
  "type": "frequency",
  "value": "hourly"  
}

The input to the rule is the row from the entree table. It is always a single rule, but a small language defines what is possible:

rule       := rule_exp
rule_exp   := {type: "all-of", rules: [ rule_exp+ ]}
            | {type: "one-of", rules: [ rule_exp+ ]}
            | {type: "frequency", value: FREQ_CONST}
            | {type: "path", regex: string}
FREQ_CONST := "hourly" | "daily" | "weekly" | "monthly" | "quarterly" | "bi-annually"

This allows expression of things like

{
  type: "one-of",
  rules: [
    {
      type: "all-of", rules: [
        {type: "path", root: "/something-frequent/*"},
        {type: "frequency", value: "weekly"},
      ]
    },
    {
      type: "all-of", rules: [
        {type: "host", root: "/*"},
        {type: "frequency", value: "weekly"},
      ]
    }
  ]
}

This way, a single rule can be written for a domain, and we can break out paths as needed, all the way down to individual pages.
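
One way such a rule might be evaluated is sketched below. It follows the grammar's `path`/`regex` form and makes several assumptions that are not in the original: the row is a dict with hypothetical `path` and `last_fetch` fields, `one-of` means "any sub-rule holds", a `frequency` rule holds once its period has elapsed since the last fetch, and the period lengths in `PERIODS` are approximations. `EXAMPLE_RULE` uses an hourly frequency for the frequent path purely for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed lengths for each FREQ_CONST; the source does not define them.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "quarterly": timedelta(days=91),
    "bi-annually": timedelta(days=182),
}


def should_visit(rule: dict, row: dict, now: datetime) -> bool:
    """Recursively evaluate a rule expression against one entree row."""
    kind = rule["type"]
    if kind == "all-of":
        return all(should_visit(r, row, now) for r in rule["rules"])
    if kind == "one-of":
        return any(should_visit(r, row, now) for r in rule["rules"])
    if kind == "frequency":
        last = row.get("last_fetch")
        return last is None or now - last >= PERIODS[rule["value"]]
    if kind == "path":
        return re.search(rule["regex"], row["path"]) is not None
    raise ValueError(f"unknown rule type: {kind!r}")


# Illustrative rule: one path hourly, everything else weekly.
EXAMPLE_RULE = {
    "type": "one-of",
    "rules": [
        {"type": "all-of", "rules": [
            {"type": "path", "regex": r"^/something-frequent/"},
            {"type": "frequency", "value": "hourly"},
        ]},
        {"type": "all-of", "rules": [
            {"type": "path", "regex": r"^/"},
            {"type": "frequency", "value": "weekly"},
        ]},
    ],
}
```

With this rule, a page under `/something-frequent/` fetched two hours ago is due again, while any other page fetched two hours ago is not.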

The rules will be expressed in Jsonnet, allowing us to check them statically.

Security Considerations

Required per CM-4.


Process checklist

- [ ] Has a clear story statement
- [ ] Can reasonably be done in a few days (otherwise, split this up!)
- [ ] Shepherds have been identified
- [ ] UX youexes all the things
- [ ] Design designs all the things
- [ ] Engineering engineers all the things
- [ ] Meets acceptance criteria
- [ ] Meets [QASP conditions](https://derisking-guide.18f.gov/qasp/)
- [ ] Presented in a review
- [ ] Includes screenshots or references to artifacts
- [ ] Tagged with the sprint where it was finished
- [ ] Archived

### If there's UI...

- [ ] Screen reader - Listen to the experience with a screen reader extension, ensure the information presented in order
- [ ] Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
- [ ] Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.

jadudm commented 1 day ago

The rules notion does not belong in the DB. It should be in a config file (initially).

jadudm commented 1 day ago

Instead of a ruleset (here), we should just compute next_fetch. If an override comes through (e.g. a fetch request that needs to ignore the date), we do that. However, the next fetch is based on a schedule, and can be computed. We don't need to do a computation every time.

Simple first.
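
A minimal sketch of the simpler approach, assuming a fixed interval per schedule name: `next_fetch` is computed once when a page is fetched and stored, and the readiness check is then a single comparison. The names `INTERVALS`, `compute_next_fetch`, and `should_fetch`, and the `force` flag standing in for the override, are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed interval per schedule name; the source does not define these.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


def compute_next_fetch(last_fetch: datetime, schedule: str) -> datetime:
    """Compute next_fetch once, at fetch time, so it can be stored."""
    return last_fetch + INTERVALS[schedule]


def should_fetch(next_fetch: Optional[datetime], now: datetime,
                 force: bool = False) -> bool:
    """Fetch if overridden, never scheduled, or the stored time has passed."""
    return force or next_fetch is None or now >= next_fetch
```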