edgi-govdata-archiving / web-monitoring-ui

UI to enable analysts to quickly assess changes to monitored government websites
GNU General Public License v3.0

Summary of my understanding of the task based on group call #16

Closed danielballan closed 7 years ago

danielballan commented 7 years ago

I invite others to comment on or edit the following. After discussion, maybe it can move to a README or somewhere more permanent. Apologies if it duplicates info available elsewhere.

There is a spreadsheet with a list of 30 000 URLs of government web pages of interest. The pages at these URLs are captured every ~3 days for PageFreezer. Roughly 80% of these URLs happen to be on the same ~150 domains and subdomains (all .gov). Weekly, these subdomains are captured recursively and stored in zip files.

Current workflow: Versionista flags changes to 25 000 URLs. Our versionista-outputter runs and writes a row for each changed URL into a series of CSVs, one for each of the ~100 domains being tracked. These CSVs are distributed to analysts, who copy the rows into Google spreadsheets (this step should be automated). Analysts then hand-label rows using 12 different "types of changes" and 6 different "types of significance"; each row can receive multiple labels. This system, in which no filters are used, typically produces ~3 000-5 000 changes over the course of 3 days from the total 25 000 URLs.
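
To make the per-domain CSV step concrete, here is a minimal Python sketch of grouping changed-URL rows by hostname and rendering one CSV per domain. It is only an illustration of the idea, not how versionista-outputter actually works: the column names (`url`, `changed_at`), the sample rows, and both helper functions are assumptions made up for this example.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit


def group_changes_by_domain(changes):
    """Group change rows by the hostname of their URL."""
    by_domain = defaultdict(list)
    for row in changes:
        host = urlsplit(row["url"]).hostname or "unknown"
        by_domain[host].append(row)
    return dict(by_domain)


def to_csv(rows, fieldnames):
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows; the real outputter's columns may differ.
changes = [
    {"url": "https://www.epa.gov/climate", "changed_at": "2017-02-06"},
    {"url": "https://www.epa.gov/air", "changed_at": "2017-02-07"},
    {"url": "https://energy.gov/policy", "changed_at": "2017-02-07"},
]

grouped = group_changes_by_domain(changes)
csv_by_domain = {
    domain: to_csv(rows, ["url", "changed_at"]) for domain, rows in grouped.items()
}
```

Writing each entry of `csv_by_domain` straight into a shared spreadsheet, instead of handing files to analysts, would be one way to automate the copy step.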

Our goal is to filter and/or prioritize those rows to direct analysts to important differences, and also to provide more useful columns that will help them judge which rows to follow up on. Additionally, one column should be a link to a visual diff, something which Versionista currently provides but PageFreezer currently does not.
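
As a rough sketch of what "prioritize rows and add a diff-link column" could look like, the Python below sorts rows by a score and attaches a `diff_url` column. Everything specific here is an assumption for illustration: the `DIFF_VIEWER` address is a placeholder rather than a real endpoint, the row fields (`old_version`, `new_version`, `chars_changed`, `chars_total`) are invented, and the "fraction of text changed" score is just one possible heuristic, not a method agreed on in the call.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a visual diff viewer; not a real endpoint.
DIFF_VIEWER = "https://diff.example.org/compare"


def add_diff_link(row):
    """Return a copy of the row with a 'diff_url' column added."""
    query = urlencode(
        {"url": row["url"], "a": row["old_version"], "b": row["new_version"]}
    )
    return {**row, "diff_url": f"{DIFF_VIEWER}?{query}"}


def prioritize(rows, score):
    """Sort rows from highest to lowest score, adding a 'priority' column."""
    scored = [{**row, "priority": score(row)} for row in rows]
    return sorted(scored, key=lambda r: r["priority"], reverse=True)


def changed_fraction(row):
    """Hypothetical score: share of the page text that changed (0.0 to 1.0)."""
    return row["chars_changed"] / max(row["chars_total"], 1)


# Hypothetical sample rows with invented fields.
rows = [
    {"url": "https://www.epa.gov/air", "old_version": "v1", "new_version": "v2",
     "chars_changed": 12, "chars_total": 4000},
    {"url": "https://www.epa.gov/climate", "old_version": "v7", "new_version": "v8",
     "chars_changed": 900, "chars_total": 3000},
]

ranked = prioritize([add_diff_link(r) for r in rows], changed_fraction)
```

Keeping the score as a plain function passed to `prioritize` means different heuristics can be swapped in without touching the sorting or the link column.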

To start, two-tier prioritization:

ambergman commented 7 years ago

Thanks so much for the summary @danielballan. Just added a little to the "Current Workflow" section, but I think it looks good, and I'm sorry I've been slow to get this workflow described so that the new tools being developed can fit in well. Definitely like the idea of getting it into a README - hopefully by the end of the weekend.

Also, if it's alright with you Dan, do you think we could retitle the issue something like: "Versionista Workflow and Two-Tier Prioritization"? I think it would be great for reference later.

titaniumbones commented 7 years ago

I wrote this on the plane and it looks to me like it's a little out of date now, but since I wrote my next-steps doc on the basis of it, I'm checking in here for feedback. I'll try to get my doc into a hackmd pad so folks can edit it.


For tier 1, how is this summary of the architecture we need:

And if that's the architecture, which of these do we want to target for tomorrow in SF, and which are already underway by @danielballan & others? I actually kind of hate that namespace so better names welcome.

dcwalk commented 7 years ago

Closing as this has been moved to the main repo wiki: https://github.com/edgi-govdata-archiving/web-monitoring/wiki/Group-Call-Summary-2017-02-08