edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Clarify README intro #105

Closed patcon closed 6 years ago

patcon commented 6 years ago

Are these improvements?

Also, I'm wondering whether some of the intro docs are out-of-date. Is the fact that there's currently "a lot of manual labor" still important to mention so prominently?

patcon commented 6 years ago

Would love to describe pipeline somewhere near the top as well:

https://envirodatagov.org/website-monitoring/

My impression of what our pipeline does is:

  1. Platform regularly scrapes target websites.
  2. Platform processes data and sifts out meaningful changes for volunteer analysts.
  3. Volunteer analysts further sift out meaningful changes for domain experts.
  4. Domain experts qualify meaningful changes for journalist partners.
  5. Journalist partners amplify stories for the wider public.

Does this sound right enough to put in README?

Mr0grog commented 6 years ago

Platform regularly scrapes target websites.

Maybe it should, but it doesn’t. It relies on other services to do this.

Platform processes data and sifts out meaningful changes for volunteer analysts.

We’d like it to, but it doesn’t.

Volunteer analysts further sift out meaningful changes for domain experts. Domain experts qualify meaningful changes for journalist partners.

These are the same people and there is no separate process here. I think I would just say “volunteers and domain experts work together to find meaningful changes and qualify them for journalists” or something.

patcon commented 6 years ago

Thanks @Mr0grog :)

I'll try some rephrases:

Platform ~~regularly~~ gathers periodic scrapes of target websites.

Does that work better?

Platform processes data and sifts out meaningful changes for volunteer analysts. (Not yet implemented.)

Having suggested the above, I'm having second thoughts: even the "diff comparison" we used to do in spreadsheet-land is still "processing", right? I know we have grander aspirations, but would it be simpler (and not too disingenuous) to avoid muddling things? Basically, if there's any filtering happening at all for analysts, I feel it might be OK to refer to "processing". The folks working on the project can worry about the minutiae of it not using ML yet. Curious about your thoughts!

Volunteers and ~~domain~~ experts work together to further sift out meaningful changes and qualify them for journalists ~~partners~~.

Thanks Rob, your reframe is great. The phrase I kept related to "sifting" because "find meaningful changes" is not a phrase we can easily keep using in multiple steps. (Reader's inner voice: "Why is it finding again, when it was found in the step before?") "Sift" and "further sift" were my way of minimizing the noise of new words/concepts reading from line to line. I'd prefer to keep that, if there are no strong feelings or alternative suggestions.

So the whole thing becomes:

  1. Platform gathers periodic scrapes of target websites.
  2. Platform processes data to sift out meaningful changes for volunteer analysts. (Not yet implemented.)
  3. Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists.
  4. Journalists amplify stories for the wider public.

patcon commented 6 years ago

Happy to move this to a GDoc if it's better to work there

Mr0grog commented 6 years ago

Platform gathers periodic scrapes of target websites.

This still sounds like it’s saying we scrape target websites. I think this needs to be clear so people aren’t confused or looking for something that’s not here: no part of our infrastructure ever touches epa.gov or doe.gov or any of the websites we are monitoring.

Basically, if there's any filtering happening at all for analysts, I feel it might be ok to refer to "processing".

Let me be extra clear here: there is no filtering happening for analysts, period. There is no part of this system that currently does anything proactive before analysts touch it and make a decision except collecting snapshots of web pages. There is no processing, filtering, or analysis of any sort at all applied to them before an analyst looks at them.

"find meaningful changes" is not a phrase we can easily keep using in multiple step. (Reader inner voice:"Why is it finding again, when it was found in the step before?")

Right. What I’m trying to make clear here is: nothing was found in the step before. That is specifically why I did not say “further.” If you want to keep the previous step but say “(not implemented),” then sure, “further” makes sense.

danielballan commented 6 years ago

Hey @patcon, do you have the bandwidth to push this across the finish line?

Mr0grog commented 6 years ago

Screw it, that didn’t work at all. I’ll add links to the line on each of my comments 🙄

lightandluck commented 6 years ago

@Mr0grog I think I was able to address all comments. Let me know how it looks. Thanks!

lightandluck commented 6 years ago

Woot 🎉
Thanks @patcon for spear-heading this and @Mr0grog for taking the time to review =)