Closed patcon closed 6 years ago
Would love to describe pipeline somewhere near the top as well:
https://envirodatagov.org/website-monitoring/
My impression of what our pipeline does is:
Does this sound right enough to put in README?
Platform regularly scrapes target websites.
Maybe it should, but it doesn’t. It relies on other services to do this.
Platform processes data and sifts out meaningful changes for volunteer analysts.
We’d like it to, but it doesn’t.
Volunteer analysts further sift out meaningful changes for domain experts. Domain experts qualify meaningful changes for journalist partners.
These are the same people and there is no separate process here. I think I would just say “volunteers and domain experts work together to find meaningful changes and qualify them for journalists” or something.
Thanks @Mr0grog :)
I'll try some rephrases:
Platform ~regularly~ gathers periodic scrapes of target websites.
That work better?
Platform processes data and sifts out meaningful changes for volunteer analysts. (Not yet implemented.)
Having suggsted the above, I'm having second thoughts: even the "diff comparison" we used to do in spreadsheet-land is still "processing", right? I know we have grander aspirations, but would it be simpler (and not too disingenuous) to avoid muddling things? Basically, if there's any filtering happening at all for analysts, I feel it might be ok to refer to "processing". And the folks working on the project can be concerned with the minutiae of that not using ML yet. Curious your thoughts!
Volunteers and ~domain~ experts work together to further sift out meaningful changes and qualify them for journalists ~partners~.
Thanks rob, your reframe is great. The phrase I kept related to "sifting" was because "find meaningful changes" is not a phrase we can easily keep using in multiple step. (Reader inner voice:"Why is it finding again, when it was found in the step before?") "Sift" and "further sift" was my way of minimizing the noise of new words/concepts reading from line to line. I'd prefer to keep that, if no strong feelings or alternative suggestions.
So the whole thing becomes:
Happy to move this to a GDoc if it's better to work there
Platform gathers periodic scrapes of target websites.
This still sounds like it’s saying we scrape target websites. I think this needs to be clear so people aren’t confused or looking for something that’s not here: no part of our infrastructure ever touches epa.gov
or doe.gov
or any of the websites we are monitoring.
Basically, if there's any filtering happening at all for analysts, I feel it might be ok to refer to "processing".
Let me be extra clear here: there is no filtering happening for analysts, period. There is no part of this system that currently does anything proactive before analysts touch it and make a decision except collecting snapshots of web pages. There is no processing, filtering, or analysis of any sort at all applied to them before an analyst looks at them.
"find meaningful changes" is not a phrase we can easily keep using in multiple step. (Reader inner voice:"Why is it finding again, when it was found in the step before?")
Right. What I’m trying to make clear here is: nothing was found in the step before. That is specifically why I did not say “further.” If you want to keep the previous step but say “(not implemented),” then sure, “further” makes sense.
Hey @patcon, do you have the bandwidth to push this across the finish line?
Screw it, that didn’t work at all. I’ll add links to the line on each of my comments 🙄
@Mr0grog I think I was able to address all comments. Let me know how it looks. Thanks!
Woot 🎉
Thanks @patcon for spear-heading this and @Mr0grog for taking the time to review =)
Are these improvements?
Also, I'm wondering whether some of the intro docs are out-of-date. Is the face that there\'s currently "a lot of manual labor" still important to mention so prominently?