-
Long ago, we worked around an issue where we were getting lots of connection failures from Wayback with a dirty hack: if we ran out of retries but still had a failure to establish a new connection, we…
-
Some pages have a `<link rel="canonical">` element in their markup, indicating a correct, “canonical” URL for the page (some more info here: https://en.wikipedia.org/wiki/Canonical_link_element). When importing data from …
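
For reference, here is a minimal sketch of how a canonical URL could be read out of a page's markup using only the Python standard library. The names `CanonicalLinkParser` and `find_canonical_url` are hypothetical and do not come from any existing codebase; it only looks at the first `<link>` whose `rel` includes `canonical`, and it does not resolve relative URLs.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> it encounters."""

    def __init__(self):
        super().__init__()
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        # Only <link> elements matter, and only the first canonical one.
        if tag != "link" or self.canonical_url is not None:
            return
        attributes = dict(attrs)
        # `rel` is a space-separated, case-insensitive list of link types.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical_url = attributes["href"]


def find_canonical_url(html):
    """Return the canonical URL declared in `html`, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical_url
```

Something like `find_canonical_url(response_text)` could then be compared against the URL a page was actually captured from, though how that result should be used is a separate design question.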
-
I see the rationale for only letting users with, say, the "bagger" role _edit_ the bag section, but is there a good reason not to let all users _see_ all the data fields, particularly since work done …
-
We should consider factoring the absolute number of changed characters or words into how textual changes contribute to priority. In extremely large pages, even a large change (which is worth lo…
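
One possible shape for this, as a rough sketch only: score the change both relative to page size and by its absolute size, then take the larger of the two. The function name `text_change_priority`, the `absolute_scale` parameter, and the exponential saturation curve are all assumptions made for illustration, not the current priority calculation.

```python
import math


def text_change_priority(changed_chars, total_chars, absolute_scale=500):
    """Blend relative and absolute change size into a score from 0 to 1.

    `absolute_scale` is a hypothetical tuning knob: the number of changed
    characters at which the absolute score reaches roughly 0.63.
    """
    if total_chars <= 0 or changed_chars <= 0:
        return 0.0
    # Fraction of the page that changed, capped at 1.
    relative = min(changed_chars / total_chars, 1.0)
    # Saturating score based only on how much text changed.
    absolute = 1 - math.exp(-changed_chars / absolute_scale)
    # A change counts as important if it is large by either measure.
    return max(relative, absolute)
```

With these assumed defaults, 2,000 changed characters in a 1,000,000-character page is only 0.2% of the page but still scores above 0.9, while a 50-character change in a 100-character page keeps its relative score of 0.5.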
-
In accordance with edgi-govdata-archiving/overview#217, we’re planning to archive this repo — it hasn’t seen updates in a year and a half and isn’t being actively maintained. It was originally made to…
-
In the HTML diff, we have a [minimum diff length of 2 tokens](https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/html_diff_render.py#L1624-L1638) (inherited…
-
We currently go through a lot of effort to make our added/removed markup sit _inside_ “block-level” tags and _outside_ other, “inline” tags (see [`merge_changes()`](https://github.com/edgi-govdata-arc…
-
* *Agency*: US Department of Agriculture
* *Agency Division*: Forest Service
* *Data Type*: US forest inventory plots
* *Data Format*: CSV, ZIP
* *FTP/HTTP URL*: https://apps.fs.usda.gov/fia/data…
-
It would be great to generate a summary of work done at an event, so attendees can enjoy the sense of accomplishment (which will hopefully motivate them to keep working on the project!), without organi…
-
_From @librlaurie on February 2, 2017 19:25_
Checker role:
Answers three questions:
1. Does the data match what's on the website (and how did you test that?)
1. Are all of the files there that nee…