titaniumbones closed this issue 5 years ago
A sketched out timeline:
And process on this:
Other things to check:
OK, notes regarding the above:
Some quick numbers from what we currently have, just to start to get a feel for where things are at:
For lists of EPA datasets that exist, there is:
So over the last 12 hours or so the lists of urls to examine have kept pouring in, and the number of deduplicated urls to examine remains in the thousands. Sorting, prioritizing, and checking this list is going to take some time, as I don't want to just hand over a list of thousands of urls without engaging in at least some grouping, checking & deduplicating before passing off for fine-grained examination. The best place to meet in relation to confirmed work that needs archiving is still archivers 1.0, which Margaret has been doing an incredible job of building on based on Max's list. Any url that we can't account for will pass through there at some point.
I'm going to put time into collating the much larger list of raw urls into sensible groupings before passing off for consideration. This is going to take until Friday to assemble. I'll keep posting updates here as I go.
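For anyone who wants to help with the collating, here is a minimal sketch of the dedupe-and-group step, assuming the raw lists are plain strings with one url each. The function names (`normalize`, `group_by_host`) and the normalization rules are my own assumptions for illustration, not what the actual tooling does.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    These rules are an assumption; real deduplication may need more
    (or fewer) of them depending on how the target site treats urls.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def group_by_host(urls):
    """Deduplicate urls after normalizing, then group them by hostname."""
    groups = defaultdict(set)
    for raw in urls:
        if not raw.strip():
            continue  # skip blank lines in the raw list
        clean = normalize(raw)
        groups[urlsplit(clean).netloc].add(clean)
    # Sort each group so the output is stable between runs.
    return {host: sorted(members) for host, members in groups.items()}
```

Grouping by host is only a first pass; groups could be split further by path prefix before handing them off for fine-grained examination.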
Ok, so I really need to double check this work, but I have some preliminary results:
TL;DR: We've archived 37% of the EPA datasets we know about.
I'm planning on doing a deep dive on this in the near future, as there are lots & lots of insights to talk about.
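As a rough illustration of how a coverage figure like this can be computed, here is a sketch assuming two lists of dataset identifiers (for example, normalized urls): one for everything we know about and one for what has been archived. The function name `coverage` and the inputs are hypothetical, and this is not the script that produced the number above.

```python
def coverage(known, archived):
    """Return (fraction of known items archived, sorted list of missing items).

    Items in `archived` that are not in `known` are ignored, since the
    question is how much of the known list has been captured.
    """
    known_set = set(known)
    if not known_set:
        return 0.0, []
    archived_set = set(archived)
    done = known_set & archived_set
    missing = sorted(known_set - archived_set)
    return len(done) / len(known_set), missing
```

The `missing` list is the part worth passing along, since it is what still needs to be archived.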
@b5 does it make sense to reach back to Max Ogden with these findings and see what of this list the Dat Project has already captured? That could cut the list down further.
If I'm not mistaken, the Dat Project isn't directly engaged in archiving work, but I'll be in Portland for csvconf on Monday and will try to sync up with the Dat folks IRL.
Either way we'll start into this list over the weekend and see what shakes out.
@b5 We're (@patcon @dcwalk @shaqsingh) all just sitting here thinking about EPA and I'm looking at your old next-steps list from a month ago:
In particular, wondering if you've done the first step -- if not, how hard would it be to move on that? e.g., is it something I can do? (Is there a way for a user to submit a list of URLs, for instance?)
Just a quick update on this: I've been working on the JSON-LD crawl and should have more to report by the end of the week. I haven't forgotten, and will report back here!
@b5 provided an update in Slack --
Crawling JSON-LD and reviewing the post-crawl report haven't happened. This isn't stalled per se, but the goal is to get the Data Together crawler closer to the WARC spec before crawling.
@b5 will get a milestone & issues for WARC parity going, and I will reticket appropriate tasks from this here. ETA: Aug 15, evening.
Once #199 and #196 are resolved, I think this is good to close. In the interim, moving to the Fall Work Cycle milestone based on our September 11 Archiving call, as this was indicated as an ongoing and important priority.
Per our new stale issues policy:
This issue has been marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
In the future, a robot will take care of this process!
Closing because https://github.com/edgi-govdata-archiving/overview/issues/199 and https://github.com/edgi-govdata-archiving/overview/issues/196 are closed
Update 2017/08/17 (from @dcwalk): This depends on #199; once that happens, this can be resolved!
Our records regarding which EPA datasets we care about and have tried to acquire are scattered. We need to create as definitive a list of these as possible. This means, at a minimum, reconciling the following information sources:
(edit-@dcwalk merged in their list:)
Downloaded data: