edgi-govdata-archiving / overview

🎈 Start here for current projects, how to get involved, and joining community calls, a resource for new and veteran members
GNU General Public License v3.0

Establish how many EPA datasets we have identified, and determine their disposition #120

Closed titaniumbones closed 5 years ago

titaniumbones commented 7 years ago

Update 2017/08/17 (from @dcwalk): This depends on #199; once that happens, this can be resolved!


Our records regarding which EPA datasets we care about and have tried to acquire are scattered. We need to create as definitive a list of these as possible. This means, at a minimum, reconciling the following information sources:

(edit-@dcwalk merged in their list:)

Downloaded data:

dcwalk commented 7 years ago

A sketched out timeline:

And process on this:

  1. TUES Toronto we’ll be working in the archivers.space app to research/harvest the list of datasets
  2. Guide upcoming datarescues to do the same, and reach out to more groups to do so
  3. Add more uncrawlables to archivers.space for EPA from the reconciliation across dataset tracking

dcwalk commented 7 years ago

Other things to check:

b5 commented 7 years ago

Ok notes regarding above:

Stats

Some quick numbers from what we currently have, just to start to get a feel for where things are at:

chrome extension

archivers 1.0 epa.gov urls:

archivers 2.0 epa.gov urls:

Next Steps:

murphyofglad commented 7 years ago

For lists of EPA datasets that exist, there is:

b5 commented 7 years ago

So over the last 12 hours or so the lists of urls to examine have kept pouring in, and the number of deduplicated urls to examine remains in the thousands. Sorting, prioritizing, and checking this list is going to take some time, as I don't want to just hand over a list of thousands of urls without engaging in at least some grouping, checking & deduplicating before passing off for fine-grained examination. The best place to meet in relation to confirmed work that needs archiving is still archivers 1.0, which Margaret has been doing an incredible job of building on based on Max's list. Any url that we can't account for will pass through there at some point.

I'm going to put time into collating the much larger list of raw urls into sensible groupings before passing off for consideration. This is going to take until Friday to assemble. I'll keep posting updates here as I go.
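For anyone following along, the collating step described above (merge the incoming lists, deduplicate, then group before hand-off) could look roughly like the sketch below. This is an illustration only, not the tooling actually used here: the function names `normalize` and `dedupe_and_group` are hypothetical, and grouping by hostname is an assumption about what a "sensible grouping" might be.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_and_group(url_lists):
    """Merge several URL lists, deduplicate them, and group by hostname.

    Grouping by hostname is an assumed strategy, chosen only for illustration.
    """
    groups = defaultdict(set)
    for urls in url_lists:
        for url in urls:
            if not url.strip():
                continue  # skip blank lines in the raw lists
            clean = normalize(url)
            groups[urlsplit(clean).netloc].add(clean)
    # Sort each group so the output is stable between runs.
    return {host: sorted(found) for host, found in groups.items()}
```

Normalizing before deduplicating matters because the same page often shows up with different casing, a trailing slash, or a `#fragment` depending on which list it came from.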

b5 commented 7 years ago

Ok, so I really need to double check this work, but I have some preliminary results:

TL;DR: We've archived 37% of the EPA datasets we know about

I'm planning on doing a deep dive on this in the near future, as there are lots & lots of insights to talk about.
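As a rough sketch of how a coverage figure like this can be derived, assuming the known datasets and the archived ones are both available as lists of already-normalized URLs: the helper name `archive_coverage` is hypothetical, and this is not necessarily how the number above was computed.

```python
def archive_coverage(known_urls, archived_urls):
    """Return the percentage of known URLs that also appear in the archived set."""
    known = set(known_urls)
    if not known:
        return 0.0  # avoid dividing by zero when nothing is known yet
    # Only count archived URLs that are on the known list.
    archived = known & set(archived_urls)
    return 100 * len(archived) / len(known)
```

Archived URLs that are not on the known list are ignored here, so the result can never exceed 100.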

Initial observations:

Immediate Questions:

Next Steps:

murphyofglad commented 7 years ago

@b5 does it make sense to reach back to Max Ogden with these findings and see what on this list the Dat Project has already captured? That could cut the list down further.

b5 commented 7 years ago

If I'm not mistaken, the Dat project isn't directly engaged in archiving work, but I'm going to be in Portland for csvconf on Monday, so I'll try to sync up with the Dat folks IRL.

Either way we'll start into this list over the weekend and see what shakes out.

titaniumbones commented 7 years ago

@b5 We're (@patcon @dcwalk @shaqsingh) all just sitting here thinking about EPA and I'm looking at your old next-steps list from a month ago:

In particular, wondering if you've done the first step -- if not, how hard would it be to move on that? e.g., is it something I can do? (is there a way for a user to submit a list of URLs, for instance?)

b5 commented 7 years ago

Just a quick update on this: I've been working on the JSON-LD crawl and should have more to report by the end of the week. I haven't forgotten, and will report back here!

dcwalk commented 7 years ago

@b5 provided an update in Slack --

Crawling JSON-LD and review of the post-crawl report haven't happened. This isn't stalled per se, but the goal is to get the Data Together crawler closer to the WARC spec before crawling.

@b5 will get a milestone & issues for WARC parity going, and I will reticket appropriate tasks from this here. ETA AUG 15 evening.

dcwalk commented 7 years ago

Once #199 and #196 are resolved, I think this is good to close. In the interim, I'm moving it to the Fall Work Cycle milestone based on our September 11 Archiving call, as this was indicated as an ongoing and important priority.

Frijol commented 5 years ago

Per our new stale issues policy:

This issue has been marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

In the future, a robot will take care of this process!

Frijol commented 5 years ago

Closing because https://github.com/edgi-govdata-archiving/overview/issues/199 and https://github.com/edgi-govdata-archiving/overview/issues/196 are closed