Establish how many EPA datasets we have identified, and determine their disposition

titaniumbones commented 7 years ago

Update 2017/08/17 (from @dcwalk): This depends on #199, once that happens this can be resolved!

Our records regarding which EPA datasets we care about and have tried to acquire are scattered. We need to create as definitive a list of these as possible. This means, at a minimum, reconciling the following information sources:

(edit-@dcwalk merged in their list:)

pre-datarefuge list of uncrawlables. Maintained in a spreadsheet tracks downloaded data
several google spreadsheets from the data refuge era. ~Maintained by @trinberg~ I think this is archived tracks downloaded data
master chrome seed sheet @trinberg does not track downloaded data (shared privately)
archivers 1.0 listings. Maintained by @b5 tracks uncrawlables, 'crawlable,' and downloaded data
archivers 2.0 listings. Maintained by @b5 tracks downloaded data (different model)
SC EPA datasets of concern does not track downloaded data, most likely covered (shared privately, ping @dcwalk)
Internal EDGI EPA tracking sheet downloading tracked in archivers.space (supersedes earlier list)

Downloaded data:

datarefuge.org CKAN instance. Maintained by datarefuge listing of final downloaded data
climate-mirror
DR Boulder Platform

dcwalk commented 7 years ago

A sketched out timeline:

[x] @b5 EDG (edg.epa.gov) set up in alpha.archivers.space ASAP 0424
[x] @b5 de-dupe a list of the EPA datasets we know about, with their status across sources ASAP ~~0424~~ 0428
[x] @dcwalk and @patcon proposed model for downloading TUES 0425
[x] @dcwalk audit and check-in hanging uncrawlables in archivers.space 1.0 TUES 0425
[x] @b5 add remaining uncrawlables to archivers.space for EPA based on reconciliation across datasets FRI 0428
[x] @dcwalk check IA for ftp crawls FRI 0428

And process on this:

TUES Toronto we’ll be working in the archivers.space app to research/harvest the list of datasets
Guide upcoming datarescues to do the same, and reach out to more groups to do so
Add more uncrawlables to archivers.space for EPA from the reconciliation across dataset tracking

dcwalk commented 7 years ago

Other things to check:

EDG FTP against Internet Archive Collections
IA API: https://archive.org/wayback/available?url=www.example.com
Developer Central: https://developer.epa.gov/
Open Data: http://opendata.epa.gov/
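
A minimal sketch of how the IA availability check listed above could be scripted. The endpoint is the one given in this comment; the function names are hypothetical, and the response shape (`archived_snapshots` / `closest`) is assumed from the public Wayback availability API, so treat it as a starting point rather than a finished tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the comment above.
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability-API request URL for one target URL."""
    return WAYBACK_AVAILABLE + "?" + urlencode({"url": url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded response, or None.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": true, "url": ...}}}.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(url, timeout=30):
    """Query the live API for one URL (needs network access)."""
    with urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("edg.epa.gov")` would return a snapshot URL if the Wayback Machine reports one, and `None` otherwise.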

b5 commented 7 years ago

Ok notes regarding above:

datarefuge.org & archivers 1.0 should be properly linked, I'll do some work in the coming days to confirm, but the list in archivers 1.0 should be more usable, as it also contains links marked "crawlable", and connections to final s3 urls.
We're now crawling edg.epa.gov, will report back soon
What is the "Internal EDGI EPA tracking sheet"? Can someone DM me on slack?

Stats

Some quick numbers from what we currently have, just to start to get a feel for where things are at:

chrome extension

unique uncrawlable epa.gov urls: 7 286

archivers 1.0 epa.gov urls:

total: 252
crawlable: 66
research: 78
harvest: 69
bag: 3
done: 34

archivers 2.0 epa.gov urls:

total known urls: 616 941
total that have been checked with a HEAD request: 239 869
total that have been checked with HEAD and GET requests: 38 725

Next Steps:

If you have a list of urls, I'd love to see them!
There are 78 epa.gov urls in the research phase and 69 in the harvest phase in archivers 1.0. I think the best place to start for volunteers would be to go through these two lists within archivers 1.0 looking for low-hanging fruit (things that can be marked crawlable, easily harvested, etc).
I'll get a quick new flow up in archivers 2.0 for listing chrome extension uncrawlables for research & sending to archivers 1.0. This'll start to cut way down on the 7 286 uncrawlable number. Many of those must in fact be crawlable. Look for this to land tomorrow sometime.
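
For anyone sending in url lists, here is a rough sketch of the kind of reconciliation involved: normalize each url, then record which sources mention it. The normalization rules (ignore scheme, lowercase the host, drop trailing slashes and fragments) and the source names are assumptions for illustration, not how archivers 1.0 or 2.0 actually dedupe.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a url to a comparable form.

    Assumes the url includes a scheme (http:// or https://).
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Treat http and https as the same resource and drop any fragment.
    return urlunsplit(("http", host, path, parts.query, ""))


def reconcile(sources):
    """Map each normalized url to the sorted names of sources listing it.

    `sources` is a dict of source name -> iterable of raw urls.
    """
    seen = {}
    for name, urls in sources.items():
        for url in urls:
            seen.setdefault(normalize(url), set()).add(name)
    return {url: sorted(names) for url, names in seen.items()}


merged = reconcile({
    "chrome-extension": ["https://WWW.EPA.GOV/data/"],
    "archivers-1.0": ["http://www.epa.gov/data", "http://www.epa.gov/other"],
})
```

Urls that appear under only one source name are the ones that still need to be carried over to the other tracker.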

murphyofglad commented 7 years ago

For lists of EPA datasets that exist, there is:

the json.ld file from the EPA. Even if links are dead it is a comprehensive list
Max ogden's list, the parsed list from the environment data gateway json.ld file
The spreadsheet @mjanz is working on based on Ogden's list

b5 commented 7 years ago

So over the last 12 hours or so the lists of urls to examine have kept pouring in, and the number of deduplicated urls to examine remains in the thousands. Sorting, prioritizing, and checking this list is going to take some time, as I don't want to just hand over a list of thousands of urls without engaging in at least some grouping, checking & deduplicating before passing off for fine-grained examination. The best place to meet in relation to confirmed work that needs archiving is still archivers 1.0, which Margaret has been doing an incredible job of building on based on Max's list. Any url that we can't account for will pass through there at some point.

I'm going to put time into collating the much larger list of raw urls into sensible groupings before passing off for consideration. This is going to take until Friday to assemble. I'll keep posting updates here as I go.

b5 commented 7 years ago

Ok, so I really need to double check this work, but I have some preliminary results:

TL;DR; We've archived 37% of the EPA datasets we know about

total urls: 18476
total archived: 6842
gist of results
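
As a quick check, the 37% figure follows directly from the two totals above:

```python
total_urls = 18476      # "total urls" reported above
total_archived = 6842   # "total archived" reported above

share = total_archived / total_urls
print(f"{share:.1%}")   # prints 37.0%
```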

I'm planning on doing a deep dive on this in the near future, as there are lots & lots of insights to talk about.

Initial observations:

The list of JSON-LD links @maxogden provided us contains a ton of previously unidentified data
This might seem disheartening, but a lot of this list can be archived through automation
Doing this audit has been incredibly insightful. I think it has potential to be a really helpful tool for showing off the work of the community and other collaborating organizations.

Immediate Questions:

@trinberg: after filtering down the nomination tool output for EPA pages marked as "uncrawlable", there were a whopping 67 EPA urls marked uncrawlable. Something tells me this number is off ;) If you have a minute I'd like to chat about that figure & make sure it's accurate.

Next Steps:

Point the archivers 2.0 crawler at the JSON-LD urls list. A lot of those links are dead / useless based on some initial spot-checking.
Examine the list of non-404 urls reported post-crawl, confirm those are in archivers.space
Build a visualization of this completion, with archiving projects marked. That json list is complete, but it's huge and difficult to understand, a visualization will really help us guide our prioritization.
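
The second step above could look something like the sketch below. The post-crawl report format (pairs of url and HTTP status) and the function name are hypothetical; the real archivers 2.0 output may be shaped differently.

```python
def unaccounted(report, known_urls):
    """Return non-404 urls from a crawl report that are not yet tracked.

    `report` is an iterable of (url, status_code) pairs (assumed format).
    `known_urls` is the set of urls already listed in archivers.space.
    """
    known = set(known_urls)
    live = [url for url, status in report if status != 404]
    return [url for url in live if url not in known]


# Illustrative data only, not real crawl results.
report = [
    ("http://example.gov/a.csv", 200),
    ("http://example.gov/b.csv", 404),
    ("http://example.gov/c.csv", 200),
]
missing = unaccounted(report, {"http://example.gov/a.csv"})
```

Here `missing` holds the urls that responded but still need to be added to archivers.space.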

murphyofglad commented 7 years ago

@b5 does it make sense to reach back to Max Ogden with these findings and see what of this list the Dat Project already has captured? That could cut the list down further.

b5 commented 7 years ago

If I'm not mistaken the Dat project isn't directly engaged in archiving work, but I'm going to be in Portland for csvconf on Monday, so I'll try to sync up with the Dat folks IRL.

Either way we'll start into this list over the weekend and see what shakes out.

titaniumbones commented 7 years ago

@b5 We're (@patcon @dcwalk @shaqsingh) all just sitting here thinking about EPA and I'm looking at your old next-steps list from a month ago:

[ ] Point the archivers 2.0 crawler at the JSON-LD urls list. A lot of those links are dead / useless based on some initial spot-checking.
[ ] Examine the list of non-404 urls reported post-crawl, confirm those are in archivers.space
[ ] Build a visualization of this completion, with archiving projects marked. That json list is complete, but it's huge and difficult to understand, a visualization will really help us guide our prioritization.

In particular, wondering if you've done the first step -- if not, how hard would it be to move on that? e.g., is it something I can do? (is there a way for a user to submit a list of URLs, for instance?)

b5 commented 7 years ago

Just a quick update on this: I've been working on the JSON-LD crawl and should have more to report by the end of the week. I haven't forgotten, and will report back here!

dcwalk commented 7 years ago

@b5 provided an update in Slack --

Crawling the JSON-LD list and reviewing the post-crawl report haven't happened yet. This isn't stalled per se, but the goal is to get the Data Together crawler closer to the WARC spec before crawling.

@b5 will get a milestone & issues for WARC parity going, and I will reticket appropriate tasks from this here. ETA AUG 15 evening.

dcwalk commented 7 years ago

Once #199 and #196 are resolved, I think this is good to close. In the interim, moving to the Fall Work Cycle milestone based on our September 11 Archiving call...as this was indicated as an ongoing and important priority.

Frijol commented 5 years ago

Per our new stale issues policy:

This issue has been marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

In the future, a robot will take care of this process!

Frijol commented 5 years ago

Closing because https://github.com/edgi-govdata-archiving/overview/issues/199 and https://github.com/edgi-govdata-archiving/overview/issues/196 are closed
