EDGI / End of Term Project Group

DGdodston commented 7 years ago

About the group

This Chi Hack Night breakout group collaborates with #DataRefuge, the Environmental Data & Governance Initiative (EDGI), and the Internet Archive’s End of Term 2016 project to archive federal web pages and data that are in danger of disappearing during the Trump administration. The group is focused on preserving information and data from agencies whose environmental programs and data are at high risk of being removed from online public access or even deleted. This project is urgent because the Trump transition team has identified these programs as priorities for cutting or change.

Upcoming EDGI / End of Term Project Group breakout sessions during Chi Hack Night: 1/17, other TBA dates

More info:

End of Term Presidential Harvest 2016 (University of North Texas): http://digital2.library.unt.edu/nomination/eth2016
End of Term Web Archive: http://eotarchive.cdlib.org
Environmental Data & Governance Initiative: https://envirodatagov.org
"Preserving U.S. Government Websites and Data as the Obama Term Ends" (Internet Archive): http://ow.ly/vEow307MMoQ
Tools for Government Data Archiving: https://github.com/edgi-govdata-archiving

Group leaders: Karl-Rainer Blumenthal, Dan Godston

Stay tuned for updates!

kblumenthal commented 7 years ago

Thanks for getting the ball rolling, Dan! Here are more-more resources for those interested in the work we can do together :-)

End of Term 2016's front end for background info and nominating URLs to be crawled and preserved: http://digital2.library.unt.edu/nomination/eth2016/
DataRescuePhilly, a complementary (but bigger!) effort that will commence at UPenn, January 13/14: http://www.ppehlab.org/datarefugephilly/

I got to speak with the Philly folks a little bit the other day. They will bring us up to speed on their work (the results as well as the event resources) after their event, and would likewise like to help us promote our effort through their blog if and when we can give them a short description of our group/event. Let's have everyone interested meet as a breakout group at the 1/10 hack night to strategize, identify the kinds of data we have the skills to rescue, and plan out a super-productive 1/17 event open to all who can help!

DGdodston commented 7 years ago

Karl,

Thanks for providing the additional resources, and for leading tomorrow's breakout session!

Dan

dcwalk commented 7 years ago

Hey, @dcwalk from Toronto chiming in! I'm involved with the Tools for Government Data Archiving project (https://github.com/edgi-govdata-archiving), which came out of the Dec 17 event.

We are going to be sprinting on our docs on Jan 10 at the same time during the Toronto Civic Tech night (https://www.meetup.com/Civic-Tech-Toronto/events/236242168/).

kblumenthal commented 7 years ago

Woo! Thanks, @dcwalk! We'll be sure to check the docs after your meetup. That's good timing.

We'll have less uninterrupted time for the hack element of our own effort than Toronto or Philly, I assume, so most of all I want to make sure that we avoid spending ours on redundant work. Anything you can do to catch us up on domains, datasets, or data types that have or haven't been addressed in Toronto would be very helpful.

codersquid commented 7 years ago

Is there a civic tech group in the SF/Bay area that is working on this in person? I'd like to share the information with someone who is interested.

kblumenthal commented 7 years ago

Not that I've heard of yet, @codersquid, but I'll ask folks at the Internet Archive and report back if they know any better. I bet that a few plucky folks could take the info above to quickly put one together, though ;-)

dcwalk commented 7 years ago

Just chiming in, @codersquid: the EDGI people (I think the same ones @kblumenthal mentioned) have been pairing up interested folks, but don't have a public list of 'in progress' events.

Ones that are scheduled are here: https://envirodatagov.org/events/

dcwalk commented 7 years ago

RE: not duplicating efforts...

At the Dec 17 event we focused on EPA. While a good chunk was done (much of it directly nominated into the End of Term Internet Archive, with some custom tools developed to crawl/scrape stuff that wouldn't be picked up), there are still some EPA gaps, mostly the regional offices. The #datarefuge event is focusing on another agency, NOAA, so it might make sense not to tackle that either, but instead to pick a well-scoped section of an agency.

There are a couple strategies:

1. Tackle an agency/program based upon an Agency Forecast and aim to directly seed using the eth16 bookmarklet
2. Look at a regional office and aim to nominate more seeds using the bookmarklet
3. Address one particular dataset that would prove challenging for the Internet Archive to crawl. This primer on How the Internet Archive works and the whole event toolkit should help, as well as the tools we have in our GitHub. This would take the most time to get up and running, so bear that in mind if time is a limiting factor

Sorry for the delay in this response; I was just sorting out a good approach to suggest, @kblumenthal. Ping me if you'd like to discuss more.

kblumenthal commented 7 years ago

This is perfect for our needs tonight, @dcwalk -- thank you so much! Hopefully we can get grounded and started with parts 1 and 2 above tonight, and identify the resources with further needs for focused work in part 2 (next week).

bkirkbri commented 7 years ago

For anyone interested in mirroring climate datasets:

https://climatemirror.org
Spreadsheet of resources requested for mirroring

kblumenthal commented 7 years ago

A small group of volunteers met at the January 10 Chi Hack Night to review the progress made and/or planned by data rescuers in Toronto and Philadelphia, identify potential sources of at-risk data, and survey the tools available to collect and preserve them. From this initial conversation we outlined the following objectives for our January 17 mini-hackathon (according to my notes -- please add/update/edit!):

1. Seed the End of Term 2016 project

The first objective of the night is to provide as many URLs for web-based and at-risk resources as we can identify to the End of Term (EOT) 2016 project. These “nominated” URLs can be crawled and their archival derivatives stored by the Internet Archive. The archived URLs can thereafter be accessed through the Wayback Machine and/or the EOT project’s web archive.

To avoid redundancy with complementary efforts, we will 1) focus on federal resources outside of the EPA.gov and NOAA.gov host domains, though Chicago-regional offices of the same may be considered within our scope, and 2) begin our listing from the foundation that @bkirkbri lays with the spreadsheet directly above and his work for the Natural Resources Defense Council.
To identify further URLs to be archived, we will perform host-level queries for climate data and reporting keywords across domains belonging to federal agencies, congressional committees, national laboratories, etc., and scrape the cited URLs in scientific journals and federal grants databases.

Our resulting “seed list” of URLs will also be made publicly available as a read-only spreadsheet for future events to acquire and build upon.
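
For anyone who wants to script part of the seeding step, here is a minimal Python sketch of pulling cited URLs out of a block of text, dropping the hosts we are leaving to other groups, and deduplicating the rest. The function names, the regular expression, and the ".gov only" rule are assumptions for illustration, not an agreed workflow. It also excludes everything under epa.gov and noaa.gov, so any Chicago-regional office pages we decide are in scope would need to be added back by hand.

```python
import re
from urllib.parse import urlsplit

# Domains left to complementary efforts (assumption: exclude them entirely).
EXCLUDED_DOMAINS = ("epa.gov", "noaa.gov")

# Rough pattern for http(s) URLs embedded in prose; not a full URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return every http(s) URL found in the text, in order of appearance."""
    # Strip punctuation that usually belongs to the sentence, not the URL.
    return [match.rstrip(".,;:") for match in URL_PATTERN.findall(text)]


def in_scope(url):
    """True for .gov hosts that are not under an excluded domain."""
    host = (urlsplit(url).hostname or "").lower()
    if not host.endswith(".gov"):
        return False
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in EXCLUDED_DOMAINS
    )


def build_seed_list(text):
    """Extract, filter, and deduplicate candidate seeds, keeping first-seen order."""
    seen = set()
    seeds = []
    for url in extract_urls(text):
        parts = urlsplit(url)
        # Treat http/https and a trailing slash as the same resource.
        key = ((parts.hostname or "").lower(), parts.path.rstrip("/"), parts.query)
        if in_scope(url) and key not in seen:
            seen.add(key)
            seeds.append(url)
    return seeds
```

The output is a plain list of URLs that could be pasted into the shared seed spreadsheet; nominating them to EOT would still happen through the project's own form or bookmarklet.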

2. Export / Hack / Ask

In the course of the above seeding, we are likely to encounter the kinds of interactive data portals that are difficult to archive with automated web crawlers. In addition to adding these resources’ URLs to the seed list above, we will flag them for…

Export: When these portals’ underlying data can be generated through a web browser into formats readable offline, we will upload the resulting files directly to the Internet Archive’s collections.
Hack: When no direct access to offline-readable data is enabled, we will investigate existing (https://github.com/edgi-govdata-archiving/) or experiment with new tools for extracting them.
Ask: When all else fails, just ask for it :-) We will identify data sets that may be retrievable directly from the federal owner/sponsor by way of FOIA request.
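
One possible way to record these flags alongside the seed list is sketched below in Python. The `Action` names mirror the three options above, but the column layout, the `choose_action` helper, and its two yes/no inputs are hypothetical conventions for illustration, not something the group has agreed on.

```python
import csv
import io
from enum import Enum


class Action(Enum):
    EXPORT = "export"  # data can be downloaded through the browser
    HACK = "hack"      # needs an existing or new extraction tool
    ASK = "ask"        # request from the federal owner, e.g. via FOIA


def choose_action(has_browser_export, tool_can_extract):
    """Pick the first option that applies, in Export / Hack / Ask order."""
    if has_browser_export:
        return Action.EXPORT
    if tool_can_extract:
        return Action.HACK
    return Action.ASK


def flagged_rows_to_csv(rows):
    """Serialize (url, Action, note) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "action", "note"])
    for url, action, note in rows:
        writer.writerow([url, action.value, note])
    return buffer.getvalue()
```

The resulting CSV text can be imported as an extra tab in the seed spreadsheet, so that later events can see which portals still need focused work.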

With the boost we've gotten from prior efforts, the tools already in place, and the ongoing work to create some new ones, I think we're in great shape to make a short but meaningful contribution on Tuesday!

DGdodston commented 7 years ago

Hi all,

Glad to hear about these updates. Karl, thanks for leading Tuesday's session!

Looking forward to next Tuesday,

Dan

dcwalk commented 7 years ago

Hey, @kblumenthal! Are you all set for tonight? We've had some changes to our workflow since last week and wanted to check that you all have what you need :)

kblumenthal commented 7 years ago

I think so, @dcwalk! I've been catching up with the folks at Penn and Harvard today, so I think I can steer folks in the right directions tonight. Literally listening to the Rocky theme as I prepare...

Look forward to a summary of our work on this and as many other channels as I can find soon!

DGdodston commented 7 years ago

@kblumenthal & @dcwalk -- thanks for your expertise & leadership with this

kblumenthal commented 7 years ago

Links to keep handy tonight:

Seed list(s): https://docs.google.com/spreadsheets/d/1IZdZZTKt1ZO6PmSPiNsz-n__kJ2d8JhTdKD9KGts06k/edit?usp=sharing - temporarily read-write, but assiduously backed up ;-)
Webrecorder: https://webrecorder.io/datarescuechi
Other archiving tools from EDGI: https://github.com/edgi-govdata-archiving
Internet Archive: https://archive.org/details/@datarescuechicago

bkirkbri commented 7 years ago

Documentation for scrapy spider: https://doc.scrapy.org/en/1.3/intro/tutorial.html

derekeder commented 7 years ago

closing for now.

chihacknight / breakout-groups

EDGI / End of Term Project Group #69
