chihacknight / breakout-groups

Breakout groups that meet at Chi Hack Night every Tuesday in Chicago
https://chihacknight.org/breakouts.html

EDGI / End of Term Project Group #69

Closed DGdodston closed 7 years ago

DGdodston commented 7 years ago

About the group

This Chi Hack Night breakout group collaborates with #DataRefuge, the Environmental Data & Governance Initiative (EDGI), and the Internet Archive’s End of Term 2016 project to archive federal web pages and data that are in danger of disappearing during the Trump administration. The group is focused on preserving information and data from agencies whose programs and data are at high risk of being removed from online public access or even deleted. This project is urgent because the Trump transition team has identified these environmental programs as priorities for cutting or change.

Upcoming EDGI / End of Term Project Group breakout sessions during Chi Hack Night: 1/17, other TBA dates

More info:

Group leaders: Karl-Rainer Blumenthal, Dan Godston

Stay tuned for updates!

kblumenthal commented 7 years ago

Thanks for getting the ball rolling, Dan! Here are more-more resources for those interested in the work we can do together :-)

I got to speak with the Philly folks a little bit the other day, and they will catch us up to speed with their work--the results as well as the event resources--after their event, and would likewise like to help us promote our effort through their blog if and when we can give them a short description of our group/event. Let's everyone interested meet as a breakout group at the 1/10 hack night in order to strategize, identify the kinds of data we have the skills to rescue, and plan out a super-productive 1/17 event open to all that can help!

DGdodston commented 7 years ago

Karl,

Thanks for providing the additional resources, and for leading tomorrow's breakout session!

Dan

dcwalk commented 7 years ago

Hey @dcwalk from Toronto chiming in! I'm involved with the Tools for Government Archiving https://github.com/edgi-govdata-archiving which came out of the Dec 17 event.

We are going to be sprinting on our docs on Jan 10 at the same time during the Toronto Civic Tech night (https://www.meetup.com/Civic-Tech-Toronto/events/236242168/).

kblumenthal commented 7 years ago

Woo! Thanks, @dcwalk! We'll be sure to check the docs after your meetup. That's good timing.

We'll have less uninterrupted time for the hack element of our own effort than Toronto or Philly, I assume, so most of all I want to make sure that we avoid spending ours on redundant work. Anything you can do to catch us up on domains, datasets, or data types that have or haven't been addressed in Toronto would be very helpful.

codersquid commented 7 years ago

Is there a civic tech group in the SF/Bay area who is working on this in person? I'd like to share the information with someone who is interested.

kblumenthal commented 7 years ago

Not that I've heard of yet, @codersquid, but I'll ask folks at the Internet Archive and report back if they know any better. I bet that a few plucky folks could take the info above to quickly put one together, though ;-)

dcwalk commented 7 years ago

Just chiming in, @codersquid: the EDGI people (I think the same ones @kblumenthal mentioned) have been pairing up interested folks, but don't have a public list of 'in progress' events.

Ones that are scheduled are here: https://envirodatagov.org/events/

dcwalk commented 7 years ago

RE: not duplicating efforts...

At the Dec 17 event we focused on EPA. A good chunk was done (much of it directly nominated into the End of Term Internet Archive, and some custom tools developed to crawl/scrape stuff that wouldn't be picked up), but there are still some EPA gaps, mostly the regional offices. The #datarefuge event is focusing on another agency, NOAA, so it might make sense not to tackle that either, and instead pick a well-scoped section of an agency to tackle.

There are a couple strategies:

  1. Tackle an agency/program based upon an Agency Forecast and aim to directly seed using the eth16 bookmarklet
  2. Look at a regional office and aim to nominate more seeds using the bookmarklet
  3. Address one particular dataset that would prove challenging for the Internet Archive to crawl. This primer on How the Internet Archive works and the whole event toolkit should help, as well as the tools we have in our github. This would take the most time to get up and running on, so if time limitations are a factor, bear that in mind

Sorry for the delay in this response, was just sorting out a good approach to suggest @kblumenthal. Ping me if you'd like to discuss more.
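As a rough aid for the "not duplicating efforts" point above, here is a minimal Python sketch that checks whether a URL already has a capture in the Wayback Machine. It assumes the Internet Archive's public Wayback availability endpoint and its JSON response shape; the function names are hypothetical, and a Wayback capture does not by itself mean the URL was nominated to the End of Term project, so treat this as a first-pass filter only.

```python
import json
from urllib.parse import urlencode

# Assumption: the public Wayback availability endpoint, not part of the EOT nomination tool.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the query URL that asks for the closest existing capture of `url`."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response_text):
    """Return (timestamp, snapshot_url) from a response body, or None if no capture exists."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("timestamp"), closest.get("url")
```

Fetching `availability_query(...)` with any HTTP client and passing the body to `closest_snapshot` would give a quick list of candidate pages with no capture yet, which could then be nominated through the bookmarklet.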

kblumenthal commented 7 years ago

This is perfect for our needs tonight, @dcwalk -- thank you so much! Hopefully we can get grounded and started with parts 1 and 2 above tonight, and identify the resources with further needs for focused work in part 2 (next week).

bkirkbri commented 7 years ago

For anyone interested in mirroring climate datasets:

https://climatemirror.org

Spreadsheet of resources requested for mirroring

kblumenthal commented 7 years ago

A small group of volunteers met at the January 10 Chi Hack Night to review the progress made and/or planned by data rescuers in Toronto and Philadelphia, and to identify potential sources of at-risk data and the tools available to collect and preserve them. From this initial conversation we outlined the following objectives for our January 17 mini-hackathon (according to my notes -- please add/update/edit!):

1. Seed the End of Term 2016 project

The first objective of the night is to provide as many URLs for web-based and at-risk resources as we can identify to the End of Term (EOT) 2016 project. These “nominated” URLs can be crawled and their archival derivatives stored by the Internet Archive. The archived URLs can thereafter be accessed through the Wayback Machine and/or the EOT project’s web archive.

Our resulting “seed list” of URLs will also be made publicly available as a read-only spreadsheet for future events to acquire and build upon.
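For anyone assembling the seed list from several volunteers' notes, a minimal Python sketch of deduplicating nominated URLs before they go into the spreadsheet might look like the following. The normalization rules, function names, and CSV columns are assumptions for illustration, not the EOT project's own format.

```python
import csv
import io
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw):
    """Return a canonical form of a URL so near-duplicates compare equal."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw  # assumption: default to http when no scheme is given
    parts = urlsplit(raw)
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server, so it is not a distinct page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def build_seed_list(nominations):
    """Deduplicate (url, note) pairs, keeping the first note seen for each URL."""
    seeds = {}
    for raw_url, note in nominations:
        seeds.setdefault(normalize_url(raw_url), note)
    return sorted(seeds.items())


def to_csv(seeds):
    """Serialize the seed list as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "note"])
    writer.writerows(seeds)
    return buffer.getvalue()
```

Calling `to_csv(build_seed_list(pairs))` on a list of `(url, note)` pairs yields text that can be pasted or imported into the shared sheet.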

2. Export / Hack / Ask

In the course of the above seeding, we are likely to encounter the kinds of interactive data portals that are difficult to archive with automated web crawlers. In addition to adding these resources’ URLs to the seed list above, we will flag them for…

With the boost we've gotten from prior efforts, the tools already in place, and the ongoing work to create some new ones, I think we're in great shape to make a short but meaningful contribution on Tuesday!

DGdodston commented 7 years ago

Hi all,

Glad to hear about these updates. Karl, thanks for leading Tuesday's session!

Looking forward to next Tuesday,

Dan

dcwalk commented 7 years ago

Hey! @kblumenthal Are you all set for tonight? We've had some changes to our workflow after the last week and wanted to check in that you all had what you needed :)

kblumenthal commented 7 years ago

I think so, @dcwalk! I've been catching up with the folks at Penn and Harvard today, so I think I can steer folks in the right directions tonight. Literally listening to the Rocky theme as I prepare...

Look forward to a summary of our work on this and as many other channels as I can find soon!

DGdodston commented 7 years ago

@kblumenthal & @dcwalk -- thanks for your expertise & leadership with this

kblumenthal commented 7 years ago

Links to keep handy tonight:

Seed list(s): https://docs.google.com/spreadsheets/d/1IZdZZTKt1ZO6PmSPiNsz-n__kJ2d8JhTdKD9KGts06k/edit?usp=sharing - temporarily read-write, but assiduously backed up ;-)

Webrecorder: https://webrecorder.io/datarescuechi

Other archiving tools from EDGI: https://github.com/edgi-govdata-archiving

Internet Archive: https://archive.org/details/@datarescuechicago

bkirkbri commented 7 years ago

Documentation for scrapy spider: https://doc.scrapy.org/en/1.3/intro/tutorial.html

derekeder commented 7 years ago

closing for now.