Closed: csethna closed this issue 6 years ago
In this one-time edition of Zombie Datasets, we examined two underutilized City of Chicago datasets.
We determined that Chicago Public Library data is underutilized not because it isn't useful, but because of its infrequent (monthly) update cadence. We also determined that the movies dataset adds very few records each year, and because the movies are available from other outlets, the use case for the individual dataset could be perceived as limited.
The frequency of computer sessions per branch could indicate which branches get more traffic, but it is also influenced by other factors such as location, square footage, and the availability of public access terminals. A more useful way of examining this data is to aggregate it over several years and determine the statistically "busiest" months of the year for computer use. Fields exist to determine peak computer-use times by branch. This information is relevant in deciding whether additional funding should be allocated for more terminals (by comparing average capacity against maximum capacity). It could also help identify the best and worst times to update software or hardware, so that upgrades inconvenience the fewest patrons.

We also examined the number of holds placed per branch and the number of holds fulfilled per branch. This data allows us to determine which branches have the highest "flake out" rate among patrons. Using GIS, it is possible to search for a geographic pattern in flakiness, though other factors could contribute to the phenomenon.
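The cross-year monthly aggregation described above can be sketched with Pandas, the tool chosen for this project. This is a minimal sketch on made-up data; the column names (`month`, `branch`, `cybernavigator_sessions`) are assumptions, and the real field names on the data portal may differ.

```python
import pandas as pd

# Hypothetical rows modeled loosely on the CPL computer-sessions dataset.
sessions = pd.DataFrame({
    "month": ["2015-01", "2015-07", "2016-01", "2016-07", "2015-03", "2016-03"],
    "branch": ["Austin", "Austin", "Austin", "Austin", "Uptown", "Uptown"],
    "cybernavigator_sessions": [120, 340, 130, 360, 200, 210],
})

# Aggregate across years: group by calendar month regardless of year,
# then rank months by their average session count.
sessions["month_of_year"] = pd.to_datetime(sessions["month"]).dt.month
busiest = (
    sessions.groupby("month_of_year")["cybernavigator_sessions"]
    .mean()
    .sort_values(ascending=False)
)
print(busiest.index[0])  # calendar month with the highest average sessions
```

Grouping on the calendar month (rather than the year-month pair) is what lets several years of data reinforce a seasonal pattern instead of being read as separate periods.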
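The "flake out" rate comparison could be computed by joining the two holds datasets on branch. Again a sketch on invented numbers; the column names `holds_placed` and `holds_filled` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical per-branch hold counts standing in for the two datasets.
placed = pd.DataFrame({"branch": ["Austin", "Uptown", "Sulzer"],
                       "holds_placed": [500, 800, 650]})
filled = pd.DataFrame({"branch": ["Austin", "Uptown", "Sulzer"],
                       "holds_filled": [450, 600, 640]})

holds = placed.merge(filled, on="branch")
# "Flake out" rate: share of placed holds that were never picked up.
holds["flake_rate"] = 1 - holds["holds_filled"] / holds["holds_placed"]
flakiest = holds.sort_values("flake_rate", ascending=False).iloc[0]
print(flakiest["branch"], round(flakiest["flake_rate"], 2))
```

The resulting per-branch rates could then be mapped with GIS tooling to look for the geographic pattern mentioned above.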
The City of Chicago has considerable "zombie data." However, our group was able to determine that even the most underutilized data is not completely useless. In the future, "rehabilitation" of these lesser-accessed datasets could be an interesting project, one that raises awareness of the breadth of information available on the data portal and showcases the powerful applications and inferences that can be drawn from analysis.
The way I think of zombie datasets is looking at which datasets haven't been updated in a while, and not which ones were accessed the least.
Great idea!
Cyrus Sethna about.csethna.com
Happy Halloween. Welcome to Zombie Datasets.
The goal of this breakout group is to identify “walking dead” datasets, defined as: “publicly available datasets which have been accessed least recently with relation to their peer datasets from the same source.”
In this exercise, we will be playing Dr. Robert Neville and treating these datasets as hosts infected with the Krippin Virus. This means, unlike zombies from the Robert Kirkman (Walking Dead) or Max Brooks (World War Z) universes, these “zombie” datasets are capable of being rehabilitated, with the right use case.
Upon identification of a zombie dataset, the team will work quickly to administer GA-series Serum 391, Compound 6. This means devising a use case for which the dataset could provide meaningful insights and, time allowing, creating an MVP to demonstrate a potential application of the data.
Group leaders: @csethna on Slack (#zombie-data)
Tools: Pandas for this project, GitHub, and the Chi Hack Night Slack channel