chihacknight / breakout-groups

Breakout groups that meet at Chi Hack Night every Tuesday in Chicago
https://chihacknight.org/breakouts.html

Zombie Datasets #129

Closed csethna closed 6 years ago

csethna commented 6 years ago

Happy Halloween. Welcome to Zombie Datasets.

The goal of this breakout group is to identify “walking dead” datasets, defined as: “publicly available datasets which have been accessed least recently with relation to their peer datasets from the same source.”
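One way to make this definition concrete is a minimal sketch that groups datasets by source and picks the least recently accessed ones within each group. The record fields (`source`, `name`, `last_accessed`), the sample values, and the `fraction` cutoff are all assumptions made for illustration; they are not taken from any real data portal.

```python
from datetime import date

# Hypothetical catalog records; names and dates are placeholders, not real data.
catalog = [
    {"source": "portal-a", "name": "dataset-1", "last_accessed": date(2017, 10, 1)},
    {"source": "portal-a", "name": "dataset-2", "last_accessed": date(2015, 2, 14)},
    {"source": "portal-a", "name": "dataset-3", "last_accessed": date(2017, 6, 30)},
    {"source": "portal-a", "name": "dataset-4", "last_accessed": date(2016, 12, 5)},
    {"source": "portal-b", "name": "dataset-5", "last_accessed": date(2017, 9, 9)},
    {"source": "portal-b", "name": "dataset-6", "last_accessed": date(2014, 1, 20)},
]


def zombies_by_source(records, fraction=0.25):
    """Return, per source, the least recently accessed share of its datasets."""
    by_source = {}
    for rec in records:
        by_source.setdefault(rec["source"], []).append(rec)

    result = {}
    for source, recs in by_source.items():
        # Oldest access date first, so the "walking dead" come out on top.
        ordered = sorted(recs, key=lambda r: r["last_accessed"])
        # Always flag at least one dataset per source.
        cutoff = max(1, int(len(ordered) * fraction))
        result[source] = [r["name"] for r in ordered[:cutoff]]
    return result


print(zombies_by_source(catalog))
```

Ranking within each source, rather than across the whole catalog, keeps the comparison "with relation to their peer datasets from the same source."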

In this exercise, we will be playing Dr. Robert Neville and treating these datasets as hosts infected with the Krippin Virus. This means, unlike zombies from the Robert Kirkman (Walking Dead) or Max Brooks (World War Z) universes, these “zombie” datasets are capable of being rehabilitated, with the right use case.

Upon identification of a zombie dataset, the team will work quickly to administer GA-series Serum 391, Compound 6. This means devising a use case for which the dataset could provide meaningful insights and, time allowing, creating an MVP to demonstrate a potential application for the data.

Group leaders

Tools

GitHub

Chi Hack Night slack channel

csethna commented 6 years ago

Example Problem: The Library

Discuss: Why is Chicago Public Library data so unpopular?

csethna commented 6 years ago

Findings

In this one-time edition of Zombie Datasets, we examined two underutilized City of Chicago datasets.

Datasets examined

  1. Chicago Public Library data
  2. Chicago Park District - summer movies data

Summary

We determined that Chicago Public Library data is underutilized not because it isn't useful, but because it is updated only monthly. We also determined that the movies dataset for each year is very small and that, because the movies are available from other outlets, the use case for the individual datasets could be perceived as limited.

Use Cases

Library
• Looking at the frequency of computer sessions per branch could show which branches get more traffic, but that figure is also influenced by other factors such as location, square footage, and availability of public access terminals. A more useful way of examining this data is to aggregate it over several years and determine the statistically "busiest" months of the year for computer use. Fields exist that make it possible to determine peak computer use times by branch. This information is relevant in deciding whether additional funding should be allocated for more terminals (by comparing average capacity against maximum capacity). It could also help determine the best times to update software or hardware so as to inconvenience the fewest patrons.
• Other interesting data available are the number of holds placed per branch and the number of holds fulfilled per branch. This data allows us to determine which branches have the highest "flake out" rate among patrons. Using GIS, it is possible to search for a geographic pattern in flakiness, though other factors could contribute to the phenomenon.
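The two library ideas above can be sketched roughly as follows. The row layouts, branch names, and numbers are invented for illustration and are not the actual Chicago Public Library schema or figures; treat this as a sketch under those assumptions.

```python
from collections import defaultdict

# Hypothetical rows: (branch, year, month, computer_sessions). Made-up numbers.
sessions = [
    ("Branch A", 2015, 1, 100),
    ("Branch A", 2015, 7, 300),
    ("Branch A", 2016, 1, 120),
    ("Branch A", 2016, 7, 280),
    ("Branch B", 2015, 1, 50),
    ("Branch B", 2015, 7, 90),
    ("Branch B", 2016, 1, 70),
    ("Branch B", 2016, 7, 110),
]

# Hypothetical rows: (branch, holds_placed, holds_filled). Made-up numbers.
holds = [
    ("Branch A", 1000, 900),
    ("Branch B", 400, 300),
]


def average_sessions_by_month(rows):
    """Average system-wide sessions per calendar month across all years seen."""
    totals = defaultdict(int)  # month -> sessions summed over branches and years
    years = defaultdict(set)   # month -> distinct years with data
    for _branch, year, month, count in rows:
        totals[month] += count
        years[month].add(year)
    return {month: totals[month] / len(years[month]) for month in totals}


def unfulfilled_hold_rate(rows):
    """Share of placed holds that were never filled, per branch."""
    return {
        branch: (placed - filled) / placed
        for branch, placed, filled in rows
        if placed  # skip branches with no holds to avoid dividing by zero
    }


monthly = average_sessions_by_month(sessions)
busiest_month = max(monthly, key=monthly.get)
print(busiest_month, monthly)
print(unfulfilled_hold_rate(holds))
```

Averaging per calendar month across years smooths out one-off spikes, and the unfulfilled share is only a proxy for "flakiness" since, as noted above, other factors could contribute.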
Movies
• A "what to do in the summer" Twitter bot that pulls movie titles, ratings, and locations from the dataset.
• An analysis of repeated titles over time. What films do Park District employees most enjoy showing?
• Which parks have shown the most films?
• Is there a geographic bias to which parks are selected for film screenings?
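A minimal sketch of the counting questions above (most-used parks and repeated titles) might look like this. The `(year, park, title)` layout and every value in it are placeholders, not rows from the Park District dataset.

```python
from collections import Counter

# Hypothetical screenings: (year, park, title). Placeholder values only.
screenings = [
    (2016, "Park A", "Film X"),
    (2016, "Park B", "Film Y"),
    (2017, "Park A", "Film X"),
    (2017, "Park A", "Film Z"),
    (2017, "Park C", "Film Y"),
]

# Which parks have shown the most films?
films_per_park = Counter(park for _year, park, _title in screenings)


def repeated_titles(rows):
    """Map each title shown in more than one year to the years it appeared."""
    years_by_title = {}
    for year, _park, title in rows:
        years_by_title.setdefault(title, set()).add(year)
    return {
        title: sorted(years)
        for title, years in years_by_title.items()
        if len(years) > 1
    }


print(films_per_park.most_common(1))
print(repeated_titles(screenings))
```

Joining `films_per_park` to park coordinates would be the natural next step for the geographic-bias question, but that requires location data this sketch does not assume.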

Conclusion

The City of Chicago has considerable "zombie data." However, our group was able to determine that even the most underutilized data is not completely useless. In the future, "rehabilitation" of these lesser-accessed datasets could be an interesting project that raises awareness of the breadth of information available on the data portal and showcases the powerful applications and inferences that analysis can draw from it.

stevevance commented 6 years ago

The way I think of zombie datasets is looking at which datasets haven't been updated in a while, and not which ones were accessed the least.
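This update-based definition could be sketched as a simple staleness check. The `last_updated` field, the sample records, and the one-year threshold are assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical records; names and dates are placeholders, not real portal data.
records = [
    {"name": "dataset-1", "last_updated": date(2014, 3, 1)},
    {"name": "dataset-2", "last_updated": date(2017, 10, 15)},
    {"name": "dataset-3", "last_updated": date(2016, 1, 10)},
]


def stale_datasets(rows, today, max_age_days=365):
    """Names of datasets not updated within max_age_days, oldest update first."""
    stale = [r for r in rows if (today - r["last_updated"]).days > max_age_days]
    return [r["name"] for r in sorted(stale, key=lambda r: r["last_updated"])]


print(stale_datasets(records, today=date(2017, 11, 9)))
```

A fixed threshold is crude: a dataset that is only meant to change yearly would need a longer window than one that is supposed to refresh daily.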

csethna commented 6 years ago

Great idea!

Cyrus Sethna about.csethna.com

On Nov 9, 2017, 3:33 PM -0600, Steven Vance notifications@github.com, wrote:

The way I think of zombie datasets is looking at which datasets haven't been updated in a while, and not which ones were accessed the least.