CodeForPittsburgh / food-access-map-data

Data for the food access map
MIT License

What is the Food Access Map?

This project's goal is to create an internal and public-facing resource (e.g., an interactive map) for people looking to find healthy, affordable food. The Pittsburgh Food Policy Council, an umbrella organization for food-related nonprofits, is the project sponsor. More information about the need for this project can be found here.

There are many food-related nonprofits in the Pittsburgh area, and each maintains datasets about different food access programs and where they are offered (for example, Greater Pittsburgh Food Bank maintains a list of food pantries). The data processing part of this project gathers data from various sources and merges the datasets into a common format.

Where is the Map located?

The map is located at https://codeforpittsburgh.github.io/FoodAccessMap/. Code for the map is located in a different repo: https://github.com/CodeForPittsburgh/CodeForPittsburgh.github.io/tree/master/FoodAccessMap

How does the map work?

The map relies on the following steps to provide results:

  1. Raw data is manually gathered from various providers at the federal and local level and saved in the GitHub repository.
  2. A GitHub Action spins up a virtual machine that runs the various scripts, which clean, transform, deduplicate, and collate the multiple data sources into a single file for use by the map (see the sketch after this list).
  3. The map itself is hosted on another Code for Pittsburgh GitHub repo.
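To make step 2 concrete, here is a minimal sketch of the collate stage in Python with pandas. The file names and the plain drop_duplicates() call are illustrative assumptions, not the project's actual code:

    import pandas as pd

    # Hypothetical cleaned per-source files; the real pipeline produces
    # one cleaned CSV per prep script.
    SOURCE_FILES = ["food_banks_clean.csv", "farmers_markets_clean.csv"]

    # Read every cleaned source (already in the common schema) and stack them.
    frames = [pd.read_csv(path) for path in SOURCE_FILES]
    merged = pd.concat(frames, ignore_index=True)

    # Drop exact duplicates; the real pipeline also runs a trained fuzzy
    # deduplication step (see "Training the Deduplication Data Sets" below).
    merged = merged.drop_duplicates()

    # Write the single file the map reads.
    merged.to_csv("merged_datasets.csv", index=False)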

How You Can Help

Volunteers can help in a number of ways, including developing code, fixing bugs, and improving project documentation. A list of outstanding issues can be found on the issues page, but if you can't find an issue you think you can work on, don't hesitate to ask one of us for help figuring out how you can contribute!

What Programs You Need Installed

Python: Some of the data processing scripts are written in Python.

R: Some of the data processing scripts are written in R.

There are multiple ways to access and manipulate the data, but for simplicity's sake, this README recommends either Python or R.

Get the Data

Python

This project uses Python 3, pipenv, and pytest.

Required packages are listed in Pipfile and can be installed using

$ pipenv install

This installs the packages in a virtual environment, a Python convention that lets different projects depend on different packages and versions.

You can run a single command inside the virtual environment using pipenv run, or open a shell using

$ pipenv shell
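Alternatively, pipenv run executes a single command inside the environment without opening a shell, for example:

$ pipenv run pytest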

Tests are stored in the tests/ directory and include any file of the form test_*.py. Inside the virtual environment, you can run them using

$ pytest
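As a minimal illustration of the test layout (the file name, helper, and assertion here are hypothetical, not taken from this repo), a file such as tests/test_example.py could look like:

    # tests/test_example.py -- hypothetical example of the test_*.py pattern
    def normalize_zip(zip_code):
        # Toy helper for illustration: pad ZIP codes to five digits.
        return str(zip_code).zfill(5)

    def test_normalize_zip_pads_short_codes():
        assert normalize_zip(123) == "00123"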

When you're done with the virtual environment, you can leave it using

$ exit
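If you just want to explore the merged data from Python, one option (assuming pandas is available in your environment) is:

    import pandas as pd

    # Read the merged dataset straight from the repository.
    url = ("https://raw.githubusercontent.com/CodeForPittsburgh/"
           "food-access-map-data/master/merged_datasets.csv")
    my_data = pd.read_csv(url)

    # List the attributes of the data table, analogous to names() in R.
    print(my_data.columns.tolist())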

R

It is recommended to use the RStudio IDE to interact with the data.

  1. Download/Install R
  2. Download RStudio
  3. Start an RStudio Project (recommended)
  4. Install the tidyverse package with the following line of code (one-time action):

install.packages("tidyverse")

  5. Start a new R script or RMarkdown document and read in the data with the following lines of code:

    library(tidyverse)
    my_data <- read_csv("https://raw.githubusercontent.com/CodeForPittsburgh/food-access-map-data/master/merged_datasets.csv")

  6. Once you've run these lines, you have access to the data and can use the various functions in base R or the tidyverse to explore it.

  7. For example, you can use the command names(my_data) to see the attributes of the data table.

The end result of all of the processing steps above is a new merged_datasets.csv, which the map points to for its data!

Data Sources for Food Access Map

Sources are obtained and prepared for additional processing via our data prep scripts. The source rules for utilizing those scripts can be found here.

Data Labels

These labels are listed in merged_datasets.csv and denote unique traits of a food source.
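To see which labels actually appear, you can inspect the merged file; the column name label below is a hypothetical placeholder, since the real field names are defined in schema.xlsx:

    import pandas as pd

    url = ("https://raw.githubusercontent.com/CodeForPittsburgh/"
           "food-access-map-data/master/merged_datasets.csv")
    my_data = pd.read_csv(url)

    # "label" is a hypothetical column name -- check schema.xlsx for the
    # actual field that stores these labels.
    print(my_data["label"].value_counts())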

Adding new datasets

New datasets can be added as they are discovered. A prep script can be added to this directory.

New datasets need to correspond to the project-wide schema set in schema.xlsx. Cleaned data should be saved in .csv format here.

Any new prep script also needs to be added to the source_r_scripts.R or source_python_scripts.py files in the same prep_source_scripts directory. The source scripts control what prep scripts are run to update the full dataset.
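As a rough sketch of what a new Python prep script might look like (the input file, column names, and output path are hypothetical; match them to schema.xlsx and the existing prep scripts instead):

    import pandas as pd

    # Hypothetical raw input from a newly discovered provider.
    raw = pd.read_csv("new_provider_raw.csv")

    # Map the provider's columns onto placeholder schema fields; the
    # authoritative field list lives in schema.xlsx.
    clean = pd.DataFrame({
        "name": raw["site_name"],
        "address": raw["street_address"],
        "type": "food pantry",
    })

    # Save the cleaned data as a CSV for the merge step.
    clean.to_csv("new_provider_clean.csv", index=False)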

Training the Deduplication Data Sets

Because we are combining multiple data sets, we utilize a deduplication process to identify and resolve possible duplicates.

The "training set", which is used to teach the deduplication program what is and is not likely to be a duplicate, is located here. Adding to the training data primarily consists of adding cases of address strings that are duplicates, as well as cases of address strings that aren't. To train on new data, you can use the IPython Notebook located here.
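Conceptually, each training example pairs two address strings with a judgment about whether they refer to the same place. This toy illustration in Python invents its own pair format for explanation and does not match the project's actual training-file format:

    # Each example pairs two address strings with a duplicate/not-duplicate label.
    training_pairs = [
        (("100 Main St, Pittsburgh, PA", "100 Main Street, Pittsburgh PA"), True),
        (("100 Main St, Pittsburgh, PA", "200 Oak Ave, Pittsburgh, PA"), False),
    ]

    # A deduplication model learns from such pairs which string differences
    # (abbreviations, punctuation) are harmless and which indicate distinct places.
    for (a, b), is_duplicate in training_pairs:
        print(f"{a!r} vs {b!r} -> duplicate: {is_duplicate}")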

Extra Resources

For An Introduction to R and RStudio

https://education.rstudio.com/learn/beginner/

Introduction To Github

https://guides.github.com/