This project's goal is to create an internal and public-facing resource (e.g., an interactive map) for people looking to find healthy, affordable food. The Pittsburgh Food Policy Council, an umbrella organization for food-related nonprofits, is the project sponsor. More information about the need for this project can be found here.
There are many food-related nonprofits in the Pittsburgh area, and each maintains datasets about different food access programs and where they are offered (for example, Greater Pittsburgh Food Bank maintains a list of food pantries). The data processing part of this project gathers data from various sources and merges the datasets into a common format.
The map is located at: https://codeforpittsburgh.github.io/FoodAccessMap/
Code for the map is located in a different repo: https://github.com/CodeForPittsburgh/CodeForPittsburgh.github.io/tree/master/FoodAccessMap
The map relies on the data processing steps described below to provide its results.
Volunteers can help in a number of ways, including developing code, fixing bugs, and improving project documentation. A list of outstanding issues can be found on the issues page, but if you can't find an issue you think you can work on, don't hesitate to ask one of us for help figuring out how you can contribute!
- Python: some of the data processing scripts are written in Python.
- R: some of the data processing scripts are written in R.
There are multiple ways to access and manipulate the data, but for simplicity's sake, this README recommends using Python or R.
This project uses Python 3, pipenv, and pytest. Required packages are listed in the `Pipfile` and can be installed using:

```
$ pipenv install
```
This installs the packages in a virtual environment, a Python convention that allows different projects to have their own dependencies and versions.
You can run a single command inside the virtual environment using `pipenv run`, or open a shell using:

```
$ pipenv shell
```
Tests are stored in the `tests/` directory and include any file of the form `test_*.py`. You can run them using:

```
$ pytest
```
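As a minimal sketch of what a test file in that directory might look like (the file name, the helper function, and the assertions below are hypothetical, not taken from this repo):

```python
# tests/test_example.py -- hypothetical file name following the test_*.py convention

def normalize_zip(zip_code: str) -> str:
    """Toy helper defined inline so the sketch is self-contained."""
    return zip_code.strip()[:5]

def test_normalize_zip_strips_whitespace():
    # pytest discovers any function starting with "test_" in files named test_*.py
    assert normalize_zip(" 15213 ") == "15213"

def test_normalize_zip_truncates_plus_four():
    assert normalize_zip("15213-2612") == "15213"
```

Inside the virtual environment you can also run the suite in one step with `pipenv run pytest`.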
When you're done with the virtual environment, you can leave it using:

```
$ exit
```
It is recommended to use the RStudio IDE to interact with the data.
Install the `tidyverse` package with the following line of code (a one-time action):

```r
install.packages("tidyverse")
```
Start a new R script or RMarkdown document and read in the data with the following lines of code:

```r
library(tidyverse)
my_data <- read_csv("https://raw.githubusercontent.com/CodeForPittsburgh/food-access-map-data/master/merged_datasets.csv")
```
Once you've run these lines of code, you have access to the data. You can use the various functions in base R or the tidyverse to explore it. For example, the command `names(my_data)` shows the attributes of the data table.
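If you'd rather explore the data in Python, a minimal equivalent sketch using pandas (pandas is an assumption here, not a project requirement) would be:

```python
import pandas as pd

# Read the merged dataset straight from GitHub, just like the R example above.
url = "https://raw.githubusercontent.com/CodeForPittsburgh/food-access-map-data/master/merged_datasets.csv"
my_data = pd.read_csv(url)

# List the column names -- the pandas equivalent of names(my_data) in R.
print(list(my_data.columns))
```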
Data for the food access map: `merged_datasets.csv` is the most current version of the compiled PFPC data.
To regenerate merged_datasets.csv with new data, run the "Generate Merged Dataset" GitHub Action. This calls `data_prep_scripts/run.sh`, which runs the individual prep scripts in order.
The end result of all of these steps is a new merged_datasets.csv, which the map points to for its data!
Sources are obtained and prepared for additional processing via our data prep scripts. The source rules for utilizing those scripts can be found here.
These labels are listed in merged_datasets.csv and are used to denote particular traits of each food source.
New datasets can be added as they are discovered. A prep script can be added to this directory.
New datasets need to correspond to the project-wide schema set in schema.xlsx. Cleaned data should be saved in .csv format here.
Any new prep script also needs to be added to the source_r_scripts.R or source_python_scripts.py files in the same prep_source_scripts directory. The source scripts control what prep scripts are run to update the full dataset.
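As a rough, hypothetical sketch of what a new prep script might do (the source URL, column names, and output path below are placeholders, not the project's actual schema; check schema.xlsx for the real column names):

```python
import pandas as pd

# Hypothetical source -- replace with the real dataset being added.
SOURCE_URL = "https://example.org/new_food_source.csv"

raw = pd.read_csv(SOURCE_URL)

# Map the source's columns onto the project-wide schema defined in schema.xlsx.
# These column names are illustrative only.
cleaned = pd.DataFrame({
    "name": raw["site_name"],
    "address": raw["street_address"],
    "city": raw["city"],
    "state": "PA",
    "zip_code": raw["zip"].astype(str),
})

# Save the cleaned data as a .csv so the merge step can pick it up.
cleaned.to_csv("new_food_source_cleaned.csv", index=False)
```

Remember to register any such script in source_python_scripts.py (or source_r_scripts.R for R scripts) so it runs when the full dataset is rebuilt.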
Because we are combining multiple data sets, we utilize a deduplication process to identify and resolve possible duplicates.
The "training set", which is used to train the deduplication program in what is and is not likely to be a duplicate, is located here. Adding to the training data primarily consists of added cases of address strings that are duplicates, as well as cases of address strings that aren't. To train new data, you can utilize the IPython Notebook located here.
If you're new to R, RStudio's beginner resources are a good place to start: https://education.rstudio.com/learn/beginner/