This project is part of the Smith College Statistical and Data Science Capstone in Spring 2023 and is kindly sponsored by Dance Data Project® ("Dance Data Project"), a non-profit organization advocating for girls and women in dance. The project examines the longitudinal record of dance company endowments before and after the pandemic and analyzes their performance. In particular, we looked for noticeable patterns and discrepancies in how companies used their endowments over time. The repository contains open-access data bytes in HTML and PDF format that present our analyses.
Contributions | Name (alpha order) |
---|---|
๐ค ๐ข ๐ป | Ruth Button |
๐ป ๐ ๐ข ๐ค ๐ | Rose Evard |
๐ฃ ๐ค ๐ | Andrew Hoekstra |
๐ข ๐ป ๐ค๐ | Zhen Nie |
๐ฃ ๐ข ๐ป ๐ค ๐ | Quinn White |
๐ผ ๐ค ๐ | Elizabeth Yntema |
(For a key to the contribution emoji or more info on this format, check out "All Contributors.")
This code is written for the R programming language (4.2.1) and RStudio. Make sure both R and RStudio are up to date before you start. Any operating system compatible with R and RStudio will work. The necessary packages to install are `broom`, `tidyverse`, `xml2`, `kableExtra`, `here`, `plotly`, `scales`, `readxl`, `purrr`, and `shiny`. Running `INSTALL_ALL.R` will load all dependent packages.
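
If you prefer to check your setup by hand, the following is a minimal sketch of how to find which of the listed packages are not yet installed. It is not the contents of `INSTALL_ALL.R`; the helper name `missing_packages` is hypothetical, and the install call is left commented out so that nothing is installed without your say-so.

```r
# Packages named in this README.
required_packages <- c(
  "broom", "tidyverse", "xml2", "kableExtra", "here",
  "plotly", "scales", "readxl", "purrr", "shiny"
)

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
print(to_install)

# Uncomment to install whatever is missing:
# install.packages(to_install)
```
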
Before running these analyses, we obtained a set of XML files corresponding to companies of interest, where these XML files contain 990 form data in the format reported by the IRS. All R packages needed are installed using `INSTALL_ALL.R`.
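
To give a sense of what reading one of these XML files looks like, here is a small sketch using `xml2`. The inline document is a made-up stand-in for a real filing, and the element names (`Filer`, `EIN`, `BusinessNameLine1Txt`) are assumptions about the IRS e-file layout that may differ across schema versions. The function `read_filer` is hypothetical and is not taken from this repository.

```r
# A tiny, made-up stand-in for a 990 e-file; real files are much larger.
example_xml <- '<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer>
      <EIN>000000001</EIN>
      <BusinessName>
        <BusinessNameLine1Txt>EXAMPLE BALLET A INC</BusinessNameLine1Txt>
      </BusinessName>
    </Filer>
  </ReturnHeader>
</Return>'

# Hypothetical helper: pull the EIN and reported business name from one filing.
# `x` can be a file path or a literal XML string.
read_filer <- function(x) {
  # Drop the default namespace so plain XPath expressions match.
  doc <- xml2::xml_ns_strip(xml2::read_xml(x))
  data.frame(
    ein = xml2::xml_text(xml2::xml_find_first(doc, "//Filer/EIN")),
    reported_name = xml2::xml_text(
      xml2::xml_find_first(doc, "//Filer//BusinessNameLine1Txt")
    ),
    stringsAsFactors = FALSE
  )
}

# Only run the example when xml2 is available.
if (requireNamespace("xml2", quietly = TRUE)) {
  print(read_filer(example_xml))
}
```
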
The script `RUN_ALL.R` runs all files in the `infrastructure_rmds` directory as well as the `exploration_rmds` directory. HTML outputs are placed in the `output_html` subdirectories of `infrastructure_rmds` and `exploration_rmds`.
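
For readers curious how a script like this can work, here is a rough sketch that knits every `.Rmd` file in a directory into an `output_html` subfolder with `rmarkdown::render`. It is an illustration under assumptions, not the actual contents of `RUN_ALL.R`; the helper names `find_rmds` and `render_directory` are hypothetical, and the sketch knits files in alphabetical order, which may not match the order the real script uses.

```r
# Hypothetical helper: list the .Rmd files directly inside `dir`.
find_rmds <- function(dir) {
  sort(list.files(dir, pattern = "\\.Rmd$", full.names = TRUE))
}

# Hypothetical helper: knit each .Rmd in `dir` into `dir/output_html`.
render_directory <- function(dir) {
  out_dir <- file.path(dir, "output_html")
  for (rmd in find_rmds(dir)) {
    rmarkdown::render(rmd, output_dir = out_dir, quiet = TRUE)
  }
  invisible(out_dir)
}

# Directory names as given in this README; skipped if they do not exist.
for (dir in c("infrastructure_rmds", "exploration_rmds")) {
  if (dir.exists(dir)) render_directory(dir)
}
```
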
The file `companies.csv` maps the EIN to the company name for all companies tracked by the DDP. This provides a stable name for each EIN, since companies may change their names slightly or have small variations in the format of their reported business name in the XML files (e.g., differences in capitalization).
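
As an illustration of why this lookup is useful, the sketch below joins filings to a company table by EIN so that each filing gets one stable name. All values are made up, and the column names (`ein`, `company_name`, `reported_name`, `tax_year`) are assumptions; the real `companies.csv` may use different ones. The EIN is read as text so that leading zeros are preserved.

```r
# Made-up stand-in for companies.csv; column names are assumptions.
companies <- read.csv(
  text = "ein,company_name
000000001,Example Ballet A
000000002,Example Ballet B",
  colClasses = c(ein = "character"),
  stringsAsFactors = FALSE
)

# Made-up filings with small variations in the reported business name.
filings <- data.frame(
  ein = c("000000001", "000000001", "000000002"),
  reported_name = c("EXAMPLE BALLET A INC", "Example Ballet A, Inc.",
                    "Example ballet B"),
  tax_year = c(2019, 2020, 2020),
  stringsAsFactors = FALSE
)

# Left join on EIN: every filing keeps its row and gains the stable name.
standardized <- merge(filings, companies, by = "ein", all.x = TRUE)
print(standardized)
```
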
This repo contains all code created by Smith SDS Capstone '23 students for Dance Data Project. There are two main folders containing the R Markdown files used for analyses:
- `infrastructure_rmds`: contains all wrangling, troubleshooting, dictionary, and testing code. The most important files within this are `load_wrangle_filter.Rmd`, which establishes the base datasets and filters, and `handle_discrepancies.Rmd`, which identifies reported discrepancies within Form 990s and produces a more flexible dataset with endowment information from Schedule D.
- `explorations_rmds`: all analyses, including examinations of labor, endowment balances, compensation, and location.

All knitted HTML files from the R Markdown files are within a nested folder called `output_html` in the respective parent folder.
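
To make the idea of a reported discrepancy concrete, here is one possible check, sketched with made-up numbers: comparing each year's beginning-of-year endowment balance with the previous year's end-of-year balance. This is only an assumption about the kind of check involved and is not the logic in `handle_discrepancies.Rmd`; the column names are hypothetical, and the sketch assumes one row per consecutive tax year for a single company.

```r
# Made-up balances for one hypothetical company, one row per tax year.
balances <- data.frame(
  tax_year = c(2018, 2019, 2020),
  begin_balance = c(100, 120, 150),
  end_balance = c(120, 140, 170)
)

# Sort by year, then line up each year with the prior year's ending balance.
balances <- balances[order(balances$tax_year), ]
balances$prior_end_balance <- c(NA, head(balances$end_balance, -1))

# A non-zero difference means the two filings disagree about the same balance.
balances$discrepancy <- balances$begin_balance - balances$prior_end_balance
flagged <- balances[!is.na(balances$discrepancy) & balances$discrepancy != 0, ]
print(flagged)
```
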
R scripts with universal functions (`GET_VARS.R`, `INSTALL_ALL.R`, `RUN_ALL.R`) are within the main directory.
Original data utilized for this project are not contained within this repo. However, all data produced by infrastructure rmarkdowns are saved in a folder called `data` in `.RDS` form. All analyses assume your data are stored in XML format, in a folder called `ballet_990_released_20230208`.
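
The sketch below shows the general pattern of saving a dataset as `.RDS` and reading it back, plus how the XML files in the expected folder could be listed. The data frame and the file name `endowments_example.RDS` are made up for illustration, and a temporary directory stands in for the `data` folder so the example does not touch your project files.

```r
# Use a temporary folder as a stand-in for the repo's data folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Made-up dataset and hypothetical file name.
endowments <- data.frame(
  ein = "000000001",
  tax_year = 2020,
  end_balance = 170,
  stringsAsFactors = FALSE
)
rds_path <- file.path(data_dir, "endowments_example.RDS")

# Save the object, then read it back unchanged.
saveRDS(endowments, rds_path)
restored <- readRDS(rds_path)
print(identical(endowments, restored))

# List the raw XML files; this is empty if the folder is not present.
xml_files <- list.files(
  "ballet_990_released_20230208",
  pattern = "\\.xml$",
  full.names = TRUE
)
print(length(xml_files))
```
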
The folder `css` contains CSS code to produce standardized knitted HTMLs.
This work is licensed under an MIT license.
Questions, bug reports, and feature requests can be submitted to this repo's issue queue.
Contact Andrew Hoekstra here.