Nonprofit-Open-Data-Collective / compensator

An R package for estimating compensation of nonprofit executives.
https://nonprofit-open-data-collective.github.io/compensator/

Data Objects #5

Open olbeck opened 1 year ago

olbeck commented 1 year ago

Which data objects do we need for the package to work?

Store raw data, along with the datasets.R processing scripts, in the data-raw folder (for example, the census regions used to generate geographic distance, the original NTEE codes, etc.).

Save package objects in data folder as rda objects.

Add roxygen details for each required object to the data.R script in the main R folder (example); a roxygen sketch follows the list below.

Use dashes for folder and filenames but not for R object names.

```r
save( date.words, file="../data/date-words.rda" )
```

Note that when date-words.rda is loaded it will load the object contained within, date.words, not date-words. This prevents having to use backticks when referencing object names:

```r
date.words
# vs
`date-words`
```
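
For reference, here is a minimal roxygen sketch of what an entry in R/data.R could look like; the column list shown for nonprofits is an illustrative assumption, not the final documentation:

```r
# R/data.R (sketch only; the documented columns are placeholders)

#' Nonprofit comparison set
#'
#' Table of all nonprofits available for comparison. Each row is a unique EIN;
#' columns are characteristics of that organization useful to the user.
#'
#' @format A data frame with one row per EIN:
#' \describe{
#'   \item{EIN}{Employer Identification Number (assumed column).}
#'   \item{state}{Two-letter state abbreviation (assumed column).}
#' }
#' @source Built by data-raw/making-nonprofits.R
"nonprofits"
```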
olbeck commented 1 year ago

Listing data objects that need to be stored and where

| Data Object Name | storage-location/file-name.rda | data-raw/file-name.R | Description | Original Source |
|---|---|---|---|---|
| nonprofits | data/nonprofits.rda | data-raw/making-nonprofits.R | Table of all nonprofits available for comparison. Rows are a unique EIN; columns are characteristics of that organization that would be useful to the user. See R/data.R for details on characteristics included. | |
| EIN.filtering | ? (does it need to be in sysdata, or just in data/?) | data-raw/making-nonprofits.R | Table of all nonprofits available for comparison; each row is a unique EIN, columns are characteristics of that organization that are helpful in filtering and distance calculation. Can be matched to nonprofits through EIN. See R/data.R for details on characteristics included. This table essentially gives the crosswalk between the original NTEE codes and how we are choosing to categorize mission. | |
| state.dist.mat | data/state-dist-matrix.rda | data-raw/making-state-distance-matrix.R | 52-by-52 matrix of distances between every pair of states + DC + PR. Equivalently, this is the number of states you would need to drive through to get from state A to state B (where Alaska is connected to Washington, Hawaii is connected to California, and Puerto Rico is connected to Florida). | from state.borders |
| state.abb52 | data/state-abb.rda | data-raw/making-state-abb.R | 52 state two-letter abbreviations. Not named state.abb because that is a data set in the datasets package. | datasets::state.abb |
| state.borders | data-raw/state-borders.rda | data-raw/making-state-borders.R | Table of state information: state name, bordering states, number of bordering states, US Census region, and two-letter state abbreviation. | https://thefactfile.org/u-s-states-and-their-border-states/ |
| ntee.crosswalk | data/ntee-crosswalk.rda | data-raw/making-ntee-crosswalk.R | Crosswalk between old NTEE codes and new codes used for distance. Currently only the 1 letter + 2 digits portion works; will eventually need to be updated to include the 1 letter + 4 digits as well. | https://nccs.urban.org/project/national-taxonomy-exempt-entities-ntee-codes |
| ntee.orig | data/ntee-orig.rda | data-raw/getting-ntee-original.R | Table of the original NTEE codes. | https://nccs.urban.org/publication/irs-activity-codes |
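
For context, a minimal sketch of how the network-style state distance could be computed from a borders edge list with igraph shortest paths (the edge list shown here is a tiny illustrative subset, not the real state.borders data):

```r
library(igraph)

# Tiny illustrative subset of bordering-state pairs; the real edge list would be
# built from state.borders in data-raw/making-state-borders.R.
borders <- data.frame(
  from = c("WA", "OR", "CA", "FL"),
  to   = c("AK", "WA", "HI", "PR")
)

# Undirected adjacency graph of states; shortest-path lengths count the border
# crossings needed to get from state A to state B.
g <- graph_from_data_frame(borders, directed = FALSE)
dist.mat <- distances(g)
```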
olbeck commented 1 year ago

Do we need a table of geographic codes if we are doing the state distances as a network distance? I have the census regions stored in the data-raw/making-state-distance-matrix.R file. I just don't save them in the final data/state-dist-matrix.rda since they don't seem relevant for the way we have decided to calculate geographic distance.

olbeck commented 1 year ago

For the EIN - NTEE crosswalk, is the EIN.filtering table enough? Or do we want to create a separate table for this with just the NTEE and our recoded mission values? Currently EIN.filtering contains this as well as location and total expense information.

Alternatively, I could rename EIN.filtering to something like EIN.NTEE.crosswalk to be more explicit to the user about what this data file is, then store it in the data/ folder instead of inside sysdata.rda, as it would no longer be internal only.
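
For reference, a minimal sketch of the two storage options being weighed here, using the proposed EIN.NTEE.crosswalk name purely for illustration (filenames follow the dash convention above):

```r
# Option 1: exported data set, saved under data/ so users can access it directly
# and it gets documented in R/data.R. The object name is the proposed rename,
# shown for illustration only.
EIN.NTEE.crosswalk <- EIN.filtering
save( EIN.NTEE.crosswalk, file = "data/EIN-NTEE-crosswalk.rda" )

# Option 2: internal-only data, bundled into R/sysdata.rda and not visible to
# users. (usethis::use_data(EIN.filtering, internal = TRUE) is the usual wrapper.)
save( EIN.filtering, file = "R/sysdata.rda" )
```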

olbeck commented 1 year ago

nonprofits is in a usable state at the moment, but should be more meticulously looked at before we publish to ensure it is in exactly the format we want, does not have repeat organizations, does not include transition years, and includes/excludes the characteristics we want the user to see.
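
A couple of quick pre-release checks along these lines might help (EIN and year are assumed column names, not the confirmed schema):

```r
# Hypothetical sanity checks on the nonprofits table before publishing.
sum(duplicated(nonprofits$EIN))        # count of repeat organizations (should be 0)
range(nonprofits$year, na.rm = TRUE)   # confirm year coverage (currently ends at 2019)
```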

This data only goes up to 2019. Could we get more recent data before we publish? Or is that too big of a task?

lecy commented 1 year ago

> Do we need a table of geographic codes if we are doing the state distances as a network distance? I have the census regions stored in the data-raw/making-state-distance-matrix.R file. I just don't save them in the final data/state-dist-matrix.rda since they don't seem relevant for the way we have decided to calculate geographic distance.

By adapted geographic codes I meant documentation of how we create geographic distances, so the state distance matrix is the only thing we would need.

Make sure the process is reproducible, though. Do we have the script that starts with the raw list of states and outputs the distance matrix? The raw state data file and script should be available.

One approach is to add them to the data-raw folder and add a README that describes the files and scripts in the folder, as well as the workflow (this script uses this raw data file and, after processing, produces the RDA file "filename.rda", which is used for purpose ... and saved in the main package data folder).

Another approach is to have one RMD file that does all of the raw data processing. This approach allows you to update all files with a single script and also document them directly in the RMD.

Either way, follow the "hit by a bus" principle: if the package creator disappears from the team, would someone else have everything they need to maintain and extend the package? Raw files, scripts, instructions, etc.
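
For example, a single driver script along these lines would make the whole data-raw workflow re-runnable in one pass (the build-all.R name is hypothetical; the sourced scripts are the ones listed in the table above):

```r
# data-raw/build-all.R (hypothetical driver script): rebuild every packaged data
# object from raw sources, in dependency order.
source("data-raw/making-state-borders.R")          # -> data-raw/state-borders.rda
source("data-raw/making-state-abb.R")              # -> data/state-abb.rda
source("data-raw/making-state-distance-matrix.R")  # -> data/state-dist-matrix.rda
source("data-raw/getting-ntee-original.R")         # -> data/ntee-orig.rda
source("data-raw/making-ntee-crosswalk.R")         # -> data/ntee-crosswalk.rda
source("data-raw/making-nonprofits.R")             # -> data/nonprofits.rda
```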

lecy commented 1 year ago

> For the EIN - NTEE crosswalk, is the EIN.filtering table enough? Or do we want to create a separate table for this with just the NTEE and our recoded mission values?

The filtering table is enough for the package code, but insufficient for making the process reproducible. Specifically, how were the new NTEE codes created?

I also suspect that the current data does not contain all possible NTEE codes, so it would not be sufficient to maintain the package moving forward.

The new NTEE format is useful enough that I want to make a separate repository. I'll set one up that can be simple - just a raw NTEE file, a script to reformat the single NTEE code into the groups we have defined, and the generated crosswalk file. Does that make sense?
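
As a rough illustration of the reformatting step, splitting a raw NTEE code into the pieces used for grouping could look like this (column names are placeholders; the actual grouping rules would live in the new repository):

```r
# Hypothetical reformatting sketch: split raw NTEE codes (currently 1 letter +
# 2 digits) into the components used to define the mission groups.
ntee.raw <- data.frame(NTEE = c("A23", "B21", "P81"))
ntee.raw$major.group <- substr(ntee.raw$NTEE, 1, 1)  # one-letter major group
ntee.raw$two.digit   <- substr(ntee.raw$NTEE, 2, 3)  # two-digit code
```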

lecy commented 1 year ago

> This data only goes up to 2019. Could we get more recent data before we publish? Or is that too big of a task?

I'm working on this now. Should have something ready soon.

lecy commented 1 year ago

I added your table above to the data-raw/README.md file:

https://github.com/Nonprofit-Open-Data-Collective/compensator/blob/main/data-raw/README.md

olbeck commented 1 year ago

> Do we need a table of geographic codes if we are doing the state distances as a network distance? I have the census regions stored in the data-raw/making-state-distance-matrix.R file. I just don't save them in the final data/state-dist-matrix.rda since they don't seem relevant for the way we have decided to calculate geographic distance.
>
> By adapted geographic codes I meant documentation of how we create geographic distances, so the state distance matrix is the only thing we would need.
>
> Make sure the process is reproducible, though. Do we have the script that starts with the raw list of states and outputs the distance matrix? The raw state data file and script should be available.

The geographic distances are now reproducible. See data-raw/making-state-borders.R and data-raw/making-state-distance-matrix.R.

lecy commented 1 year ago

excellent!

olbeck commented 1 year ago

> For the EIN - NTEE crosswalk, is the EIN.filtering table enough? Or do we want to create a separate table for this with just the NTEE and our recoded mission values?
>
> The filtering table is enough for the package code, but insufficient for making the process reproducible. Specifically, how were the new NTEE codes created?
>
> I also suspect that the current data does not contain all possible NTEE codes, so it would not be sufficient to maintain the package moving forward.
>
> The new NTEE format is useful enough that I want to make a separate repository. I'll set one up that can be simple - just a raw NTEE file, a script to reformat the single NTEE code into the groups we have defined, and the generated crosswalk file. Does that make sense?

The crosswalk between the new NTEE codes and the disaggregated version is created in the mission-taxonomies repository and stored at https://github.com/Nonprofit-Open-Data-Collective/mission-taxonomies/blob/main/NTEE-disaggregated/ntee-crosswalk.rda. This crosswalk is also stored in the compensator package.

olbeck commented 1 year ago

Documentation in R/data.R is not currently up to date. Need to update: