DS4PS / course_website

https://ds4ps.github.io/course_website/
0 stars 0 forks source link

Lab 10 -- Not getting the correct summary statistics for Part 1, #5 #14

Open swkilar opened 6 years ago

swkilar commented 6 years ago

I thought I was on the right track and that I merged the data correctly, but I am not getting the same summary statistics as what's shown in the lab. I may have gone wrong in the original counting (part 2) or I may be mutating the density figures into the wrong table (I mutated into my merged data, but from the lab instructions it looks like Dr. Lecy mutated them into separate 2010 and 200 tables). Can anyone see what I'm doing wrong in my code, below? Thank you!

Part 1

1) Combine the 2000 and 2010 nonprofit data using the row binding rbind() function.

np.dat <- rbind( np.dat.2000, np.dat.2010 )

2) Count the nonprofits in each county for each year. Use “FIPS” for the county code and “YEAR” for the tax year.

np.dat %>%
  filter( YEAR == 2000 ) %>%
  group_by( FIPS ) %>%
  count (n_distinct(EIN)) %>%
  arrange( -n )
#There must be an easier way to do this than using n_distinct.

np.dat %>%
  filter( YEAR == 2010 ) %>%
  group_by( FIPS ) %>%
  count (n_distinct(EIN)) %>%
  arrange( -n )

3) Add county population to this dataset by merging the Census data with the nonprofit data.

merged.data <- merge( np.dat, pop, by.x = c( "FIPS","YEAR" ), by.y = c( "fips", "year" ), all.x = TRUE)

4) Calculate the nonprofit density per county using the following formula: Density = Number of nonprofits/ (population/100,000)

number.per.county.2000 <- 
merged.data %>%
  filter( YEAR == 2000 ) %>%
  group_by( FIPS ) %>%
  count (n_distinct(EIN))

number.per.county.2010 <- 
merged.data %>%
  filter( YEAR == 2010 ) %>%
  group_by( FIPS ) %>%
  count (n_distinct(EIN))

merged.data <- mutate (merged.data, density.2000= (number.per.county.2000$n / (merged.data$population/100000)))

merged.data <- mutate (merged.data, density.2010=(number.per.county.2010$n / (merged.data$population/10000)))

5) Report the summary statistics for the 2000 density measures.

summary( merged.data$density.2000) %>% pander()
summary( merged.data$density.2010) %>% pander()
lecy commented 6 years ago

You are on the right track, you just switched two operations.

You merged the data in Step 03, then counted nonprofits in Step 04. You were supposed to count nonprofits in Step 02, then merge population to the new counts dataset in Step 03.

Because you switched the order, when you calculate the density in Step 04 you are running into a serious problem:

mutate( merged.data, density.2000= (number.per.county.2000$n / (merged.data$population/100000)) )
# BUT:
length( number.per.county.2000$n )
[1] 3049
length( merged.data$population/100000 )
[1] 619152

So you are dividing a counts vector of length 3,049 by a population vector of length 619,152. R will just use recycling to complete the calculation:

c(1,2,3,4,5,6) / c(1,2)
[1] 1 1 3 2 5 3
# above is same as c(1,2,3,4,5,6) / c(1,2,1,2,1,2)

#[1,]    1    1
#[2,]    2    2
#[3,]    3    1
#[4,]    4    2
#[5,]    5    1
#[6,]    6    2

As a result, there is misalignment of rows in your calculation. The fix is easy, though. You just need to save your count data from Step 02 as a new dataset. Note that it contains the FIPS code, and your population data contains a FIPS code, so you can not add population to the counts dataset.

Also, you can group_by( FIPS, YEAR ) and do all of the calculations at the same time. If you group_by() %>% count() then you don't need to use distinct_n().

Make sense?

swkilar commented 6 years ago

Success! Thank you.