UrbanInstitute / education-data-package-r

https://urbaninstitute.github.io/education-data-package-r/

Unexpected API behavior for CCD Enrollment #81

Closed jknowles closed 3 years ago

jknowles commented 3 years ago

I'd like to be able to get the school enrollment by race for all grades for all schools in 2018 (as in the example in the README). When I run the API call like this, it takes a long time, but it summarizes the data as I would expect:

all_schools_enroll <- get_education_data(level = 'schools',
                         source = 'ccd',
                         topic = 'enrollment',
                         by = list('race', 'sex'),
                         filters = list(year = 2018, grade = 'grade-99'),
                         add_labels = FALSE,
                         csv = FALSE)

I'd like to speed this up by fetching the CSV file first, as the documentation recommends, so I tried:

all_schools_enroll_csv <- get_education_data(level = 'schools',
                         source = 'ccd',
                         topic = 'enrollment',
                         by = list('race', 'sex'),
                         filters = list(year = 2018, grade = 'grade-99'),
                         add_labels = TRUE,
                         csv = TRUE)

When I run this, the CSV downloads, but the resulting data frame has 0 rows. I suspect the grade filter is the cause, but I tried various combinations of add_labels = FALSE and different values for grade = XXXX in the filters, to no avail.

Thanks for all the great work here - this is not an urgent problem as a workaround exists, but it would be really helpful to speed up national analysis using school or LEA level results.
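
For anyone hitting the same thing, the workaround mentioned above could be wrapped up roughly like this. It is only a sketch: fetch_with_fallback is a hypothetical helper name, not part of educationdata, and it assumes the fetch function accepts a csv argument and returns a data frame, as get_education_data does in the calls above.

```r
# Hypothetical helper (not part of educationdata): try the faster CSV path
# first, and fall back to the paged API if the CSV path returns no rows.
# `fetch` is any function that takes a `csv` argument and returns a data frame.
# Do not pass `csv` yourself in `...`; the helper sets it.
fetch_with_fallback <- function(fetch, ...) {
  result <- fetch(..., csv = TRUE)
  if (nrow(result) == 0) {
    message("CSV path returned 0 rows; retrying through the API")
    result <- fetch(..., csv = FALSE)
  }
  result
}

# Usage with the call from this issue (needs the package and network access):
# all_schools_enroll <- fetch_with_fallback(
#   educationdata::get_education_data,
#   level = 'schools', source = 'ccd', topic = 'enrollment',
#   by = list('race', 'sex'),
#   filters = list(year = 2018, grade = 'grade-99'),
#   add_labels = FALSE
# )
```

Note that this treats a legitimately empty result the same as the bug, so a query that truly matches nothing is fetched twice.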

khueyama commented 3 years ago

@jknowles thank you for the detailed report and sorry to hear you are running into issues. I'll take a look into this and let you know when I've got things back to working as expected.

khueyama commented 3 years ago

@jknowles just pushed an update that should solve this issue. For example:

library(educationdata)

all_schools_enroll <- get_education_data(
  level = 'schools',
  source = 'ccd',
  topic = 'enrollment',
  subtopic = list('race', 'sex'),
  filters = list(year = 1990, grade = 'grade-99', fips = 56),
  add_labels = TRUE,
  csv = FALSE
)
#> 
#> Fetching data for schools/ccd/enrollment/1990/grade-99/race/sex/?fips=56 ...
#> Processing page 2 out of 3
#> Processing page 3 out of 3

all_schools_enroll_csv <- get_education_data(
  level = 'schools',
  source = 'ccd',
  topic = 'enrollment',
  subtopic = list('race', 'sex'),
  filters = list(year = 1990, grade = 'grade-99', fips = 56),
  add_labels = TRUE,
  csv = TRUE
)
#> 
#> Fetching data for schools_ccd_enrollment_1990.csv ...

dim(all_schools_enroll)
#> [1] 2490    9
dim(all_schools_enroll_csv)
#> [1] 2490    9

Both calls now return data frames of the same size. Thanks again for the helpful bug report, and let me know if you run into any other issues.
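
Matching dim() output is a quick check. A slightly stricter comparison could also confirm that both paths return the same set of columns. The sketch below uses small toy data frames in place of the real results, and same_shape is a hypothetical helper name, not something the package provides.

```r
# Hypothetical helper: TRUE when two data frames have the same dimensions
# and the same column names, regardless of column order.
same_shape <- function(a, b) {
  identical(dim(a), dim(b)) && setequal(names(a), names(b))
}

# Toy stand-ins for the API and CSV results (made-up values).
api_like <- data.frame(id = c("a", "b"), enrollment = c(10L, 20L))
csv_like <- data.frame(enrollment = c(20L, 10L), id = c("b", "a"))

same_shape(api_like, csv_like)
```

With the real objects, the call would be same_shape(all_schools_enroll, all_schools_enroll_csv). It does not compare cell values, so it complements a row-level comparison and does not replace one.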

jknowles commented 3 years ago

Hey, this worked great. Is the switch from by = c() to subtopic = c() for filtering going to be permanent (e.g. for a CRAN release)? I'm working on some code that I'd like to put in production and want to make sure I'm using the right function calls going forward.

Great work and thanks for the fix!

khueyama commented 3 years ago

Yes, please use subtopic going forward for production code! We realized the by terminology was confusing, especially with the summary endpoint functionality we're building out, so I've soft-deprecated it for now.
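
For readers unfamiliar with the pattern, here is a rough illustration of how an old argument name can keep working while a new one takes over. This is not the package's actual implementation: resolve_subtopic is a hypothetical name, and whether educationdata warns when by is used is an assumption made for the example.

```r
# Hypothetical helper, not part of educationdata: fold the old `by`
# argument and the new `subtopic` argument into a single value.
resolve_subtopic <- function(subtopic = NULL, by = NULL) {
  if (!is.null(by)) {
    warning("`by` is deprecated; please use `subtopic` instead.", call. = FALSE)
    # An explicitly supplied `subtopic` wins over the old name.
    if (is.null(subtopic)) subtopic <- by
  }
  subtopic
}

# The new name passes through without a warning.
resolve_subtopic(subtopic = list("race", "sex"))

# The old name still resolves to the same value; the warning is muted here.
suppressWarnings(resolve_subtopic(by = list("race", "sex")))
```

Either way, code written with subtopic avoids the old-name path entirely, which is the safer choice for production.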