UrbanInstitute / education-data-package-r

https://urbaninstitute.github.io/education-data-package-r/

Error in encoding when adding labels to CCD directory data #91

Closed jknowles closed 1 year ago

jknowles commented 2 years ago

When the labels are added to the FIPS variable for the CCD directory data, an encoding error occurs in the label on my machine. Quick reproducible example below:

library(educationdata)

# Pull the 2018 CCD school directory for FIPS 78 with labels attached
dir_data <- get_education_data(
  level = 'schools', 
  source = 'ccd',
  topic = 'directory',
  filters = list(fips = 78, year = 2018),
  add_labels = TRUE,
  csv = FALSE
)

table(as.character(dir_data$fips))

This results in:

Virgin Islands of the US 
                       28

Which should be: Virgin Islands of the US
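One way to narrow down a report like this is to look at the declared encoding and raw bytes of the label rather than its printed form. The following is a minimal sketch using only base R. `describe_encoding` is a hypothetical helper, not part of the educationdata package, and the sample string is a stand-in for the real label values.

```r
# Hypothetical helper (not part of the educationdata package): report the
# declared encoding, UTF-8 validity, and raw bytes of each string.
describe_encoding <- function(x) {
  data.frame(
    declared   = Encoding(x),
    valid_utf8 = validUTF8(x),
    bytes      = vapply(
      x,
      function(s) paste(charToRaw(s), collapse = " "),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

# Stand-in value; with the real data, pass unique(as.character(dir_data$fips)).
info <- describe_encoding("Virgin Islands of the US")
print(info)
```

Any byte outside the plain ASCII range in the `bytes` column would point to the character that the console is rendering incorrectly.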

My locale information is below:


 setting  value
 version  R version 4.1.0 (2021-05-18)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2021-12-09
 rstudio  2021.09.0+351 Ghost Orchid (desktop)
 pandoc   NA

erika-tyagi commented 2 years ago

Hi @jknowles – apologies for the delay! I'm not able to reproduce this issue. Could you confirm if you still see the same encoding issue with the example you provided? And if so, could you see if you run into the same issue with the following:

  1. Flipping the csv flag:

    dir_data_from_csv <- get_education_data(
      level = 'schools',
      source = 'ccd',
      topic = 'directory',
      filters = list(fips = 78, year = 2018),
      add_labels = TRUE,
      csv = TRUE
    )

  2. Using a different endpoint (e.g. IPEDS directory):

    ipeds_dir_data <- get_education_data(
      level = 'college-university',
      source = 'ipeds',
      topic = 'directory',
      filters = list(fips = 78, year = 2018),
      add_labels = TRUE,
      csv = FALSE
    )

jknowles commented 2 years ago

Thank you - no problem with the delay.

I am still seeing the encoding issue when I run the code. If you aren't seeing it, perhaps our R installations differ in their base encoding. I see the same issue with the csv flag flipped and with the IPEDS endpoint.

Perhaps it comes down to how the data from the API interacts with the locale set in R. FWIW, this is my current R locale:

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
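
To illustrate how a 1252 locale could garble a UTF-8 label, here is a small base R sketch. It assumes, purely for illustration, that the label contains a non-breaking space; the thread does not say which character was affected or confirm that this was the root cause. The encoding name `CP1252` is assumed to be recognized by the local iconv implementation.

```r
# Assumption for illustration only: the label contains a non-breaking space
# (U+00A0). The thread does not say which character was actually affected.
nbsp <- rawToChar(as.raw(c(0xc2, 0xa0)))  # UTF-8 bytes for U+00A0
Encoding(nbsp) <- "UTF-8"
label <- paste0("Virgin Islands", nbsp, "of the US")

# Read the same UTF-8 bytes as if they were Windows-1252, the code page
# reported in the locale above.
misread <- iconv(label, from = "CP1252", to = "UTF-8")

nchar(label, type = "chars")    # the space counts as one character
nchar(misread, type = "chars")  # the misread string gains one character

# Reversing the conversion recovers the original bytes.
repaired <- iconv(misread, from = "UTF-8", to = "CP1252")
identical(charToRaw(repaired), charToRaw(label))
```

A stray extra character next to a space in the printed label would be consistent with this kind of misreading, although other causes are possible.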

erika-tyagi commented 2 years ago

Hi @jknowles – I believe we tracked down and resolved this issue in our latest data release (which went live yesterday). Could you try again and see if this is resolved on your end? And thanks for your patience!

jknowles commented 1 year ago

I can confirm this is working correctly now - thank you @erika-tyagi