Chicago / RSocrata

Provides easier interaction with Socrata open data portals http://dev.socrata.com. Users can provide a 'Socrata' data set resource URL, or a 'Socrata' Open Data API (SoDA) web query, or a 'Socrata' "human-friendly" URL, returns an R data frame. Converts dates to 'POSIX' format. Manages throttling by 'Socrata'.
https://CRAN.R-project.org/package=RSocrata
Other
236 stars 84 forks source link

Can read.socrata() return numeric fields as numeric (not character)? #192

Closed Lulliter closed 3 years ago

Lulliter commented 3 years ago

I am working in R with EU cohesion data (specifically datasets 99js-gm52, xns4-t7ym, 3kkx-ekfq )

The read.socrata() functions returns numeric fields as chr in the dataframe. Is it possible to convert them into the correct field type?

tomschenkjr commented 3 years ago

@Lulliter - haven't been able to reproduce this issue. Looking at the first data set, I ran this code:

install.packages("RSocrata")
library(RSocrata)

d <- read.socrata("https://cohesiondata.ec.europa.eu/2014-2020-Finances/ESIF-2014-2020-Finance-Implementation-Details/99js-gm52")

str(d)

Which resulted in the following output:

'data.frame':   50792 obs. of  24 variables:
 $ ms                          : chr  "FR" "IT" "DE" "GR" ...
 $ ms_name                     : chr  "France" "Italy" "Germany" "Greece" ...
 $ cci                         : chr  "2014FR16M2OP011" "2014IT16M2OP001" "2014DE06RDRP012" "2014GR16M1OP001" ...
 $ title                       : chr  "Guyane - ERDF/ESF" "Molise  - ERDF/ESF" "Niedersachsen / Bremen - Rural Development" "Transport Infrastructure Environment and Sustainable Development - GR - ERDF/CF" ...
 $ ver                         : num  2 1.3 1.6 3 2.1 1.2 4 1.2 3 1.2 ...
 $ fund                        : chr  "ERDF" "ERDF" "EAFRD" "ERDF" ...
 $ priority                    : chr  "5" "5" "P4" "12" ...
 $ to                          : chr  "9" "6" "10" "6" ...
 $ to_short                    : chr  "Social Inclusion" "Environment Protection & Resource Efficiency" "Educational & Vocational Training" "Environment Protection & Resource Efficiency" ...
 $ to_long                     : chr  "Promoting social inclusion, combating poverty and any discrimination" "Preserving and protecting the environment and promoting resource efficiency" "Investing in education, training and vocational training for skills and lifelong learning" "Preserving and protecting the environment and promoting resource efficiency" ...
 $ category_of_region          : chr  "Less developed" "Transition" "VOID" "Transition" ...
 $ eafrd_measure               : chr  "VOID" "VOID" "M01" "VOID" ...
 $ measure_short_description   : chr  "VOID" "VOID" "Knowledge" "VOID" ...
 $ eafrd_fa                    : chr  "VOID" "VOID" "FA_P4" "VOID" ...
 $ focus_area_short_description: chr  "VOID" "VOID" "Priority 4" "VOID" ...
 $ eu_amount                   : num  10000000 6883565 38000000 4941487 1932500 ...
 $ national_amount             : num  1800000 6883565 9500000 1235372 644167 ...
 $ total_amount                : num  11800000 13767130 47500000 6176859 2576667 ...
 $ year                        : int  2017 2015 2014 2016 2016 2015 2018 2015 2017 2015 ...
 $ eu_co_financing             : num  84.8 50 80 80 75 ...
 $ total_eligible_cost         : num  5700000 0 0 6444394 742873 ...
 $ total_eligible_expenditure  : num  0 0 0 0 742874 ...
 $ ir_last_version             : num  201801 201601 NA 201701 2016 ...
 $ reference_date              : POSIXct, format: "2017-12-31" "2015-12-31" ...

Comparing that with the metadata, it looks like the correct data types are being used. Can you provide an example of what you're seeing, the version of RSocrata being used, version of R, and operating system?

Lulliter commented 3 years ago

Hello, here are my session info:

> sessioninfo::platform_info()
 version  R version 3.6.1 (2019-07-05)
 os       macOS Mojave 10.14.6        
RSocrata   1.7.10-6 2019-10-23 [1] CRAN (R 3.6.0)

I see what happens: If I run your same command I get the correct data types.

d <- read.socrata("https://cohesiondata.ec.europa.eu/2014-2020-Finances/ESIF-2014-2020-Finance-Implementation-Details/99js-gm52")

str(d)
'data.frame':   50797 obs. of  24 variables:
 $ ms                          : chr  "FR" "IT" "DE" "GR" ...
 $ ms_name                     : chr  "France" "Italy" "Germany" "Greece" ...
 $ cci                         : chr  "2014FR16M2OP011" "2014IT16M2OP001" "2014DE06RDRP012" "2014GR16M1OP001" ...
 $ title                       : chr  "Guyane - ERDF/ESF" "Molise  - ERDF/ESF" "Niedersachsen / Bremen - Rural Development" "Transport Infrastructure Environment and Sustainable Development - GR - ERDF/CF" ...
 $ ver                         : num  2 1.3 1.6 3 2.1 1.2 4 1.2 3 1.2 ...
 $ fund                        : chr  "ERDF" "ERDF" "EAFRD" "ERDF" ...
 $ priority                    : chr  "5" "5" "P4" "12" ...
 $ to                          : chr  "9" "6" "10" "6" ...
 $ to_short                    : chr  "Social Inclusion" "Environment Protection & Resource Efficiency" "Educational & Vocational Training" "Environment Protection & Resource Efficiency" ...
 $ to_long                     : chr  "Promoting social inclusion, combating poverty and any discrimination" "Preserving and protecting the environment and promoting resource efficiency" "Investing in education, training and vocational training for skills and lifelong learning" "Preserving and protecting the environment and promoting resource efficiency" ...
 $ category_of_region          : chr  "Less developed" "Transition" "VOID" "Transition" ...
 $ eafrd_measure               : chr  "VOID" "VOID" "M01" "VOID" ...
 $ measure_short_description   : chr  "VOID" "VOID" "Knowledge" "VOID" ...
 $ eafrd_fa                    : chr  "VOID" "VOID" "FA_P4" "VOID" ...
 $ focus_area_short_description: chr  "VOID" "VOID" "Priority 4" "VOID" ...
 $ eu_amount                   : num  10000000 6883565 38000000 4941487 1932500 ...
 $ national_amount             : num  1800000 6883565 9500000 1235372 644167 ...
 $ total_amount                : num  11800000 13767130 47500000 6176859 2576667 ...
 $ year                        : int  2017 2015 2014 2016 2016 2015 2018 2015 2017 2015 ...
 $ eu_co_financing             : num  84.8 50 80 80 75 ...
 $ total_eligible_cost         : num  5700000 0 0 6444394 742873 ...
 $ total_eligible_expenditure  : num  0 0 0 0 742874 ...
 $ ir_last_version             : num  201801 201601 NA 201701 2016 ...
 $ reference_date              : POSIXct, format: "2017-12-31" "2015-12-31" NA "2016-12-31" ...

But I had used the link indicated as the API Endpoint (I did so, following the indication in the Code Snippets examples, so I figured that was the link to use)

d1 <- read.socrata( "https://cohesiondata.ec.europa.eu/resource/99js-gm52.json")

> str(d1 )
'data.frame':   50797 obs. of  24 variables:
 $ ms                          : chr  "FR" "IT" "DE" "GR" ...
 $ ms_name                     : chr  "France" "Italy" "Germany" "Greece" ...
 $ cci                         : chr  "2014FR16M2OP011" "2014IT16M2OP001" "2014DE06RDRP012" "2014GR16M1OP001" ...
 $ title                       : chr  "Guyane - ERDF/ESF" "Molise  - ERDF/ESF" "Niedersachsen / Bremen - Rural Development" "Transport Infrastructure Environment and Sustainable Development - GR - ERDF/CF" ...
 $ ver                         : chr  "2.0" "1.3" "1.6" "3.0" ...
 $ fund                        : chr  "ERDF" "ERDF" "EAFRD" "ERDF" ...
 $ priority                    : chr  "5" "5" "P4" "12" ...
 $ to                          : chr  "9" "6" "10" "6" ...
 $ to_short                    : chr  "Social Inclusion" "Environment Protection & Resource Efficiency" "Educational & Vocational Training" "Environment Protection & Resource Efficiency" ...
 $ to_long                     : chr  "Promoting social inclusion, combating poverty and any discrimination" "Preserving and protecting the environment and promoting resource efficiency" "Investing in education, training and vocational training for skills and lifelong learning" "Preserving and protecting the environment and promoting resource efficiency" ...
 $ category_of_region          : chr  "Less developed" "Transition" "VOID" "Transition" ...
 $ eafrd_measure               : chr  "VOID" "VOID" "M01" "VOID" ...
 $ measure_short_description   : chr  "VOID" "VOID" "Knowledge" "VOID" ...
 $ eafrd_fa                    : chr  "VOID" "VOID" "FA_P4" "VOID" ...
 $ focus_area_short_description: chr  "VOID" "VOID" "Priority 4" "VOID" ...
 $ eu_amount                   : chr  "10000000" "6883565" "38000000" "4941487" ...
 $ national_amount             : chr  "1800000" "6883565" "9500000" "1235372" ...
 $ total_amount                : chr  "11800000" "13767130" "47500000" "6176859" ...
 $ year                        : chr  "2017" "2015" "2014" "2016" ...
 $ eu_co_financing             : chr  "84.75" "50" "80" "80" ...
 $ total_eligible_cost         : chr  "5700000" "0" "0" "6444394.36" ...
 $ total_eligible_expenditure  : chr  "0" "0" "0" "0" ...
 $ ir_last_version             : chr  "201801.0" "201601.0" NA "201701.0" ...
 $ reference_date              : POSIXct, format: "2017-12-31" "2015-12-31" NA "2016-12-31" ...

Thanks!

tomschenkjr commented 3 years ago

Makes sense. When you don't specify the data type (e.g., just copy/paste the URL from address bar), RSocrata defaults to download from CSV. But, you can provide a valid API endpoint to declare a specific file type. Socrata is optimized for speed with JSON and is considerably faster so this is sometimes necessary. However, Socrata places quotes around all values regardless of original data type so R will interpret numbers as characters. Likewise, Socrata doesn't provide an API to read field-level metadata so we can't correct this on the back-end.

Will close this issue for now. We have some open issues on better handling JSON files that we can tackle if Socrata releases some improvements on metadata. Do reach out if you have any other issues.

Lulliter commented 3 years ago

OK! Many thanks for the explanation.