chirunconf / chirunconf19

Discussion of potential projects for Chicago R Unconference, March 9-10, 2019

Open data access #6

Open emilyriederer opened 5 years ago

emilyriederer commented 5 years ago

Do you have any favorite sources of open data for personal or professional projects? Would you like an easier way to quickly pull them into R? This issue can help collect ideas for APIs in need of an R wrapper package (or static open data sets that could be shared via an R package).

maurolepore commented 5 years ago

Motivation

Lots of useful datasets live on GitHub. The gh package is a minimalistic GitHub API client that allows users, among other things, to find and read datasets from R, but its interface is unfamiliar to most R users. I propose to wrap the gh package to provide an easier-to-use tool for exploring files and reading data from GitHub into R.

Goal

The goal is to let users reuse what they already know about exploring files locally (e.g. with ls() or fs::dir_ls()) and about specifying GitHub addresses (e.g. devtools::install_github("r-lib/gh@master")).
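
As a rough illustration of the address part, here is a minimal sketch of how an `owner/repo/path@ref` string (the form used in the demo below) could be split into its pieces. `parse_gh_path()` is a hypothetical helper written for this example; it is not a function from ghr or gh, and the `"master"` default is an assumption.

```r
# Hypothetical helper: split "owner/repo/path@ref" into its components.
# Assumes the ref (branch, tag, or SHA) follows the last "@";
# falls back to `default_ref` when no "@" is present.
parse_gh_path <- function(x, default_ref = "master") {
  stopifnot(is.character(x), length(x) == 1)

  # Separate the optional "@ref" suffix from the rest of the address
  has_ref <- grepl("@", x, fixed = TRUE)
  ref <- if (has_ref) sub("^.*@", "", x) else default_ref
  address <- sub("@[^@]*$", "", x)

  # The first two pieces are owner and repo; anything left is the path
  pieces <- strsplit(address, "/", fixed = TRUE)[[1]]
  if (length(pieces) < 2) {
    stop("Expected at least 'owner/repo', got: ", x, call. = FALSE)
  }

  list(
    owner = pieces[[1]],
    repo  = pieces[[2]],
    path  = paste(pieces[-(1:2)], collapse = "/"),
    ref   = ref
  )
}

parts <- parse_gh_path("maurolepore/tor/inst/extdata/mixed@master")
str(parts)
```

With the pieces separated like this, a wrapper could pass `owner`, `repo`, `path`, and `ref` on to whatever API client it uses.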

Status

The ghr package implements the basic infrastructure and functionality. It is in an early stage of development but I think it could reach a useful and robust state during chirunconf.

Demo

Here is a realistic application. And below is a toy example:

install.packages("devtools")
devtools::install_github("maurolepore/ghr")
library(purrr)
library(ghr)

# Familiar syntax, similar to the `repo` argument of `remotes::install_github()`
path <- "maurolepore/tor/inst/extdata/mixed@master"

# Familiar interface, similar to `fs::dir_ls()`
ghr_ls(path)
#> [1] "inst/extdata/mixed/csv.csv"          
#> [2] "inst/extdata/mixed/lower_rdata.rdata"
#> [3] "inst/extdata/mixed/rda.rda"          
#> [4] "inst/extdata/mixed/upper_rdata.RData"
ghr_ls(path, regexp = "[.]csv$", invert = TRUE)
#> [1] "inst/extdata/mixed/lower_rdata.rdata"
#> [2] "inst/extdata/mixed/rda.rda"          
#> [3] "inst/extdata/mixed/upper_rdata.RData"

# Easily read data directly from GitHub into R
path %>% 
  ghr_ls_download_url(regexp = "[.]csv$") %>% 
  readr::read_csv()
#> Parsed with column specification:
#> cols(
#>   y = col_character()
#> )
#> # A tibble: 2 x 1
#>   y    
#>   <chr>
#> 1 a    
#> 2 b
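
For anyone wondering what `regexp` and `invert` do above: the filtering can be approximated in base R with `grep()`. This is only a sketch of the idea on a hard-coded vector of paths. `filter_paths()` is a hypothetical name, and ghr's actual implementation may differ.

```r
# Hypothetical helper mimicking the `regexp` / `invert` arguments shown above.
filter_paths <- function(paths, regexp = NULL, invert = FALSE) {
  # Without a pattern there is nothing to filter
  if (is.null(regexp)) {
    return(paths)
  }
  # value = TRUE returns the matching strings rather than their positions;
  # invert = TRUE keeps the strings that do NOT match
  grep(regexp, paths, value = TRUE, invert = invert)
}

# Paths copied from the ghr_ls() output above
paths <- c(
  "inst/extdata/mixed/csv.csv",
  "inst/extdata/mixed/lower_rdata.rdata",
  "inst/extdata/mixed/rda.rda",
  "inst/extdata/mixed/upper_rdata.RData"
)

filter_paths(paths, regexp = "[.]csv$")
filter_paths(paths, regexp = "[.]csv$", invert = TRUE)
```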

maurolepore commented 5 years ago

Motivation

A number of R packages provide access to open data by wrapping APIs. But using each API wrapper package usually requires understanding the underlying API. This makes sense because each API has its own features. But when it comes to simply finding, selecting, and reading open data, wouldn't it be nice to have a similar interface to access multiple open data sources?

Example

For example, the rdryad package is an R client for Dryad. Below is a super thin wrapper just to select and read data from Dryad (data; code):

Proposition

In isolation, this may not be very exciting. But it might be interesting to provide a collection of such thin wrappers for the most popular open data sources, so that once you learn how to pull data from one source you know how to do it from any source.
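
One way to picture that shared interface is a pair of S3 generics that every data source implements. The sketch below is purely illustrative: `od_ls()`, `od_read()`, and the in-memory `toy_source` class are hypothetical names, and a real backend (e.g. one built on rdryad) would replace the toy methods with calls to its API.

```r
# Hypothetical generics shared by every open data source
od_ls <- function(x, ...) UseMethod("od_ls")
od_read <- function(x, id, ...) UseMethod("od_read")

# A toy, in-memory "source" standing in for a real API wrapper
toy_source <- function(datasets) {
  structure(list(datasets = datasets), class = "toy_source")
}

# List the identifiers of the available datasets
od_ls.toy_source <- function(x, ...) {
  names(x$datasets)
}

# Read one dataset by identifier, failing loudly on unknown ids
od_read.toy_source <- function(x, id, ...) {
  if (!id %in% names(x$datasets)) {
    stop("Unknown dataset: ", id, call. = FALSE)
  }
  x$datasets[[id]]
}

src <- toy_source(list(letters = data.frame(y = c("a", "b"))))
od_ls(src)
od_read(src, "letters")
```

Under this design, supporting a new source would mean writing two methods for a new class, while users keep calling the same two functions.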

sharlagelfand commented 5 years ago

Toronto Open Data has a new online portal with an API for accessing data sets (e.g. see under "For Developers" here: https://portal0.cf.opendata.inter.sandbox-toronto.ca/dataset/king-st-transit-pilot-traffic-pedestrian-volumes-summary/) -- they even have some sample R code for getting data sets! It would be interesting to create a simple R wrapper for this.

maurolepore commented 5 years ago

Similar to @sharlagelfand's comment, here is some data from Costa Rica that @ronnyhdez recently mentioned:

https://twitter.com/RonnyHdezM/status/1103026568157372416

ronnyhdez commented 5 years ago

And this is the project, if you want to check how we are using the junr package.