Open emilyriederer opened 5 years ago
Lots of useful datasets live on GitHub. The gh package is a minimalistic GitHub API client that lets users, among other things, find and read datasets from R, but its interface is unfamiliar to most R users. I propose wrapping the gh package to provide an easier-to-use tool for exploring files and reading data from GitHub into R.
The goal is to allow users to reuse what they already know about exploring files locally (e.g. with ls() or fs::dir_ls()) and about specifying GitHub addresses (e.g. devtools::install_github("r-lib/gh@master")).
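To make the address format concrete, here is a minimal sketch of how a string of the form owner/repo/path@ref could be split into its parts with base R. The helper name parse_gh_path() and the default ref of "master" are assumptions for illustration; this is not necessarily how ghr does it.

```r
# Hypothetical helper (not part of ghr): split "owner/repo/subdir@ref"
# into its components using base R only.
parse_gh_path <- function(path, default_ref = "master") {
  # Separate the optional "@ref" suffix from the rest of the address
  parts <- strsplit(path, "@", fixed = TRUE)[[1]]
  ref <- if (length(parts) > 1) parts[[2]] else default_ref

  # The first two slash-separated pieces are owner and repo;
  # anything after that is a path inside the repository
  pieces <- strsplit(parts[[1]], "/", fixed = TRUE)[[1]]
  if (length(pieces) < 2) {
    stop("Expected at least 'owner/repo', got: ", path, call. = FALSE)
  }

  list(
    owner  = pieces[[1]],
    repo   = pieces[[2]],
    subdir = paste(pieces[-(1:2)], collapse = "/"),
    ref    = ref
  )
}

parsed <- parse_gh_path("maurolepore/tor/inst/extdata/mixed@master")
str(parsed)
```

Keeping the parsing separate from the API call means the same address syntax could be reused by every function that takes a path.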
The ghr package implements the basic infrastructure and functionality. It is at an early stage of development, but I think it could reach a useful and robust state during chirunconf.
Here is a realistic application. And below is a toy example:
install.packages("devtools")
devtools::install_github("maurolepore/ghr")
library(purrr)
library(ghr)
# Familiar syntax, similar to the `repo` argument of `remotes::install_github()`
path <- "maurolepore/tor/inst/extdata/mixed@master"
# Familiar interface, similar to `fs::dir_ls()`
ghr_ls(path)
#> [1] "inst/extdata/mixed/csv.csv"
#> [2] "inst/extdata/mixed/lower_rdata.rdata"
#> [3] "inst/extdata/mixed/rda.rda"
#> [4] "inst/extdata/mixed/upper_rdata.RData"
ghr_ls(path, regexp = "[.]csv$", invert = TRUE)
#> [1] "inst/extdata/mixed/lower_rdata.rdata"
#> [2] "inst/extdata/mixed/rda.rda"
#> [3] "inst/extdata/mixed/upper_rdata.RData"
# Easily read data directly from GitHub into R
path %>%
  ghr_ls_download_url(regexp = "[.]csv$") %>%
  readr::read_csv()
#> Parsed with column specification:
#> cols(
#>   y = col_character()
#> )
#> # A tibble: 2 x 1
#>   y
#>   <chr>
#> 1 a
#> 2 b
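The regexp and invert arguments in the toy example mirror those of fs::dir_ls(). As a rough sketch of that filtering step alone, the same behavior can be had with base R's grep(). The function filter_paths() is hypothetical, and the paths are hard-coded copies of the listing above so that the example runs without network access.

```r
# Hypothetical helper (not part of ghr): keep or drop paths matching a regexp
filter_paths <- function(paths, regexp = NULL, invert = FALSE) {
  if (is.null(regexp)) {
    return(paths)
  }
  # value = TRUE returns the matching strings rather than their positions;
  # invert = TRUE returns the non-matching ones instead
  grep(regexp, paths, value = TRUE, invert = invert)
}

# Paths copied from the ghr_ls() output above
paths <- c(
  "inst/extdata/mixed/csv.csv",
  "inst/extdata/mixed/lower_rdata.rdata",
  "inst/extdata/mixed/rda.rda",
  "inst/extdata/mixed/upper_rdata.RData"
)

csv_only <- filter_paths(paths, regexp = "[.]csv$")
not_csv  <- filter_paths(paths, regexp = "[.]csv$", invert = TRUE)
```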
A number of R packages provide access to open data by wrapping APIs. But using each API wrapper package usually requires understanding the underlying API. This makes sense because each API has its own features. But when it comes to simply finding, selecting, and reading open data, wouldn't it be nice to have a similar interface for accessing multiple open data sources?
For example, the rdryad package is an R client for Dryad. Below is a super thin wrapper just to select and read data from Dryad (data; code):
In isolation, this may not be very exciting. But it might be interesting to provide a collection of such thin wrappers for the most popular open data sources, so that once you learn how to pull data from one source you know how to do it from any source.
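One way such a shared interface could be organized is with S3 generics, where each data source supplies its own methods. The sketch below is only an illustration: the names od_ls(), od_read(), and mock_source() are made up, and the "source" is an in-memory stand-in so that the example runs without any API. A real wrapper would add methods that call gh, rdryad, or another client.

```r
# Hypothetical generics shared by every data source
od_ls   <- function(src, ...) UseMethod("od_ls")
od_read <- function(src, path, ...) UseMethod("od_read")

# Fallback for sources that have no wrapper yet
od_ls.default <- function(src, ...) {
  stop("No od_ls() method for class: ", class(src)[[1]], call. = FALSE)
}

# In-memory stand-in for a real source (illustration only).
# `files` is a named character vector: names are paths, values are CSV text.
mock_source <- function(files) {
  structure(list(files = files), class = "mock_source")
}

# List the available paths, optionally filtered by a regular expression
od_ls.mock_source <- function(src, regexp = NULL, ...) {
  paths <- names(src$files)
  if (is.null(regexp)) paths else grep(regexp, paths, value = TRUE)
}

# Read one path as a data frame
od_read.mock_source <- function(src, path, ...) {
  utils::read.csv(text = src$files[[path]], stringsAsFactors = FALSE)
}

src <- mock_source(c("data/letters.csv" = "y\na\nb", "README.md" = "# notes"))
csv_paths <- od_ls(src, regexp = "[.]csv$")
dat <- od_read(src, csv_paths[[1]])
```

With this layout, user code calls od_ls() and od_read() the same way regardless of where the data lives; only the object passed as src changes.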
Toronto Open Data has a new online portal with an API for accessing data sets (e.g. see under "For Developers" here: https://portal0.cf.opendata.inter.sandbox-toronto.ca/dataset/king-st-transit-pilot-traffic-pedestrian-volumes-summary/) -- they even have some sample R code for getting data sets! It would be interesting to create a simple R wrapper for this.
Similar to @sharlagelfand's comment, here is some data from Costa Rica that @ronnyhdez recently mentioned:
And this is the project, if you want to check how we are using the junr package.
Do you have any favorite sources of open data for personal or professional projects? Would you like an easier way to quickly pull them into R? This issue can help collect ideas for possible APIs in need of an R wrapper package (or static open data sets that could be shared via an R package).