cmu-delphi / epidatr

Delphi Epidata API R Client
https://cmu-delphi.github.io/epidatr/
Other
1 stars 5 forks source link

consider optimizing package organization for autocomplete/user assist #10

Open krivard opened 3 years ago

krivard commented 3 years ago

There are a ton of endpoints, and most users will only use a small number of them.

Current organization is flat, with endpoints, fetchers, and utility functions all mixed together:

* package
  * afhsb [endpoint]
  * cdc [endpoint]
  * covidcast_meta [endpoint]
  * delphi [endpoint]
  * dengue_nowcast [endpoint]
  * dengue_sensors [endpoint]
  * ecdc_ili [endpoint]
  * epirange [utility]
  * fetch_classic [fetcher]
  * fetch_csv [fetcher]
  * fetch_df [fetcher]
  * fetch_json [fetcher]
  * flusurv [endpoint]
  * fluview_meta [endpoint]
  * gft [endpoint]
  * ght [endpoint]
  * kcdc_ili [endpoint]
  * meta [endpoint]
  * meta_afhsb [endpoint]
  * meta_norostat [endpoint]
  * nidss_dengue [endpoint]
  * norostat [endpoint]
  * nowcast [endpoint]
  * paho_dengue [endpoint]
  * quidel [endpoint]
  * sensors [endpoint]
  * with_base_url [utility]

This organization is easiest to implement and automatically generates matching man pages, but it does flood autocomplete, making it difficult to refer to utility functions.

There are several alternative organization schemes we could consider, all with pros and cons:

Named lists

* package
  * endpoint$
    * afhsb 
    * cdc 
    * covidcast_meta 
    * delphi 
    * dengue_nowcast 
    * dengue_sensors 
    * ecdc_ili 
    * flusurv 
    * fluview_meta 
    * gft 
    * ght 
    * kcdc_ili 
    * meta 
    * meta_afhsb 
    * meta_norostat 
    * nidss_dengue 
    * norostat 
    * nowcast 
    * paho_dengue 
    * quidel 
    * sensors 
  * epirange
  * fetch$
    * classic
    * csv
    * df
    * json
  * with_base_url

To make this work, we'd have to specify our own man files for the lists and their members.

Prefixes

We're already considering adding pvt_ for the restricted endpoints; we could easily switch to eg endpoint_covidcast. Autocomplete would still be flooded but it would be easier for users to tell what was what.

R6 objects

Not familiar wtih this; maybe @brookslogan can advise?

capnrefsmmat commented 3 years ago

It seems like the key requirements are

The current version of endpoints.R suggests that most of the functions are simply input validation and parameter formatting -- i.e. mostly static data specifying arguments and formats, so a user's request can be formatted appropriately.

Given the above, I see a new option besides the ones above (named lists, prefixes):

Consolidated fetching

If the endpoint functions are mostly static data + validation, why have one function per endpoint? Instead, have fetch_df, fetch_classic, etc., to be used like this:

epidata::fetch_classic("cdc", epiweeks, locations)

fetch_classic would then look up the cdc endpoint in a big list, determine what format the arguments should have and the appropriate endpoint, make the request, and return a "classic" formatted object.

The function namespace hence does not contain anything about the endpoints, and so it's easy to get the utility functions. But we can still put the data into the help namespace, by using roxygen:

#' CDC endpoint.
#'
#' @usage{fetch_classic("cdc", epiweeks, locations)}
#'
#' @param epiweeks ....
#'
#' @name cdc
NULL

This would still be fetchable with ?cdc or help("cdc"). There'd be no string autocomplete while writing the fetch_classic call, but that's fine.

We'd need to experiment a bit to ensure the Rd system will accept this, but I think it will if we figure out the right way to do it.

Side notes

I actually think the autocomplete issue is not a serious issue here. The number of top-level endpoints is not huge, and users presumably come to Epidata with a vague idea of what kind of data they want. The issue is much bigger for signals within specific endpoints. covidcast is the biggest example, with signals nested within sources. But I don't think autocomplete is the right option there either; I think some functions that bring up the relevant documentation sections would be more useful. For example, perhaps signal_search("hospitalization") produces a table of all signals about hospitalization, and then signal_help("hhs", ...) brings me to the documentation site for a specific signal.

sgratzl commented 3 years ago
epidata::fetch_classic("cdc", epiweeks, locations)

while this reduces the number of exposed function it also has some drawbacks

brookslogan commented 3 years ago

Regarding R6 objects: this is more or less the "named list" approach, but maybe with some utilities that make various types of documentation easier. (But maybe also not quite the behavior that we might want.)

Regarding named list approaches:

Regarding consolidated fetching:

Regarding current approach:

Regarding prefixes:

On side notes:

sgratzl commented 3 years ago

with #12 the current structure is

covidcast_api <- covidcast_epidata()
epicall <- covidcast_api$sources$`fb-survey`$signals$smoothed_cli$call("nation", "us", epirange(20210405, 20210410))
epicall %>% fetch_json()

the $call is needed since the object contains a bunch of other meta data information about this signal such as description, name, ...

brookslogan commented 3 years ago

Cool! It seems possible to get rid of the covidcast_api assignment line and make this one long $ chain at the cost of some additional implementation complexity and effort. Here's an example of how this might work:

[EDIT: we should check out promises using delayedAssign before this ad-hoc thing below.]

#' @export
api_thunk_wrapper = function(api_thunk) {
  stopifnot(is.function(api_thunk))
  env = new.env(parent = emptyenv())
  env[["api_thunk"]] <- api_thunk
  unclassed = list(env = env)
  api.thunk.wrapper = `class<-`(unclassed, "api_thunk_wrapper")
  api.thunk.wrapper
}

#' @export
force_api_thunk_in_wrapper = function(api.thunk.wrapper) {
  if (!inherits(api.thunk.wrapper, "api_thunk_wrapper")) {
    stop ('Expected api.thunk.wrapper to be an api_thunk_wrapper object')
  }
  env = unclass(api.thunk.wrapper)[["env"]]
  if (is.null(env[["result"]])) {
    env[["result"]] <- env[["api_thunk"]]()
  }
  invisible(api.thunk.wrapper)
}

#' @export
print.api_thunk_wrapper = function(x, ...) {
  force_api_thunk_in_wrapper(x)
  cat(sprintf("<api_thunk_wrapper>: Provides access to API contents via $ (partial matching disallowed), [[, and [; lazily fetches API metadata needed for these operations just on first use.  Top-level API contents are: %s\n", toString(names(x))), fill=TRUE)
}

#' @export
`$.api_thunk_wrapper` = function(x, name) {
  force_api_thunk_in_wrapper(x)
  ## Use `[[` instead of `$` to disable partial matching.  (Note that if partial matching were desired, forwarding `$` in the same way as `[[` doesn't seem to work, at least on R version 4.0.5 (2021-03-31).)
  unclass(x)[["env"]][["result"]][[name]]
}

#' @export
`[[.api_thunk_wrapper` = function(x, index) {
  force_api_thunk_in_wrapper(x)
  unclass(x)[["env"]][["result"]][[index]]
}

#' @export
`[.api_thunk_wrapper` = function(x, indices) {
  force_api_thunk_in_wrapper(x)
  unclass(x)[["env"]][["result"]][indices]
}

#' @export
`names.api_thunk_wrapper` = function(x) {
  force_api_thunk_in_wrapper(x)
  names(unclass(x)[["env"]][["result"]])
}

#' @export
as.list.api_thunk_wrapper = function(x, ...) {
  force_api_thunk_in_wrapper(x)
  as.list(unclass(x)[["env"]][["result"]])
}

#' @export
`$<-.api_thunk_wrapper` = function(x, name, entry) {
  stop('`$<-` is disabled for api_thunk_wrapper')
}

#' @export
`[[<-.api_thunk_wrapper` = function(x, index, entry) {
  stop('`[[<-` is disabled for api_thunk_wrapper')
}

#' @export
`[<-.api_thunk_wrapper` = function(x, indices, entries) {
  stop('`[<-` is disabled for api_thunk_wrapper')
}

#' @export
`names<-.api_thunk_wrapper` = function(x, names) {
  stop('`names<-` is disabled for api_thunk_wrapper')
}

Toy use below:

cc_api = api_thunk_wrapper(function() list(a=2,b=3,c=4))
cc_api$<tab> # completions work in ESS; haven't tried RStudio
cc_api$a # [1] 2

Actual use might be something like

#' @export
covidcast_api <- api_thunk_wrapper(covidcast_epidata)

and not exporting covidcast_epidata.

(This assumes no one will be running a really long R session and then trying to use new covidcast data sources that weren't created at the time they first used covidcast that R session.)


EDIT: additional methods to provide implementations for may include length and .DollarNames.

brookslogan commented 3 years ago

In a similar vein, the $call might be avoided by using something like a closure:

g = (function() { a = 2; function(x) x+a })()
environment(g)[["a"]] # [1] 2
g(4) # [1] 6
brookslogan commented 2 years ago

Noting here and above that there are already built-in mechanisms in R for thunks that look like "real" objects:

brookslogan commented 2 years ago

Messing around with covidcast_api = delphi.epidata::covidcast_epidata() above in spacemacs + ESS:

nmdefries commented 11 months ago

Progress so far:

Endpoints have been prefixed with either pvt_ or pub_, depending on required authentication. We've created a helper function covidcast_epidata() that generates a nested list object to help with autocomplete and signal/source discovery.

However, covidcast_epidata has some usability issues. It:

How much effort would it be to improve this?

brookslogan commented 11 months ago

Very cursory guesses

A couple more things that came to mind: