Open krivard opened 3 years ago
It seems like the key requirements are
fetch_classic
, fetch_df
)The current version of endpoints.R suggests that most of the functions are simply input validation and parameter formatting -- i.e. mostly static data specifying arguments and formats, so a user's request can be formatted appropriately.
Given the above, I see a new option besides the ones above (named lists, prefixes):
If the endpoint functions are mostly static data + validation, why have one function per endpoint? Instead, have fetch_df
, fetch_classic
, etc., to be used like this:
epidata::fetch_classic("cdc", epiweeks, locations)
fetch_classic
would then look up the cdc
endpoint in a big list, determine what format the arguments should have and the appropriate endpoint, make the request, and return a "classic" formatted object.
The function namespace hence does not contain anything about the endpoints, and so it's easy to get the utility functions. But we can still put the data into the help namespace, by using roxygen:
#' CDC endpoint.
#'
#' @usage{fetch_classic("cdc", epiweeks, locations)}
#'
#' @param epiweeks ....
#'
#' @name cdc
NULL
This would still be fetchable with ?cdc
or help("cdc")
. There'd be no string autocomplete while writing the fetch_classic
call, but that's fine.
We'd need to experiment a bit to ensure the Rd system will accept this, but I think it will if we figure out the right way to do it.
I actually think the autocomplete issue is not a serious issue here. The number of top-level endpoints is not huge, and users presumably come to Epidata with a vague idea of what kind of data they want. The issue is much bigger for signals within specific endpoints. covidcast is the biggest example, with signals nested within sources. But I don't think autocomplete is the right option there either; I think some functions that bring up the relevant documentation sections would be more useful. For example, perhaps signal_search("hospitalization")
produces a table of all signals about hospitalization, and then signal_help("hhs", ...)
brings me to the documentation site for a specific signal.
epidata::fetch_classic("cdc", epiweeks, locations)
while this reduces the number of exposed function it also has some drawbacks
Regarding R6 objects: this is more or less the "named list" approach, but maybe with some utilities that make various types of documentation easier. (But maybe also not quite the behavior that we might want.)
print
functionality, not sure if it'd be exactly what we want. Roxygen has some special treatment of R6 classes now, but it's insufficient in my mind because it places all documentation in the class's topic rather than also documenting the members in their own topics.Regarding named list approaches:
?a$b
and help(a$b)
don't appear to work, and users may not think to try ?"a$b"
or help("a$b")
or ?b
or help(b)
, which I believe could work.Regarding consolidated fetching:
Regarding current approach:
@family
.Regarding prefixes:
On side notes:
covidcast$"jhu-csse"
will print documentation for jhu-csse including a signal list and maybe descriptions, and covidcast$"jhu-csse"$confirmed_incidence_num
prints documentation for that signal. This still provides corresponding autocomplete entries, although I don't think Emacs+ESS actually validly completes things when a hyphen or quotes or backquotes are involved. Then it would be an interface decision between options like
covidcast$"jhu-csse"$confirmed_incidence_num(<args>)
or
covidcast(covidcast$sources$"jhu-csse", covidcast$signals$"jhu-csse"$confirmed_incidence_num, <args>)
or
covidcast(covidcast$signals$"jhu-csse"$confirmed_incidence_num, <args>)
or variants like
covidcast$"jhu-csse"$confirmed_incidence_num$state$day(<args>)
with #12 the current structure is
covidcast_api <- covidcast_epidata()
epicall <- covidcast_api$sources$`fb-survey`$signals$smoothed_cli$call("nation", "us", epirange(20210405, 20210410))
epicall %>% fetch_json()
the $call
is needed since the object contains a bunch of other meta data information about this signal such as description, name, ...
Cool! It seems possible to get rid of the covidcast_api
assignment line and make this one long $ chain at the cost of some additional implementation complexity and effort. Here's an example of how this might work:
[EDIT: we should check out promises using delayedAssign
before this ad-hoc thing below.]
#' @export
api_thunk_wrapper = function(api_thunk) {
stopifnot(is.function(api_thunk))
env = new.env(parent = emptyenv())
env[["api_thunk"]] <- api_thunk
unclassed = list(env = env)
api.thunk.wrapper = `class<-`(unclassed, "api_thunk_wrapper")
api.thunk.wrapper
}
#' @export
force_api_thunk_in_wrapper = function(api.thunk.wrapper) {
if (!inherits(api.thunk.wrapper, "api_thunk_wrapper")) {
stop ('Expected api.thunk.wrapper to be an api_thunk_wrapper object')
}
env = unclass(api.thunk.wrapper)[["env"]]
if (is.null(env[["result"]])) {
env[["result"]] <- env[["api_thunk"]]()
}
invisible(api.thunk.wrapper)
}
#' @export
print.api_thunk_wrapper = function(x, ...) {
force_api_thunk_in_wrapper(x)
cat(sprintf("<api_thunk_wrapper>: Provides access to API contents via $ (partial matching disallowed), [[, and [; lazily fetches API metadata needed for these operations just on first use. Top-level API contents are: %s\n", toString(names(x))), fill=TRUE)
}
#' @export
`$.api_thunk_wrapper` = function(x, name) {
force_api_thunk_in_wrapper(x)
## Use `[[` instead of `$` to disable partial matching. (Note that if partial matching were desired, forwarding `$` in the same way as `[[` doesn't seem to work, at least on R version 4.0.5 (2021-03-31).)
unclass(x)[["env"]][["result"]][[name]]
}
#' @export
`[[.api_thunk_wrapper` = function(x, index) {
force_api_thunk_in_wrapper(x)
unclass(x)[["env"]][["result"]][[index]]
}
#' @export
`[.api_thunk_wrapper` = function(x, indices) {
force_api_thunk_in_wrapper(x)
unclass(x)[["env"]][["result"]][indices]
}
#' @export
`names.api_thunk_wrapper` = function(x) {
force_api_thunk_in_wrapper(x)
names(unclass(x)[["env"]][["result"]])
}
#' @export
as.list.api_thunk_wrapper = function(x, ...) {
force_api_thunk_in_wrapper(x)
as.list(unclass(x)[["env"]][["result"]])
}
#' @export
`$<-.api_thunk_wrapper` = function(x, name, entry) {
stop('`$<-` is disabled for api_thunk_wrapper')
}
#' @export
`[[<-.api_thunk_wrapper` = function(x, index, entry) {
stop('`[[<-` is disabled for api_thunk_wrapper')
}
#' @export
`[<-.api_thunk_wrapper` = function(x, indices, entries) {
stop('`[<-` is disabled for api_thunk_wrapper')
}
#' @export
`names<-.api_thunk_wrapper` = function(x, names) {
stop('`names<-` is disabled for api_thunk_wrapper')
}
Toy use below:
cc_api = api_thunk_wrapper(function() list(a=2,b=3,c=4))
cc_api$<tab> # completions work in ESS; haven't tried RStudio
cc_api$a # [1] 2
Actual use might be something like
#' @export
covidcast_api <- api_thunk_wrapper(covidcast_epidata)
and not exporting covidcast_epidata
.
(This assumes no one will be running a really long R session and then trying to use new covidcast data sources that weren't created at the time they first used covidcast that R session.)
EDIT: additional methods to provide implementations for may include length
and .DollarNames
.
In a similar vein, the $call
might be avoided by using something like a closure:
g = (function() { a = 2; function(x) x+a })()
environment(g)[["a"]] # [1] 2
g(4) # [1] 6
Noting here and above that there are already built-in mechanisms in R for thunks that look like "real" objects:
delayedAssign
, which are evaluated oncemakeActiveBinding
, which are re-evaluated each time that code "gets" the binding valueMessing around with covidcast_api = delphi.epidata::covidcast_epidata()
above in spacemacs + ESS:
covidcast_api$signals$<tab>
and typing in source names and keywords in the helm/ivy completion seems to work pretty well for exploring available signals. [--- but each completion option includes the source name twice --- this is #64] covidcast_api$sources$`jhu-csse`$signals$<tab>
or covidcast_api$sources$"jhu-csse"$signals$<tab>
doesn't work, but it does work on the hhs source. This seems like an ESS bug, but nevertheless does make sources
route much less useful since many source names contain hyphens.covidcast_api$sources$jhu-csse
, resulting in invalid code. Another ESS bug, easy but a little annoying for user to work around.Progress so far:
Endpoints have been prefixed with either pvt_
or pub_
, depending on required authentication. We've created a helper function covidcast_epidata()
that generates a nested list object to help with autocomplete and signal/source discovery.
However, covidcast_epidata
has some usability issues. It:
$call
is actually necessary probably, and it's maybe a little misnamed now that we fetch by defaultHow much effort would it be to improve this?
Very cursory guesses
covidcast_epidata()
uses, but for all endpoints. Would take maybe 4--10 days depending on what upstream metadata endpoint eventually looks like, and if any epidatr internals need overhauled. Only 1 or 1.5 vignettes probably need changed?names<-
implementation for the classes we use under the hood. But then we'd want to test on a few editors. If we can't make them all behave at the same time then probably we should just make sure it's reported upstream & abandon.$call
: 1.5 days? This would probably be changing endpoints from lists into functions with attributes storing the list data & with an S3 "subclass" to enable printing of this info.A couple more things that came to mind:
print.
methods for covidcast_epidata
related classes. Maybe 1 day. At least in my version of epidatr, they don't hint that they are actually of a special class, and don't instruct what to do with them. For example, endpoints print as if they are character vectors; I wouldn't know that I could use $call
to access the data.$call
objects have this interface
function (geo_type, geo_values, time_values, as_of = NULL, issues = NULL,
lag = NULL, fetch_args = fetch_args_list())
This appears to work for covidcast since each source only has one time_type
(is this true?). But will it work when we think about applying this globally or to future data?
There are a ton of endpoints, and most users will only use a small number of them.
Current organization is flat, with endpoints, fetchers, and utility functions all mixed together:
This organization is easiest to implement and automatically generates matching man pages, but it does flood autocomplete, making it difficult to refer to utility functions.
There are several alternative organization schemes we could consider, all with pros and cons:
Named lists
To make this work, we'd have to specify our own man files for the lists and their members.
Prefixes
We're already considering adding
pvt_
for the restricted endpoints; we could easily switch to egendpoint_covidcast
. Autocomplete would still be flooded but it would be easier for users to tell what was what.R6 objects
Not familiar wtih this; maybe @brookslogan can advise?