cmu-delphi / epidatr

Delphi Epidata API R Client

https://cmu-delphi.github.io/epidatr/

Other

1 stars 5 forks source link

consider optimizing package organization for autocomplete/user assist #10

Open krivard opened 3 years ago

krivard commented 3 years ago

There are a ton of endpoints, and most users will only use a small number of them.

Current organization is flat, with endpoints, fetchers, and utility functions all mixed together:

* package
  * afhsb [endpoint]
  * cdc [endpoint]
  * covidcast_meta [endpoint]
  * delphi [endpoint]
  * dengue_nowcast [endpoint]
  * dengue_sensors [endpoint]
  * ecdc_ili [endpoint]
  * epirange [utility]
  * fetch_classic [fetcher]
  * fetch_csv [fetcher]
  * fetch_df [fetcher]
  * fetch_json [fetcher]
  * flusurv [endpoint]
  * fluview_meta [endpoint]
  * gft [endpoint]
  * ght [endpoint]
  * kcdc_ili [endpoint]
  * meta [endpoint]
  * meta_afhsb [endpoint]
  * meta_norostat [endpoint]
  * nidss_dengue [endpoint]
  * norostat [endpoint]
  * nowcast [endpoint]
  * paho_dengue [endpoint]
  * quidel [endpoint]
  * sensors [endpoint]
  * with_base_url [utility]

This organization is easiest to implement and automatically generates matching man pages, but it does flood autocomplete, making it difficult to refer to utility functions.

There are several alternative organization schemes we could consider, all with pros and cons:

Named lists
Prefixes
R6 objects

Named lists

* package
  * endpoint$
    * afhsb 
    * cdc 
    * covidcast_meta 
    * delphi 
    * dengue_nowcast 
    * dengue_sensors 
    * ecdc_ili 
    * flusurv 
    * fluview_meta 
    * gft 
    * ght 
    * kcdc_ili 
    * meta 
    * meta_afhsb 
    * meta_norostat 
    * nidss_dengue 
    * norostat 
    * nowcast 
    * paho_dengue 
    * quidel 
    * sensors 
  * epirange
  * fetch$
    * classic
    * csv
    * df
    * json
  * with_base_url

To make this work, we'd have to specify our own man files for the lists and their members.

Prefixes

We're already considering adding pvt_ for the restricted endpoints; we could easily switch to eg endpoint_covidcast. Autocomplete would still be flooded but it would be easier for users to tell what was what.

R6 objects

Not familiar wtih this; maybe @brookslogan can advise?

capnrefsmmat commented 3 years ago

It seems like the key requirements are

Make it easy to find the name of an endpoint or utility function via autocomplete
Make it easy to look up documentation on endpoints, including the required parameters and format of the output
Allow users to select the format of output (such as fetch_classic, fetch_df)

The current version of endpoints.R suggests that most of the functions are simply input validation and parameter formatting -- i.e. mostly static data specifying arguments and formats, so a user's request can be formatted appropriately.

Given the above, I see a new option besides the ones above (named lists, prefixes):

Consolidated fetching

If the endpoint functions are mostly static data + validation, why have one function per endpoint? Instead, have fetch_df, fetch_classic, etc., to be used like this:

epidata::fetch_classic("cdc", epiweeks, locations)

fetch_classic would then look up the cdc endpoint in a big list, determine what format the arguments should have and the appropriate endpoint, make the request, and return a "classic" formatted object.

The function namespace hence does not contain anything about the endpoints, and so it's easy to get the utility functions. But we can still put the data into the help namespace, by using roxygen:

#' CDC endpoint.
#'
#' @usage{fetch_classic("cdc", epiweeks, locations)}
#'
#' @param epiweeks ....
#'
#' @name cdc
NULL

This would still be fetchable with ?cdc or help("cdc"). There'd be no string autocomplete while writing the fetch_classic call, but that's fine.

We'd need to experiment a bit to ensure the Rd system will accept this, but I think it will if we figure out the right way to do it.

Side notes

I actually think the autocomplete issue is not a serious issue here. The number of top-level endpoints is not huge, and users presumably come to Epidata with a vague idea of what kind of data they want. The issue is much bigger for signals within specific endpoints. covidcast is the biggest example, with signals nested within sources. But I don't think autocomplete is the right option there either; I think some functions that bring up the relevant documentation sections would be more useful. For example, perhaps signal_search("hospitalization") produces a table of all signals about hospitalization, and then signal_help("hhs", ...) brings me to the documentation site for a specific signal.

sgratzl commented 3 years ago

epidata::fetch_classic("cdc", epiweeks, locations)

while this reduces the number of exposed function it also has some drawbacks

the logic for checking the arguments is more complex
it will mix parameters (epiweeks) with data fetching options (fields, disable_date_parsing)

brookslogan commented 3 years ago

Regarding R6 objects: this is more or less the "named list" approach, but maybe with some utilities that make various types of documentation easier. (But maybe also not quite the behavior that we might want.)

R6 works more along the lines of Java; objects are environments (with a special S3 class); they have reference semantics rather than copy-on-write, and can have public and private members. I think R6 objects and their member functions may already have some special print functionality, not sure if it'd be exactly what we want. Roxygen has some special treatment of R6 classes now, but it's insufficient in my mind because it places all documentation in the class's topic rather than also documenting the members in their own topics.

Regarding named list approaches:

A drawback here is that ?a$b and help(a$b) don't appear to work, and users may not think to try ?"a$b" or help("a$b") or ?b or help(b), which I believe could work.

Regarding consolidated fetching:

My opinion is that this would be a step backward. The endpoints are heterogeneous, so the fetching function argument name autocomplete can't tell them what precisely the arguments should be, and it seems messier to handle. Flooding autocomplete is probably better than no autocomplete.

Regarding current approach:

Maybe the documentation side could be partially addressed with roxygen features like @family.
Yes, maybe there aren't that many members.

Regarding prefixes:

"pvt_": I think this is a good idea if there are a lot of private endpoints that aren't shared outside Delphi except for maybe the data provider themselves (think this may include afhsb, quidel, cdc, twitter, ... --- but now I see now that some of these aren't even on the list of endpoints to begin with; maybe there aren't actually a lot)
In general: I think this approach and the currently-implemented approach are probably the top candidates.

On side notes:

Yes, signal completion for covidcast is probably a bigger issue. I don't think this is autocomplete VS. bringing up documentation necessarily. You can set it up so that covidcast$"jhu-csse" will print documentation for jhu-csse including a signal list and maybe descriptions, and covidcast$"jhu-csse"$confirmed_incidence_num prints documentation for that signal. This still provides corresponding autocomplete entries, although I don't think Emacs+ESS actually validly completes things when a hyphen or quotes or backquotes are involved. Then it would be an interface decision between options like
```
covidcast$"jhu-csse"$confirmed_incidence_num(<args>)
```
or
```
covidcast(covidcast$sources$"jhu-csse", covidcast$signals$"jhu-csse"$confirmed_incidence_num, <args>)
```
or
```
covidcast(covidcast$signals$"jhu-csse"$confirmed_incidence_num, <args>)
```
or variants like
```
covidcast$"jhu-csse"$confirmed_incidence_num$state$day(<args>)
```

sgratzl commented 3 years ago

with #12 the current structure is

covidcast_api <- covidcast_epidata()
epicall <- covidcast_api$sources$`fb-survey`$signals$smoothed_cli$call("nation", "us", epirange(20210405, 20210410))
epicall %>% fetch_json()

the $call is needed since the object contains a bunch of other meta data information about this signal such as description, name, ...

brookslogan commented 3 years ago

Cool! It seems possible to get rid of the covidcast_api assignment line and make this one long $ chain at the cost of some additional implementation complexity and effort. Here's an example of how this might work:

[EDIT: we should check out promises using delayedAssign before this ad-hoc thing below.]

#' @export
api_thunk_wrapper = function(api_thunk) {
  stopifnot(is.function(api_thunk))
  env = new.env(parent = emptyenv())
  env[["api_thunk"]] <- api_thunk
  unclassed = list(env = env)
  api.thunk.wrapper = `class<-`(unclassed, "api_thunk_wrapper")
  api.thunk.wrapper
}

#' @export
force_api_thunk_in_wrapper = function(api.thunk.wrapper) {
  if (!inherits(api.thunk.wrapper, "api_thunk_wrapper")) {
    stop ('Expected api.thunk.wrapper to be an api_thunk_wrapper object')
  }
  env = unclass(api.thunk.wrapper)[["env"]]
  if (is.null(env[["result"]])) {
    env[["result"]] <- env[["api_thunk"]]()
  }
  invisible(api.thunk.wrapper)
}

#' @export
print.api_thunk_wrapper = function(x, ...) {
  force_api_thunk_in_wrapper(x)
  cat(sprintf("<api_thunk_wrapper>: Provides access to API contents via $ (partial matching disallowed), [[, and [; lazily fetches API metadata needed for these operations just on first use.  Top-level API contents are: %s\n", toString(names(x))), fill=TRUE)
}

#' @export
`$.api_thunk_wrapper` = function(x, name) {
  force_api_thunk_in_wrapper(x)
  ## Use `[[` instead of `$` to disable partial matching.  (Note that if partial matching were desired, forwarding `$` in the same way as `[[` doesn't seem to work, at least on R version 4.0.5 (2021-03-31).)
  unclass(x)[["env"]][["result"]][[name]]
}

#' @export
`[[.api_thunk_wrapper` = function(x, index) {
  force_api_thunk_in_wrapper(x)
  unclass(x)[["env"]][["result"]][[index]]
}

#' @export
`[.api_thunk_wrapper` = function(x, indices) {
  force_api_thunk_in_wrapper(x)
  unclass(x)[["env"]][["result"]][indices]
}

#' @export
`names.api_thunk_wrapper` = function(x) {
  force_api_thunk_in_wrapper(x)
  names(unclass(x)[["env"]][["result"]])
}

#' @export
as.list.api_thunk_wrapper = function(x, ...) {
  force_api_thunk_in_wrapper(x)
  as.list(unclass(x)[["env"]][["result"]])
}

#' @export
`$<-.api_thunk_wrapper` = function(x, name, entry) {
  stop('`$<-` is disabled for api_thunk_wrapper')
}

#' @export
`[[<-.api_thunk_wrapper` = function(x, index, entry) {
  stop('`[[<-` is disabled for api_thunk_wrapper')
}

#' @export
`[<-.api_thunk_wrapper` = function(x, indices, entries) {
  stop('`[<-` is disabled for api_thunk_wrapper')
}

#' @export
`names<-.api_thunk_wrapper` = function(x, names) {
  stop('`names<-` is disabled for api_thunk_wrapper')
}

Toy use below:

cc_api = api_thunk_wrapper(function() list(a=2,b=3,c=4))
cc_api$<tab> # completions work in ESS; haven't tried RStudio
cc_api$a # [1] 2

Actual use might be something like

#' @export
covidcast_api <- api_thunk_wrapper(covidcast_epidata)

and not exporting covidcast_epidata.

(This assumes no one will be running a really long R session and then trying to use new covidcast data sources that weren't created at the time they first used covidcast that R session.)

EDIT: additional methods to provide implementations for may include length and .DollarNames.

brookslogan commented 3 years ago

In a similar vein, the $call might be avoided by using something like a closure:

g = (function() { a = 2; function(x) x+a })()
environment(g)[["a"]] # [1] 2
g(4) # [1] 6

brookslogan commented 2 years ago

Noting here and above that there are already built-in mechanisms in R for thunks that look like "real" objects:

promises, e.g., using delayedAssign, which are evaluated once
active bindings, e.g., using makeActiveBinding, which are re-evaluated each time that code "gets" the binding value

brookslogan commented 2 years ago

Messing around with covidcast_api = delphi.epidata::covidcast_epidata() above in spacemacs + ESS:

Starting completion via covidcast_api$signals$<tab> and typing in source names and keywords in the helm/ivy completion seems to work pretty well for exploring available signals. [--- but each completion option includes the source name twice --- this is #64]
Starting completion via covidcast_api$sources$`jhu-csse`$signals$<tab> or covidcast_api$sources$"jhu-csse"$signals$<tab> doesn't work, but it does work on the hhs source. This seems like an ESS bug, but nevertheless does make sources route much less useful since many source names contain hyphens.
Actually completing a signal or source name with hyphens or other problematic characters doesn't properly quote it (e.g., giving covidcast_api$sources$jhu-csse, resulting in invalid code. Another ESS bug, easy but a little annoying for user to work around.
[This may already be addressed by fix for #43] Having print implementations for intermediate results would be very helpful (especially given some of the ESS bugs above).

nmdefries commented 11 months ago

Progress so far:

Endpoints have been prefixed with either pvt_ or pub_, depending on required authentication. We've created a helper function covidcast_epidata() that generates a nested list object to help with autocomplete and signal/source discovery.

However, covidcast_epidata has some usability issues. It:

isn't comprehensive
is a function not a promise or active binding, making autocomplete a pain to start; the user has to assign its output to a variable first
still has autocomplete chains broken in many editors due to special characters not triggering quoting on completion
completion chains could potentially be tightened up... I don't think $call is actually necessary probably, and it's maybe a little misnamed now that we fetch by default
(could maybe benefit from disease-based listings, etc.)

How much effort would it be to improve this?

brookslogan commented 11 months ago

Very cursory guesses

Comprehensiveness. Blocked on the usable documentation (is this the right name?) project which would provide a metadata endpoint similar to what covidcast_epidata() uses, but for all endpoints. Would take maybe 4--10 days depending on what upstream metadata endpoint eventually looks like, and if any epidatr internals need overhauled. Only 1 or 1.5 vignettes probably need changed?
Function -> promise or active binding: 1--4 days. Could potentially be a very small or simple code change. Testing probably takes more time. Only 1 or 1.5 vignettes probably need changed.
- Potential complication: are we okay with not detecting newly released (meta)data from the API until a new session (delayedAssign), or pinging the metadata endpoint a lot (active binding)? Or will we want to use a cache system with an active binding?
Editor stuff: 3 days or abandon? If we can fix this in epidatr, it's probably via some simple tweaks to a names<- implementation for the classes we use under the hood. But then we'd want to test on a few editors. If we can't make them all behave at the same time then probably we should just make sure it's reported upstream & abandon.
Completion chain restructuring:
- Removing $call: 1.5 days? This would probably be changing endpoints from lists into functions with attributes storing the list data & with an S3 "subclass" to enable printing of this info.
Disease-based listings, etc.: maybe also blocked by other projects that aren't underway. We want to reorganize docs for the web site & also reveal that info via metadata endpoints. Then we could update the interface to take advantage of whatever breakdowns are provided.

A couple more things that came to mind:

Improve print. methods for covidcast_epidata related classes. Maybe 1 day. At least in my version of epidatr, they don't hint that they are actually of a special class, and don't instruct what to do with them. For example, endpoints print as if they are character vectors; I wouldn't know that I could use $call to access the data.
(Think about) Future-proof(ing) for uniformity: maybe 2--15 days? Current $call objects have this interface
```
function (geo_type, geo_values, time_values, as_of = NULL, issues = NULL, 
lag = NULL, fetch_args = fetch_args_list()) 
```
This appears to work for covidcast since each source only has one time_type (is this true?). But will it work when we think about applying this globally or to future data?