Benjamin's Initial Evaluation Brain Dump

nutterb commented 7 years ago

I apologize for the length but I wanted to make this available for review and discussion. I thought it would be good to get some feedback on ideas before I do any code work. i can break these up into separate issues when there's a specific aspect needing focus.

Also, be honest and direct in your responses. Part of this process will be transitioning me away from my redcapAPI mindset to the REDCapR mindset. I promise not to have hurt feelings if you don't want to incorporate anything in this list.

Observed Structure

Stand-alone functions

checkbox_choices
regex_named_captures
redcap_column_sanitize
replace_nas_with_explicit
The API Calls

This is the code I used to plot out how the existing API calls interact with each other. The blue boxes are my proposals for new functions. A child node indicates a function that is called within its parent.

    library(HydeNet)

    redcap <- 
      HydeNetwork(
        ~ redcap_metadata_read  + 
          redcap_read + 
          redcap_write + 
          redcap_download_file_oneshot + 
          redcap_upload_file_oneshot + 
          sanitize_token | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot +
          api_call | redcap_metadata_read * 
                           redcap_read * 
                           redcap_write * 
                           redcap_download_file_oneshot * 
                           redcap_upload_file_oneshot + 
          redcap_read_oneshot | redcap_read + 
          redcap_read_batch | redcap_read + 
          create_batch_glossary | redcap_read_batch *
                                  redcap_write_batch + 
          format_redcap_data | redcap_read + 
          redcap_write_oneshot | redcap_write +
          redcap_write_batch | redcap_write
      ) %>%
      setDecisionNodes("sanitize_token",
                       "format_redcap_data",
                       "api_call",
                       "redcap_read_batch",
                       "redcap_write_batch")

    plot(redcap)

redcap map

_Less complex Systems _

populate_project_simple calls redcap_project
retrieve_token_mssql calls retrieve_credential_mssql
retrieve_credential_local
validate_no_logical and validate_no_uppercase call validate_for_write
New functions

sanitize_token

As of 6.9.5,

The API is now more strict with regard to the validation of API tokens sent in API requests. In previous versions, if the token was longer than 32 characters, it would truncate the token to 32 characters (which is the expected length). It no longer truncates the token if longer than expected but merely returns an error message."

It wouldn't be hard to write a function to sanitize the token. It would look something like:

sanitize_token <- function(token){
  token <- trimws(token)
  token <- substr(token, 1, 32)
  token
}

This has the same effect as the code introduce through Issue #103 (https://github.com/OuhscBbmc/REDCapR/issues/103), and has the added benefit of providing consistency of tokens into earlier version of REDCap.

`api_call`

It might seem silly, but if the httr interface ever undergoes a change, or if there's ever a decision to change the package making the web call, putting all of the POST calls into one function means only have to change that one function, instead of finding all of the calls in the package.

`redcap_read_batch`

Would it simplify the interface for the user to have redcap_read be the driver for exporting data, and to make it a wrapper for redcap_read_oneshot and a new function redcap_read_batch? redcapAPI::exportRecords uses batch.size = -1 to indicate reading in one shot. Something similar could be done to direct traffic between the subfunctions without affecting backward compatibility (though redcap_read_oneshot would need to continue to be exported.)

`redcap_write_batch`

Similar concept to redcap_read_batch. Just make it one function to call for the user.

New Features

Data formatting: The meta data provide enough information to convert fields to factors, Date, or POSIXct objects, as appropriate. I propose a logical format_data argument to redcap_read that formats the data (similar to how redcapAPI formats) when TRUE. A new function format_data would be the engine to do this work.
Allow the meta data to be passed as an object to redcap_read and redcap_write. redcapAPI has a meta_data = NULL argument that allows the user to pass in the meta data data.frame if it exists. If given, there is no need to download the meta data again. If NULL, the meta data is downloaded. Not strictly necessary, but can save a bit of time. I'd consider this pretty low priority, because it is a feature that can be added without affect back compatibility.
New methods for exporting other data objects, such as the events table, the arms table, the mappings table. I don't know that these get used much, but they can be handy for formatting event names and such.
Provide internal validations prior to writing data. Honestly, I have mixed feelings about this. I put a lot of work into doing this in redcapAPI, but the REDCap validations return some pretty good messages as well.
Code Improvements

Sections of code like

events_collapsed <- ifelse(is.null(events), "", paste0(events, collapse=","))

can be optimized with

events_collapsed <- if (is.null(events)) "" else paste0(events, collapse = ",")

It isn't a big deal, (see benchmarking below, but as a package matures, I think it's a good thing to incorporate some efficiency gains)

library(microbenchmark)
library(magrittr)
records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = "")
)

## Warning in microbenchmark(ifelse = ifelse(is.null(records), "",
## paste0(records, : Could not measure a positive execution time for 6
## evaluations.

## Unit: nanoseconds
##       expr  min   lq    mean median   uq   max neval cld
##     ifelse 1155 1540 1991.12   1541 1926 11934   100   b
##  strict_if    0    1  397.44      1    1 26948   100  a

Other Notes

Since we're using RODBCext to manage credentials, it wouldn't be hard to write up instructions for acessing credentials from any SQL database. Users would just need to know their driver. We could probably make an adaptation that pulls from a CSV using the sqldf package. It would make a consistent style of management that would encourage best practices, even for users who don't have the benefit of SQL databases.
We should do some benchmarking to see if an non-sparse eav of a large data file is faster than the csv export. How large does a database have to be with the eav format before the server times out? If we find the eav to be faster, perhaps we could up the batch number from 100 to something larger (fewer calls to the API would mean less wait time for the user, but I suspect you would want to continue batching to make the server available to other users. Upping the default batch number seems like a good compromise.)

I should also add that at one point I had compiled a list of the API changes announced on the group. I was making notes on what changes I might incorporate into redcapAPI and how those changes could affect the R package. The link to the spreadsheet is https://docs.google.com/spreadsheets/d/1NMdpb-k5nvrVxF0gvnpIfQyP4vZIpQgUxoLUKJuv2L0/edit?usp=sharing

wibeasley commented 7 years ago

@nutterb, sorry for not responding to your thoughtful post until now. I've read it several times since September. In short, I like your ideas and think they're all worth discussing. Some of these are large/mature/encapsulated enough they're spun off into entire issues.

Some specific reactions.

New Functions

redcap_read_batch() & redcap_write_batch(). See #145. If I use your approach of batching within the function, your advice makes even more sense.
api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

But I certainly agree that there's a lot of duplication across functions like redcap-metadata-read.R() and redcap-read.R(). I'm mildly convinced that I'm on the right side of the tradeoffs, but certainly could be persuaded otherwise.
I'm phasing out retrieve_token_zzz() functions with retrieve_credential_zzz calls, so that removes a little complexity, and I think becomes aligned with your nested advice. The credential (as opposed to token) retrieval is more general and should work in a wider variety of scenarios.
I like your sanitize_token(). See new issue #144.

New Features

Data formatting & additional metadata: Where would this metadata be stored?
- In REDCap annotation field (ie, the big dumb box)?
- Passed by the programmer, similar to readr's col()'
- Passed by the programmer, reusing readr's col()' (and then using readr's text-to-tbl functionality?)
Spin this off into a new issue if you'd like.
Other API calls, like events. Sounds good.
internal validations before writing from the client to REDCap? Can you point to examples?

Code Improvements

As long as the test suite has appropriate coverage of these areas, and doesn't break, I'm all for it. I had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

I think the lowest hanging fruit could still be associated with structuring network calls so they're less chatty & wasteful. But I don't have any ideas beyond EAV; and even that hits a limit and needs batching.

Other Notes

I agree with your two points. So far the EAV playground has been beating the flat export almost every time, even with non-sparse data. But I'd love to know more and describe the performance envelope to users.

Thanks again for all the thought you put into this, @nutterb.

nutterb commented 7 years ago

api_call my thought is that would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like redcap_read() encounters the error (than if api_call()).

I think this is something that the checkmate package handles well. I've been using it extensively in another package. It uses an assertCollection object that can be passed between functions. The assertions can be reported from the main function and it will appear to the user that all of the errors are coming from the function they executed.

A crude, silly example is below. A less trivial example is the interaction between the sprinkle_bg function and the index_to_sprinkle

had no idea ifelse() was so inferior to the approach you benchmarked. Any idea if dplyr::if_else() is as slow? I'll check it out if you haven't.

The if and else approach is faster, but usually only advantageous when dealing with a single logical value. ifelse keeps copies of both the yes and no elements, does a bunch of error checking, and a number of other things that make it very useful for vectorized conditionals, but quite a bit slower. dplyr::if_else has the same limitations. Compare the results of the second code block at the bottom.

But as you mentioned, these kinds of optimizations are not going to make a big difference to the user. optimizing the calls to the API will help a lot, as will optimizing the construction of data sets from batches. This is just one of those things that I change when I see them.

passing error messages with `checkmate`

main_function <- function(x, y, z)
{
  coll <- checkmate::makeAssertCollection()

  checkmate::assert_numeric(x = x,
                            len = 1,
                            add = coll)

  x_add <- sub_function1(x, y, coll)

  res <- sub_function2(x_add, z, coll)

  checkmate::reportAssertions(coll)

  res
}

sub_function1 <- function(x, y, coll)
{

  checkmate::assert_numeric(x = y,
                            len = 1,
                            add = coll,
                            .var.name = "y")

  x + y

}

sub_function2 <- function(x, z, coll)
{
  checkmate::assert_character(x = z,
                              len = 1,
                              add = coll,
                              .var.name = "z")

  paste0(round(x, 2), " ", z)
}

main_function(1, 2, "units")

main_function(1, 2, c("km", "m"))

If else benchmarking

 records <- NULL
microbenchmark(
  ifelse = ifelse(is.null(records), "", paste0(records, collapse=",")),
  if_else = if_else(is.null(records), "", paste0(records, collapse=",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = ",")
)

x <- factor(sample(letters[1:5], 10, replace = TRUE))
microbenchmark(
  base = ifelse(x %in% c("a", "b", "c"), x, factor(NA)),
  dplyr = if_else(x %in% c("a", "b", "c"), x, factor(NA))
)

OuhscBbmc / REDCapR