nutterb opened this issue 7 years ago
@nutterb, sorry for not responding to your thoughtful post until now. I've read it several times since September. In short, I like your ideas and think they're all worth discussing. Some of these are large/mature/encapsulated enough they're spun off into entire issues.
Some specific reactions.
`redcap_read_batch()` & `redcap_write_batch()`: see #145. If I use your approach of batching within the function, your advice makes even more sense.
`api_call`: my thought is that it would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like `redcap_read()` encounters the error (than if `api_call()` does).
But I certainly agree that there's a lot of duplication across functions like those in redcap-metadata-read.R and redcap-read.R. I'm mildly convinced that I'm on the right side of the tradeoffs, but certainly could be persuaded otherwise.
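For concreteness, the single choke-point being weighed here might look something like the sketch below (the function name, arguments, and defaults are illustrative only, not REDCapR's actual internals):

```r
# Illustrative sketch only -- not actual REDCapR code.
# Centralizing every POST in one place means that if httr's interface
# ever changes, only this one function needs to be updated.
api_call <- function(redcap_uri, post_body, config_options = list()) {
  httr::POST(
    url    = redcap_uri,
    body   = post_body,       # named list of REDCap API parameters (token, content, ...)
    config = config_options,
    encode = "form"
  )
}
```

The tradeoff is the one described above: an error raised here is one stack frame removed from `redcap_read()`, so the calling function has to do extra work to report a message the user can act on.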
I'm phasing out the `retrieve_token_zzz()` functions in favor of `retrieve_credential_zzz()` calls, which removes a little complexity and, I think, aligns with your nesting advice. The credential (as opposed to token) retrieval is more general and should work in a wider variety of scenarios.
I like your `sanitize_token()`. See new issue #144.
Data formatting & additional metadata: where would this metadata be stored? Maybe something like `readr`'s `col()` (and then using `readr`'s text-to-tbl functionality)? Spin this off into a new issue if you'd like.
Other API calls, like events. Sounds good.
Internal validations before writing from the client to REDCap? Can you point to examples?
As long as the test suite has appropriate coverage of these areas, and doesn't break, I'm all for it. I had no idea `ifelse()` was so inferior to the approach you benchmarked. Any idea if `dplyr::if_else()` is as slow? I'll check it out if you haven't.
I think the lowest hanging fruit could still be associated with structuring network calls so they're less chatty & wasteful. But I don't have any ideas beyond EAV; and even that hits a limit and needs batching.
I agree with your two points. So far the EAV playground has been beating the flat export almost every time, even with non-sparse data. But I'd love to know more and describe the performance envelope to users.
Thanks again for all the thought you put into this, @nutterb.
> `api_call`: my thought is that it would create another layer of indirection that I'm more likely to mess up in the future. Also, I'm guessing it would be easier to return better error messages if something like `redcap_read()` encounters the error (than if `api_call()` does).
I think this is something that the `checkmate` package handles well. I've been using it extensively in another package. It uses an `assertCollection` object that can be passed between functions. The assertions can be reported from the main function, and it will appear to the user that all of the errors are coming from the function they executed.

A crude, silly example is below. A less trivial example is the interaction between the `sprinkle_bg` function and the `index_to_sprinkle` function.
> I had no idea `ifelse()` was so inferior to the approach you benchmarked. Any idea if `dplyr::if_else()` is as slow? I'll check it out if you haven't.
The `if` and `else` approach is faster, but usually only advantageous when dealing with a single logical value. `ifelse` keeps copies of both the `yes` and `no` elements, does a bunch of error checking, and a number of other things that make it very useful for vectorized conditionals, but quite a bit slower. `dplyr::if_else` has the same limitations. Compare the results of the second code block at the bottom.
But as you mentioned, these kinds of optimizations are not going to make a big difference to the user. Optimizing the calls to the API will help a lot, as will optimizing the construction of data sets from batches. This is just one of those things that I change when I see them.
```r
main_function <- function(x, y, z) {
  # Collect assertions from this function and its helpers, then
  # report them all at once from the function the user actually called.
  coll <- checkmate::makeAssertCollection()
  checkmate::assert_numeric(x = x,
                            len = 1,
                            add = coll)
  x_add <- sub_function1(x, y, coll)
  res   <- sub_function2(x_add, z, coll)
  checkmate::reportAssertions(coll)
  res
}

sub_function1 <- function(x, y, coll) {
  checkmate::assert_numeric(x = y,
                            len = 1,
                            add = coll,
                            .var.name = "y")
  x + y
}

sub_function2 <- function(x, z, coll) {
  checkmate::assert_character(x = z,
                              len = 1,
                              add = coll,
                              .var.name = "z")
  paste0(round(x, 2), " ", z)
}

main_function(1, 2, "units")       # succeeds
main_function(1, 2, c("km", "m"))  # the length-1 assertion on z fails,
                                   # and the error is reported from main_function
```
```r
library(microbenchmark)
library(dplyr)

records <- NULL
microbenchmark(
  ifelse    = ifelse(is.null(records), "", paste0(records, collapse = ",")),
  if_else   = if_else(is.null(records), "", paste0(records, collapse = ",")),
  strict_if = if (is.null(records)) "" else paste0(records, collapse = ",")
)

x <- factor(sample(letters[1:5], 10, replace = TRUE))
microbenchmark(
  base  = ifelse(x %in% c("a", "b", "c"), x, factor(NA)),
  dplyr = if_else(x %in% c("a", "b", "c"), x, factor(NA))
)
```
I apologize for the length, but I wanted to make this available for review and discussion. I thought it would be good to get some feedback on these ideas before I do any code work. I can break these up into separate issues when there's a specific aspect needing focus.

Also, be honest and direct in your responses. Part of this process will be transitioning me away from my `redcapAPI` mindset to the `REDCapR` mindset. I promise not to have hurt feelings if you don't want to incorporate anything in this list.

Observed Structure
Stand-alone functions

- `checkbox_choices`
- `regex_named_captures`
- `redcap_column_sanitize`
- `replace_nas_with_explicit`

The API Calls
This is the code I used to plot out how the existing API calls interact with each other. The blue boxes are my proposals for new functions. A child node indicates a function that is called within its parent.
_Less complex systems_

- `populate_project_simple` calls `redcap_project`
- `retrieve_token_mssql` calls `retrieve_credential_mssql`
- `retrieve_credential_local`
- `validate_no_logical` and `validate_no_uppercase` call `validate_for_write`
New functions

`sanitize_token`

As of REDCap 6.9.5, it wouldn't be hard to write a function to sanitize the token. It would look something like:
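A sketch along these lines, assuming a REDCap token is a 32-character hexadecimal string (the exact pattern and the error message are assumptions):

```r
# Sketch: trim surrounding whitespace (e.g., a trailing newline read from a
# credential file) and verify that a 32-character hexadecimal token remains.
sanitize_token <- function(token) {
  pattern <- "^[[:space:]]*([0-9A-Fa-f]{32})[[:space:]]*$"
  if (!grepl(pattern, token))
    stop("The token is not a valid 32-character hexadecimal value.")
  sub(pattern, "\\1", token)
}

sanitize_token("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n")  # returns the token without the newline
```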
This has the same effect as the code introduced through Issue #103 (https://github.com/OuhscBbmc/REDCapR/issues/103), and has the added benefit of providing consistent handling of tokens in earlier versions of REDCap.
`api_call`

It might seem silly, but if the `httr` interface ever undergoes a change, or if there's ever a decision to change the package making the web call, putting all of the `POST` calls into one function means you only have to change that one function, instead of finding all of the calls in the package.

`redcap_read_batch`
Would it simplify the interface for the user to have `redcap_read` be the driver for exporting data, and to make it a wrapper for `redcap_read_oneshot` and a new function `redcap_read_batch`? `redcapAPI::exportRecords` uses `batch.size = -1` to indicate reading in one shot. Something similar could be done to direct traffic between the subfunctions without affecting backward compatibility (though `redcap_read_oneshot` would need to continue to be exported).

`redcap_write_batch`
Similar concept to `redcap_read_batch`: just make it one function for the user to call.

New Features
- A `format_data` argument to `redcap_read` that formats the data (similar to how `redcapAPI` formats) when `TRUE`. A new function `format_data` would be the engine to do this work.
- A `meta_data` argument to `redcap_read` and `redcap_write`. `redcapAPI` has a `meta_data = NULL` argument that allows the user to pass in the meta data data.frame if it exists. If given, there is no need to download the meta data again. If `NULL`, the meta data is downloaded. Not strictly necessary, but it can save a bit of time. I'd consider this pretty low priority, because it is a feature that can be added without affecting backward compatibility.
- Internal validations before writing from the client to REDCap, similar to `redcapAPI`, but the REDCap validations return some pretty good messages as well.

Code Improvements
Sections of code that call `ifelse()` on a single logical value can be optimized with a strict `if`/`else`. It isn't a big deal (see the benchmarking below), but as a package matures, I think it's a good thing to incorporate some efficiency gains.
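Using the `records` example from the benchmark below (the variable name is illustrative), the rewrite looks like this; both forms produce the same value, but the strict `if`/`else` skips `ifelse()`'s vectorized machinery:

```r
records <- NULL  # e.g., no specific record IDs were requested

# Before: vectorized ifelse() applied to a single logical value
records_collapsed <- ifelse(is.null(records), "", paste0(records, collapse = ","))

# After: strict if/else evaluates only the branch that is needed
records_collapsed <- if (is.null(records)) "" else paste0(records, collapse = ",")
```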
Other Notes

- Using `RODBCext` to manage credentials, it wouldn't be hard to write up instructions for accessing credentials from any SQL database; users would just need to know their driver. We could probably make an adaptation that pulls from a CSV using the `sqldf` package. It would make a consistent style of management that would encourage best practices, even for users who don't have the benefit of SQL databases.
- I'm curious whether the `eav` export of a large data file is faster than the `csv` export. How large does a database have to be with the `eav` format before the server times out? If we find the `eav` to be faster, perhaps we could up the batch number from 100 to something larger (fewer calls to the API would mean less wait time for the user, but I suspect you would want to continue batching to keep the server available to other users; upping the default batch number seems like a good compromise).

I should also add that at one point I had compiled a list of the API changes announced on the group. I was making notes on what changes I might incorporate into `redcapAPI` and how those changes could affect the R package. The link to the spreadsheet is https://docs.google.com/spreadsheets/d/1NMdpb-k5nvrVxF0gvnpIfQyP4vZIpQgUxoLUKJuv2L0/edit?usp=sharing