eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

Add 'enforce_lapply' or similar to fparse/fload #64

Closed. tdeenes closed this issue 3 years ago.

tdeenes commented 3 years ago

It is a great feature of the package that one does not need to call lapply() on character vectors, since fparse()/fload() are already vectorized. However, this also means that the output type depends on whether the input contains a single element or multiple elements, e.g.:

inp2 <- as.character(1:2)
RcppSimdJson::fparse(inp2)
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2

inp1 <- as.character(1)
RcppSimdJson::fparse(inp1)
# [1] 1

So to be on the safe side, one has to check whether the input vector is of length 1 and, if so, wrap the result in a list, e.g.:

safe_fparse <- function(json, ...) {
  out <- RcppSimdJson::fparse(json, ...)
  if (length(json) == 1L) out <- list(out)
  out
}
safe_fparse("1")
# [[1]]
# [1] 1

What about adding an argument to fparse/fload which enforces that each element of the input json is wrapped in a list?

eddelbuettel commented 3 years ago

I wrestle with that in other contexts too where we have data downloaders. With multiple requests it makes sense to return a list around the individual sub-components. The question then is what to do for N == 1. I usually also 'unlist' if the length is one, for convenience. But I hear you -- it may make sense to add a toggle to not do that here so that we know we always have a list.
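
Roughly the pattern I mean, as a sketch (fetch_one() here is a made-up downloader, just for illustration):

# sketch of the downloader pattern described above; fetch_one() is hypothetical
fetch_many <- function(urls) {
  out <- lapply(urls, fetch_one)              # one result per request
  if (length(out) == 1L) out[[1L]] else out   # collapse for the single-request case
}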

Thoughts, @knapply ?

knapply commented 3 years ago

My gut is telling me safe_fparse() (or an equivalent that doesn't imply fparse() is unsafe 😅) will be fairly niche and better left to the user. I'd probably write it something like the following.

fparse_to_list <- function(json, ...) {
  if (length(json) == 1L) lapply(json, RcppSimdJson::fparse, ...) 
  else RcppSimdJson::fparse(json, ...)
}

I recall considering fparse()/fload() variants that would have been explicitly intended for multi-element vectors (and would always return a list), but decided the "vectorized" approach of having one function seemed more idiomatically R (for better, but maybe for worse). But we do diverge from the behavior of something like strsplit() (which always returns a list) because... 1) the majority of usage will be for single strings, and 2) you're far more likely to know the length of the vector you're passing to fparse()/fload() than the exact structure of the JSON... meaning you're almost always going to have to do some post-parse processing for anything but trivial schemata (or those that can be parsed directly to a "tidy" data.frame).
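
For reference, the strsplit() behavior I mean, which always wraps its results in a list even for a length-one input:

strsplit("a,b", ",")
# [[1]]
# [1] "a" "b"

strsplit(c("a,b", "c,d"), ",")
# [[1]]
# [1] "a" "b"
# 
# [[2]]
# [1] "c" "d"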

Another approach I recall considering was to provide a parser= parameter. The user could then create a parser object (an externalptr) that is passed to fparse()/fload() to reuse the parser. The usage pattern would then look something like the following...

my_parser <- parser()
lapply(many_json, fparse, parser = my_parser)

The issue here is that a full-on Rcpp modules interface would make more sense for anything that feels object-oriented-ish -- we'd want to be able to access or modify the parser's state (alive? capacity? max_depth?).

All of that said, I'm not exactly opposed to adding the option; I'm just not convinced it's worthwhile. I'll ponder some more, but @tdeenes do you have a more involved use case where this has been a pain to deal with?

tdeenes commented 3 years ago

Thanks for the details, @knapply. As for my use case: nowadays more and more RDBMSs support JSON-encoded columns (for example, Snowflake has dedicated data types for them). So if you query a table that has a JSON column, you get it as a character vector in R, which can be parsed to a list very efficiently by fparse(). Or, if you combine it with data.table and rbindlist(), you can easily parse it to a data.table having as many rows as the length of the JSON vector (probably this is what you meant by a "tidy" data.frame).
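
A minimal sketch of that workflow (the column contents are made up for illustration; the exact shape of each parsed element depends on the JSON and on max_simplify_lvl):

# 'json_col' stands in for a character vector queried from a JSON-typed column
json_col <- c(
  '{"id": 1, "val": "a"}',
  '{"id": 2, "val": "b"}'
)
parsed <- RcppSimdJson::fparse(json_col)          # one list element per row
dt <- data.table::rbindlist(parsed, fill = TRUE)  # one data.table row per JSON document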

If someone expects multi-element JSON vectors and fails to cover the one-element case with proper unit tests, the current defaults of fparse() can easily lead to disastrous bugs downstream. Fortunately I got the error in the development phase and could easily figure out what caused it, but I was not expecting the current (non-type-safe) behaviour.
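
A contrived illustration (extract_ids() is made up) of how code written against the multi-element case breaks for a single document:

extract_ids <- function(json) {
  parsed <- RcppSimdJson::fparse(json)
  vapply(parsed, function(x) x[["id"]], numeric(1))
}

extract_ids(c('{"id": 1}', '{"id": 2}'))  # fine: parsed is a list with one element per document
extract_ids('{"id": 1}')                  # errors: parsed is the single parsed object itself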

So in a nutshell, reflecting directly upon 1) and 2):

  1. I am sure you implemented and documented fparse() as a vectorized function for a good reason - the mere existence of its vectorized nature shows that this is not a niche use case.
  2. You are absolutely right, but I do not see why that would support an anti-strsplit behaviour. On the contrary, IMO it fully supports the view that your strsplit() example is a perfect analogue: just as with strsplit(), you do not know how many "parts" each element of your vector consists of, so it is much more consistent to return the result as a length-consistent list. If you know that you have a single element, you just call result[[1]] at the end, or do whatever customization your ETL process requires.

knapply commented 3 years ago

As the day has gone on, the inconsistency has come to seem more like a mistake. Thanks for talking it out; I'm sold.

@eddelbuettel, in terms of API, this can be accomplished either via an additional parameter like @tdeenes initially suggested, or as separate functions that wrap fparse()/fload() like the examples mentioned here (probably w/ a preference for explicit arguments instead of ...).

Before diving in, are there particularly strong feelings either way?

eddelbuettel commented 3 years ago

Or (and I'm just thinking out loud) fparse() and fload() can (and likely will) easily gain another boolean toggle. We could then also add (if we feel it makes sense) another layer in liststablefparse() (not a great name, but you get the idea). We could also do that second step 'later'.

tdeenes commented 3 years ago

Actually I do favor the additional-parameter-based approach (which would default to the current behavior at first and then switch to the consistent one after a deprecation period), but I do not like the name I suggested in the title of this thread. enforce_lapply sounds as if fparse() used the much less efficient lapply-based implementation. Maybe preserve_length or return_list are more explicit and also do not interfere with the max_simplify_lvl argument (which is orthogonal to the present issue).

eddelbuettel commented 3 years ago

as_list ? return_list ?

preserve_length sounds wrong, as we aren't about to add or remove elements, and enforce sounds too harsh.

knapply commented 3 years ago

always_list?

knapply commented 3 years ago

Try https://github.com/eddelbuettel/rcppsimdjson/pull/65 on for size...

# remotes::install_github("eddelbuettel/rcppsimdjson@issues/64")

inp2 <- as.character(1:2)
RcppSimdJson::fparse(inp2)
##> [[1]]
##> [1] 1
##> 
##> [[2]]
##> [1] 2
RcppSimdJson::fparse(inp2[[1L]], always_list = TRUE)
##> [[1]]
##> [1] 1
RcppSimdJson::fparse(c(some_name = inp2[[1L]]), always_list = TRUE)
##> $some_name
##> [1] 1