OuhscBbmc / REDCapR

R utilities for interacting with REDCap
https://ouhscbbmc.github.io/REDCapR
Other
112 stars 45 forks source link

back to regexes in `checkbox_choices()` #504

Closed wibeasley closed 11 months ago

wibeasley commented 11 months ago

@BlairCooper, maybe a hybrid of the two approaches should replace #500. (These attempts are still in dev branches --nothing has been pulled into the main branch yet.)

Maybe we still use a two-stage approach, where sptrsplit() uses pipes to separate the choices/rows. Then use our own regex to separate id from label.

I gave up on readr:read_csv() for this purpose because I couldn't get it working if the label contained commas. I'd have to preprocess the entries to enclose the label in quotes for read_csv() to operate as intended --but we'd need a regex to do that. So we might as well use a single regex to separate them into columns and avoid read_csv(). Tell me if anyone disagrees or sees something I'm overlooking.

@BlairCooper, here's the heart of the function's current version (still on a dev branch). I tried to avoid using base::trimws() and had a working regex (ie, ^\\s*+(.+?),\\s*+(.*?)\\s*$). But I couldn't figure out how to drop empty rows, like your scenario in #502.

You mentioned that this approach is 0.5sec vs 0.3sec. I think I'm ok with that because moving data across the network always dwarfs the computation time on the client machine.

  pattern <- "^(.+?),\\s*+(.*)$"

  select_choices %>%
    strsplit(split = "|", fixed = TRUE) %>%
    magrittr::extract2(1) %>%
    base::trimws() %>%
    tibble::as_tibble() %>% # default column name is `value`
    dplyr::filter(.data$value != "") %>%
    dplyr::transmute(
      id    = sub(pattern, "\\1", .data$value, perl = TRUE),
      label = sub(pattern, "\\2", .data$value, perl = TRUE),
    )

Are there any scenarios this approach can't handle correctly? The only outstanding case is if the label has pipes (#503). In theory the strsplit() could use a regex to distinguish pipes between choices versus pipes within labels. I haven't figured out how though, because the id is so flexible (eg, letter(s) & number(s)).

@BlairCooper or anyone else, any thoughts on performance, accuracy, or maintainability?

BlairCooper commented 11 months ago

Looks like this works for all the scenarios except pipes in the label. For that to work it seems like you'd need to first split on the pipe, then check if each entry looks like a value,label pair (without any trimming). If it doesn't look like a pair its part of the label for the preceding option so add it back on.

Of course, throw in commas and pipes into a label and there's no hope.