OuhscBbmc / REDCapR

R utilities for interacting with REDCap
https://ouhscbbmc.github.io/REDCapR
Other
112 stars 46 forks source link

Unexpected Behavior: Empty dataframe when reading from project with special characters #354

Closed lrasmus closed 2 years ago

lrasmus commented 2 years ago

We have loaded data that contains non-ASCII characters. When calling read, the data frame is coming back empty:

project <- redcap_project$new(redcap_uri = uri, token = token)

#'title' and some other fields cannot be read due to encoding issues, read only abstract and full text review information
df  <- project$read(fields = c(....))

I have traced this back that the httr::POST is correctly returning data from the API, but the call to httr::content is using the default UTF-8 encoding. The conversion fails, and the data is then returned as empty.

In the current read calls, it doesn't appear to allow overriding the encoding. I am happy to propose a fix and submit a PR, but was wondering if we want the user to always explicitly set the encoding, or if the package should try to detect the encoding using something like the uchardet package:

detect_raw_enc(response$content)
wibeasley commented 2 years ago

@lrasmus, thanks for digging into the problem.

  1. I completely agree with your dissection and that redcap_read_oneshot() (and redcap_read(), which calls redcap_read_oneshot()) could accept an encoding argument and pass it to the part of kernel_api() that you highlighted. If this enhancement would help you, then I'd love to accept a PR.
  2. Please send me a CSV that I can upload to a new test project for the REDCapR unit tests.
  3. So far, I haven't really encountered any non-ascii values (at least none that were intentionally non-ascii). What do you (and anyone else) think about allowing an automatic detection of encoding? Do you guys think it's an important capability (instead of forcing people to decide & specify each time they want something non-UTF-8)?
  4. If so, my initial preference is
    1. to allow redcap_read_oneshot() to pass through an encoding value (i.e., the PR you described above)
    2. the defaults remain "UTF-8". (I'm guessing the encoding can sometimes be falsely detected, so I'd prefer to be locked in by default)
    3. When the new encoding argument is explicitly set to NULL, then uchardet decides. Where does this happen --inside the kernel_api() function?
  5. I'm always reluctant to increase dependencies, but currently, the uchardet package would add only two dependencies: uchardet and indirectly tinytest (which doesn't have any dependencies outside parallel & utils, which install with R). All the other uchardet dependencies are existing REDCapR dependencies.
  6. Does anyone have ideas or experience for the unit tests? I experimented a little a few years ago, and got derailed because the same REDCap project represented Russian characters, depending if the client was Linux or Windows. I never came back to it. @teigentler reported similar discrepancies with German characters in #296.
  7. This doesn't seem like something that should require probablistic detection. Is encoding a setting in REDCap or even MySQL? If so, maybe a new API function should be added to REDCap so it reports the encoding (which then is used for subsequent executions of redcap_read_oneshot())? It should be the encoding of the column redcap_data.value. I saw some chatter on the forums about recent changes to MySQL & Maria & REDCap.
wibeasley commented 2 years ago

@lrasmus, I implemented your idea and added the http_response_encoding parameter to redcap_read() and redcap_read_oneshot().

Judging from the lack of responses, I'm guessing you and others don't feel too strongly for the need to add an automatic detection mechanism for encoding, so I went with the simpler route that doesn't require another package dependency.

If you or anyone else would like to open a new issue (and reference this one), I'm happy to revisit the issue. If you do, please send me a few simple-ish dictionaries & PHI-free datasets that can be included in the test suite.