CHOP-CGTInformatics / REDCapTidieR

Makes it easy to read REDCap Projects into R
https://chop-cgtinformatics.github.io/REDCapTidieR/
Other
33 stars 8 forks source link

[BUG] `yesno` fields are converted to `NA` when the field contains missing data codes #181

Closed pbilodeau94 closed 7 months ago

pbilodeau94 commented 7 months ago

Current Behavior

When using read_redcap, all yes/no variables are all listed as N/A. Dummy factor variables for multiple checkboxes get pulled correctly as TRUE/FALSE.

How to Reproduce the Bug:

nmosd <- read_redcap(url, token) %>% make_labelled(format_labels = fmt_strip_trailing_colon) %>% bind_tibbles()

ezraporter commented 7 months ago

Thanks for the bug report! I wasn't able to reproduce this issue with the test REDCap projects we have which include yesno variables. Could you try to provide a minimal reproducible example? That could look like:

  1. The token and URI for an existing redcap project that shows the issue if you're willing and able to share it.
  2. The dataframe output of the following two calls on a redcap project that shows the issue:
REDCapR::redcap_metadata_read(redcap_uri = [YOUR_URI], token = [YOUR_TOKEN])

REDCapR::redcap_read_oneshot(
    redcap_uri = [YOUR_URI],
    token = [YOUR_TOKEN],
    export_survey_fields = TRUE,
    export_data_access_groups = TRUE,
    guess_max = .Machine$integer.max
)

It would also be helpful if you could provide the results of running sessionInfo() after running the code to produce the bug.

pbilodeau94 commented 7 months ago

Thanks for your quick response! I am not able to share the token and URL since there is PHI in the database. When I run this, it looks like those variables do accurately get pulled, unlike when using read_redcap.

ezraporter commented 7 months ago

Are you able to share what those results look like after manually removing PHI? Alternatively, if you have the ability to create REDCap projects at your institution, are you able to create an example that demonstrates the bug without PHI?

Since it sounds like downloading the data from REDCap isn't where the bug is originating, it would be helpful to see your sessionInfo() so I can check if this is specific to particular package versions.

pbilodeau94 commented 7 months ago

Here’s what it looks like with read_redcap. mri_relapse and gad_relapse are N/A for all record IDs. [A screenshot of a computer Description automatically generated]

Here’s what it looks like when using REDCapR directly. data_mri_relapse and data_gad_relapse are accurately pulled. [A screenshot of a computer Description automatically generated]

Here’s sessionInfo

R version 4.3.3 (2024-02-29) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Sonoma 14.3.1

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] Hmisc_5.1-2 redcapAPI_2.8.4 DemographicTable_0.1.6 scales_1.3.0 ggrepel_0.9.5 [6] officer_0.6.5 flextable_0.9.5 broom_1.0.5 kableExtra_1.4.0 skimr_2.1.5 [11] labelled_2.12.0 knitr_1.45 REDCapR_1.1.0 REDCapTidieR_1.0.0 lubridate_1.9.3 [16] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 [21] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0

loaded via a namespace (and not attached): [1] gridExtra_2.3 rlang_1.1.3 magrittr_2.0.3 e1071_1.7-14 compiler_4.3.3 [6] getPass_0.2-4 systemfonts_1.0.6 vctrs_0.6.5 httpcode_0.3.0 pkgconfig_2.0.3 [11] crayon_1.5.2 fastmap_1.1.1 backports_1.4.1 ellipsis_0.3.2 utf8_1.2.4 [16] promises_1.2.1 rmarkdown_2.26 tzdb_0.4.0 haven_2.5.4 ragg_1.3.0 [21] bit_4.0.5 xfun_0.42 jsonlite_1.8.8 later_1.3.2 uuid_1.2-0 [26] parallel_4.3.3 cluster_2.1.6 R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3 [31] rpart_4.1.23 Rcpp_1.0.12 assertthat_0.2.1 lobstr_1.1.2 base64enc_0.1-3 [36] nnet_7.3-19 httpuv_1.6.14 timechange_0.3.0 tidyselect_1.2.1 rstudioapi_0.15.0 [41] yaml_2.3.8 curl_5.2.1 shiny_1.8.0 withr_3.0.0 askpass_1.2.0 [46] evaluate_0.23 foreign_0.8-86 proxy_0.4-27 zip_2.3.1 xml2_1.3.6 [51] formattable_0.2.1 pillar_1.9.0 checkmate_2.3.1 generics_0.1.3 vroom_1.6.5 [56] hms_1.1.3 munsell_0.5.0 chron_2.3-61 xtable_1.8-4 class_7.3-22 [61] glue_1.7.0 gdtools_0.3.7 tools_4.3.3 gfonts_0.2.0 data.table_1.15.2 [66] grid_4.3.3 colorspace_2.1-0 repr_1.1.6 htmlTable_2.4.2 Formula_1.2-5 [71] cli_3.6.2 labelVector_0.1.2 textshaping_0.3.7 fontBitstreamVera_0.1.1 fansi_1.0.6 [76] viridisLite_0.4.2 svglite_2.1.3 keyring_1.3.2 gtable_0.3.4 digest_0.6.35 [81] fontquiver_0.2.1 crul_1.4.0 htmlwidgets_1.6.4 htmltools_0.5.7 lifecycle_1.0.4 [86] httr_1.4.7 mime_0.12 fontLiberation_0.1.0 openssl_2.1.1 bit64_4.0.5

From: Ezra Porter @.> Date: Monday, March 18, 2024 at 4:37 PM To: CHOP-CGTInformatics/REDCapTidieR @.> Cc: pbilodeau94 @.>, Author @.> Subject: Re: [CHOP-CGTInformatics/REDCapTidieR] [BUG] yesno variables are pulled as NA (Issue #181)

Are you able to share what those results look like after manually removing PHI? Alternatively, if you have the ability to create REDCap projects at your institution, are you able to create an example that demonstrates the bug without PHI?

Since it sounds like downloading the data from REDCap isn't where the bug is originating, it would be helpful to see your sessionInfo() so I can check if this is specific to particular package versions.

— Reply to this email directly, view it on GitHubhttps://github.com/CHOP-CGTInformatics/REDCapTidieR/issues/181#issuecomment-2004941801, or unsubscribehttps://github.com/notifications/unsubscribe-auth/BGXP2ZQ575T4AN5RNZY4H63YY5GANAVCNFSM6AAAAABE4FK7W6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBUHE2DCOBQGE. You are receiving this because you authored the thread.Message ID: @.***>

ezraporter commented 7 months ago

Thanks! I don't think the screenshots made it through. Can you try to resend them? You may need to use the github.com UI instead of email for it to work

pbilodeau94 commented 7 months ago

read_redcap : Screenshot 2024-03-18 at 5 22 34 PM

RedcapR: Screenshot 2024-03-18 at 5 23 46 PM

ezraporter commented 7 months ago

Ah okay the issue is with those "UNK" values, which I guess your project is getting through the Missing Data Codes customization. I have to think a little more about how we should handle these but agree the current behavior isn't right so we'll implement some fix.

If you have a second to think about it, what would your expectation as a user be for a yesno field that has UNK values? Convert the field to a factor with levels Yes, No, UNK? Convert UNK to NA and have the field remain TRUE/FALSE? Something else?

pbilodeau94 commented 7 months ago

Ah yes that makes sense. Yes, I think converting UNK to NA and keeping it as TRUE/FALSE makes the most sense.

From: Ezra Porter @.> Date: Monday, March 18, 2024 at 5:54 PM To: CHOP-CGTInformatics/REDCapTidieR @.> Cc: pbilodeau94 @.>, Author @.> Subject: Re: [CHOP-CGTInformatics/REDCapTidieR] [BUG] yesno variables are pulled as NA (Issue #181)

Ah okay the issue is with those "UNK" values, which I guess your project is getting through the Missing Data Codeshttps://kb.wisc.edu/smph/informatics/page.php?id=108107 customization. I have to think a little more about how we should handle these but agree the current behavior isn't right so we'll implement some fix.

If you have a second to think about it, what would your expectation as a user be for a yesno field that has UNK values? Convert the field to a factor with levels Yes, No, UNK? Convert UNK to NA and have the field remain TRUE/FALSE? Something else?

— Reply to this email directly, view it on GitHubhttps://github.com/CHOP-CGTInformatics/REDCapTidieR/issues/181#issuecomment-2005112835, or unsubscribehttps://github.com/notifications/unsubscribe-auth/BGXP2ZTCAQIFKTYXCBGLQRDYY5PCTAVCNFSM6AAAAABE4FK7W6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBVGEYTEOBTGU. You are receiving this because you authored the thread.Message ID: @.***>

skadauke commented 7 months ago

Since Missing Data Codes (MDCs) can be applied to any field type, I think we need to consider what we want to happen for each of the various field types if a MDC is enabled here. The trickiest would be the boolean-type fields (yes/no, true/false, multiple choice??). The two options I see are 1) we no longer handle these fields as boolean but turn them into factors or 2) keep them boolean and recode MDC values as NA.

I guess the default behavior could #2. However since that is also how an empty value is coded that could cause data loss which shouldn't happen silently. So I think we might want to have read_redcap() throw a warning when this is detected suggesting that the user define the desired behavior. This could be done with an optional arg to read_redcap(), e.g. boolean_mdc_fields_as_factor?

Thoughts?

rsh52 commented 7 months ago

So I think we might want to have read_redcap() throw a warning when this is detected suggesting that the user define the desired behavior. This could be done with an optional arg to read_redcap(), e.g. boolean_mdc_fields_as_factor?

IMO this sounds better than (1), I don't think the default should be to change the data types. Sort of similar to how we went about #177, this isn't exactly a typical setup so if a user wants to change something it's good to do so consciously.

Handling this is going to be non-trivial, to my knowledge there's no easy way to detect that missing data codes exist (unless I'm totally overlooking something!). I made a test database to verify, below is a sample data and metadata output for a yesno and a text field with an UNK MDC:

Browse[2]> db_data
  record_id yesno text form_1_complete
1         1     1 text               2
2         2     0 text               2
3         3   UNK  UNK               0
Browse[2]> db_metadata
# A tibble: 3 × 18
  field_name form_name section_header field_type field_label select_choices_or_calculat…¹ field_note
  <chr>      <chr>     <chr>          <chr>      <chr>       <chr>                        <chr>     
1 record_id  NA        NA             text       Record ID   NA                           NA        
2 yesno      form_1    NA             yesno      YesNo       NA                           NA        
3 text       form_1    NA             text       Text        NA                           NA        
# ℹ abbreviated name: ¹​select_choices_or_calculations
# ℹ 11 more variables: text_validation_type_or_show_slider_number <chr>, text_validation_min <chr>,
#   text_validation_max <chr>, identifier <chr>, branching_logic <chr>, required_field <chr>,
#   custom_alignment <chr>, question_number <chr>, matrix_group_name <chr>, matrix_ranking <chr>,
#   field_annotation <chr>

With our current setup, I suppose the best bet is to check each variable type in the metadata, then check for values that don't coincide with expected values. I'm not sure how we would do this for plain text fields. This may also add more processing time, which supports not having this as a default behavior.

pbilodeau94 commented 7 months ago

There’s no way to detect project-level MDCs using the API. The codebook in the metadata has information on which MDCs were enabled, but not at the variable level. Not sure exactly how your API call is set up (e.g whether there’s a missing data function), but the easiest way may be to ask the user to specify which MDCs were used in the specific project as a custom function. From: Rich Hanna @.> Date: Tuesday, March 19, 2024 at 9:10 AM To: CHOP-CGTInformatics/REDCapTidieR @.> Cc: pbilodeau94 @.>, Author @.> Subject: Re: [CHOP-CGTInformatics/REDCapTidieR] [BUG] yesno fields are converted to NA when the field contains missing data codes (Issue #181)

So I think we might want to have read_redcap() throw a warning when this is detected suggesting that the user define the desired behavior. This could be done with an optional arg to read_redcap(), e.g. boolean_mdc_fields_as_factor?

IMO this sounds better than (1), I don't think the default should be to change the data types. Sort of similar to how we went about #177https://github.com/CHOP-CGTInformatics/REDCapTidieR/pull/177, this isn't exactly a typical setup so if a user wants to change something it's good to do so consciously.

Handling this is going to be non-trivial, to my knowledge there's no easy way to detect that missing data codes exist (unless I'm totally overlooking something!). I made a test database to verify, below is a sample data and metadata output for a yesno and a text field with an UNK MDC:

Browse[2]> db_data

record_id yesno text form_1_complete

1 1 1 text 2

2 2 0 text 2

3 3 UNK UNK 0

Browse[2]> db_metadata

A tibble: 3 × 18

field_name form_name section_header field_type field_label select_choices_or_calculat…¹ field_note

1 record_id NA NA text Record ID NA NA 2 yesno form_1 NA yesno YesNo NA NA 3 text form_1 NA text Text NA NA # ℹ abbreviated name: ¹​select_choices_or_calculations # ℹ 11 more variables: text_validation_type_or_show_slider_number , text_validation_min , # text_validation_max , identifier , branching_logic , required_field , # custom_alignment , question_number , matrix_group_name , matrix_ranking , # field_annotation With our current setup, I suppose the best bet is to check each variable type in the metadata, then check for values that don't coincide with expected values. I'm not sure how we would do this for plain text fields. This may also add more processing time, which supports not having this as a default behavior. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>
ezraporter commented 7 months ago

It looks like the missing data codes are returned in the Export Project Info API method so we theoretically could get them. REDCapR::redcap_project_info_read() returns them but it's locked up in the dev version. I would very much like to avoid getting them but I'm not sure we have a good choice.

There's really only 3 cases we currently have with processing fields of different types. For the first 2 it's pretty trivial to remove missing data codes without knowing what they are. For the third it's basically impossible. That leaves us in a bind if we want a solution that:

  1. Treats all data types equally w/r/t missing data codes. Either convert them to NA everywhere or nowhere
  2. Doesn't need to know the missing data codes ahead of time

I think the least of all evils is to convert to NA where we can without knowing the codes for now (cases 1 and 2) and finish the job once REDCapR 1.2.0 is on CRAN (case 3). If we hate that transitional plan we query the API directly as a holdover.

Case 1: yesno, truefalse, checkbox

Current behavior: Convert to logical with as.logical()

Possible new behavior: Check for non-1/0/NA and warn if found. Convert those to NA then convert to logical.

Case 2: dropdown, radio

Current behavior: Convert to factor using levels read from metadata. (This implicitly converts missing codes to NA since they aren't in the metadata.)

Possible new behavior: Check for data values not in the levels read from metadata and warn if found. Convert to factor, implicitly converting missing codes to NA.

Case 3: Everything else

Current behavior: Do nothing

Possible new behavior: For now, do nothing. When REDCapR 1.2.0 is on CRAN, read the MDCs from project info and convert to NA with a warning.

We can add an option to silence warnings so users can shut them off when they don't have control of the REDCap project.

rsh52 commented 7 months ago

After further discussion some additional behavioral things to note for MDCs:

skadauke commented 7 months ago

Considering that I suspect only a small subset of REDCap databases use MDCs, possibly breaking changes to the supertibble data model, and likely performance hit with robustly supporting MDCs, I think what Ezra outlines above is a reasonable solution for now (i.e. implement cases 1 and 2). We can consider implementing case (3) after someone demonstrates the need for this by opening a feature request. We can write a warning message that states that we are not currently supporting MDCs but if this is needed they should file an issue on our GitHub page.

ezraporter commented 7 months ago

I noticed when working on this that we currently don't convert fields to logical when raw_or_label = "raw". Do we like that behavior? I don't have strong opinions either way.

rsh52 commented 7 months ago

I noticed when working on this that we currently don't convert fields to logical when raw_or_label = "raw". Do we like that behavior? I don't have strong opinions either way.

Like how yesno fields are 1/0 for "raw" but TRUE/FALSE for "label"? I also don't feel strongly, 1/0 is what REDCap itself uses for raw values for these fields so it seems ok to keep as is.

image

ezraporter commented 7 months ago

Like how yesno fields are 1/0 for "raw" but TRUE/FALSE for "label"? I also don't feel strongly, 1/0 is what REDCap itself uses for raw values for these fields so it seems ok to keep as is.

Yeah okay I agree and it actually will keep our code simpler to keep as is. PR coming soon but I'm leaning towards also keeping the missing data codes when raw_or_label = "raw". That has the advantage of keeping our code simpler and giving users a way to get the MDCs if they really want.

ezraporter commented 7 months ago

@pbilodeau94 when you have a chance let us know if #182 solved your issue. You can try it out by installing the dev version

pbilodeau94 commented 7 months ago

Yes, looks like it's all set now! Thank you!!