Export of checkbox items and `_complete` fields

januz commented 2 years ago

Hi @wibeasley, I assume this is not an {REDCapR}-specific problem but a general problem with the API but I hope that you know ways how to properly deal with these issues:

When exporting a checkbox field that does not have a selection, all associated ___[1-n] fields are set to 0; this is also true if the complete form the checkbox items is part of, is empty:
- This is quite misleading because it looks as though the item is not incomplete.
- This also causes me headaches when exporting data for our study for sharing (thousands of fields, quite a few of them are checkbox fields) because I'd rather not export data for cases that did not provide data for this item or the complete form/table that the items is a part of.
When exporting a form that does not have a completion status for a given record, the _complete field is set to 0:
- This has recently caused issues for me because I imported data that I had assembled using {REDCapR} into another system using the file import functionality and set forms that should have remained with no completion status to "incomplete" (0)

I feel like that in both cases the "proper" export would be to have NAs instead of 0s. That would make it easy to differentiate cases

who have filled in a checkbox item from those who have not (or have not provided any data for the form the checkbox item is part of)
who have no status for a given form from those who have an actual "incomplete" status

Thanks for your input!

wibeasley commented 2 years ago

I agree these two aspects aren't ideal. I think there is an accepted hack for each one though:

I think what some people do is create a hidden datetime field (in each form) that is automatically populated with @NOW. So if the form hasn't been touched, this datetime field will be null/blank.
Does #386 help? It's been a few days since I did that. I think a gray circle is equivalent to a null. Double-check me though.

januz commented 2 years ago

Thanks so much, @wibeasley, the exportBlankForGrayFormStatus API option should solve issue 2 (only problem for me personally is that it is only available in newer REDCap versions).

As for 1, the method you describe works well for forms that are completely empty (using the exportBlankForGrayFormStatus option and checking for NAs in the _complete field would also work).

But there are cases where just the checkbox field is missing but the rest of the form has data. In those, one would not be able to easily differentiate between cases that did or did not fill in information without having to consider all ___[1-n] columns associated with a given checkbox field and summing them up (if 0, the field is "missing"). It would be ideal / more appropriate to just have NAs for all the ___[1-n] columns. I wonder why the API was designed in this way...

Thanks for your help!!

wibeasley commented 2 years ago

I wonder why the API was designed in this way...

Do you have access to the database, specifically the redcap_data table? If not, I think you can determine the same thing with the API playground. See if that record exists for the checkboxes. If it doesn't, I think we have a window.

If the record doesn't exist in the table, we could extract the data as an eav. Essentially REDCap's PHP code doesn't widen/pivot it --REDCapR's R code does. I've been working off and on with this semi-hidden function: https://ouhscbbmc.github.io/REDCapR/reference/redcap_read_oneshot_eav.html

Try this: go to REDCap's API playground and select a single record (where you'd like the checkboxes to be null, not zero). Also make sure "eav" is selected instead of "flat".

januz commented 2 years ago

@wibeasley Thanks so much, I didn't know about the eav option for API exports. When I try it out in the API playground, the output looks like one would expect, i.e., no rows if no options were selected for a given checkbox.

I cloned the development version of this repository but my first attempts with redcap_read_oneshot_eav() have been unsuccessful. It seems to pull the data correctly (the printed raw text output is as expected) but the postprocessing into a tibble fails (error: "The REDCap read failed")

wibeasley commented 2 years ago

Oh good. If the rows aren't there, that makes it easier for the code to decide if they should be false or NA. Will you please create & attache a test project (dictionary and dataset) that demonstrates the problem & desired behavior? Maybe ~5 rows with ~4 checkboxes and a comment field. The comment field includes a statement that reinforces your desired outcome (eg, "ideally these checkboxes are returned as NAs" and "the checkboxes are returned as TRUE/FALEs".

Do you want me to work on the EAV side, or do you want another crack? This is good timing because the EAV pivoting will really benefit from the latest REDCapR function, `redcap_metadata_coltypes() (#405).

januz commented 2 years ago

Hi @wibeasley Thanks so much for working with me! I put together the project that you described and am attaching here

the data dictionary
the (wrongly) exported data (which mimics what one gets through the API without choosing EAV, see below)
an XML file of the project setup and data if you would like to restore it on your system

https://drive.google.com/file/d/17RyeCofnVPrWr5RgQkklkLdVhvDUmD_H/view?usp=sharing

In addition to the comments I left in the descriptive field, here the summary

record 1 has data for both checkboxes and is thus showing up correctly
record 2 has only entries in the second checkbox but all ___ variables associated with the first checkbox have 0s while they should be NA
record 3 has only entries in the first checkbox but all ___ variables associated with the second checkbox have 0s while they should be NA
record 4 has no data at all in the form (i.e., the form was never saved). They should have NAs for all checkbox items and the _complete variable but all ___ variables associated with the both checkboxes have 0s and the _complete variable is set to 1

It would be fantastic if you could steer the development. I played around with your current version a little bit yesterday but did not make any progress on improving it. In addition, I came down with COVID today so I won't be at my best physically and cognitively for a bit :/ The new function that you mentioned sounds great. From my testing yesterday, setting the correct data types seems one of the more difficult and very time consuming parts of the function, especially with a data dictionary as huge as ours (>38,000 variables). I guess that was one of your motivations for developing it? Thanks so much!

wibeasley commented 2 years ago

Sorry about covid. Don't worry about responding soon. I'm writing some of this to remind myself in the future.

Your description above is perfect and helped me see the weakness with the REDCap extract process (independent of the API or R layers) I was expecting the exported csv to have zeros (instead of nulls/blanks). But the xml does too (thanks for sending it, btw.

When I saw the xml, I thought I'd have to delete that record and recreate it (without touching the 2nd form). And I'm surprised that the playground (loaded from the xml) doesn't have those rows.

januz commented 2 years ago

@wibeasley it's indeed surprising that the XML also contains the 0s! Let me know if you need further input of if I can help with testing or anything.

januz commented 1 year ago

@wibeasley I am happy to see that you have a few commits associated with this issue. Did you make progress on the implementation? Is there something / a branch that I could try out and/or something I could help with? I will soon have to export data from our humongous REDCap project and the EAV functionality (and associated correct export, or non-export, of checkbox item data) would be amazing to have. Thank you!

wibeasley commented 1 year ago

I forgot exactly where I left it last week. Whatever it was should be on the dev branch. I think I had finished augmenting the metadata function so that it returned info the eav function needed to pivot.you might want to poke around if your deadline is soon. I'm at REDCapCon for another two days. I hope to get back to this early next week.

januz commented 1 year ago

Ah cool, have a great time at REDCapCon! I'll try to check it out your current implementation soon.

wibeasley commented 1 year ago

@januz, I'm working on this now. I did some things in other functions (#427) that should make this easier eventually, but it would be hard for someone else to pick it up right now. Gimme a few days.

januz commented 1 year ago

@wibeasley Great, thank you so much!! I'm looking forward to testing it.

januz commented 1 year ago

@wibeasley Sorry to bother you again with this but it looks as though you made some progress with this issue? Fortunately, our data release has been pushed out so I will have some extra time before I need this functionality but I'm very interested in using it as I feel like the current export of checkbox items is misleading... Thanks for all your work on this!

wibeasley commented 1 year ago

note: when @januz & I met yesterday, we decided that when coalescing checkboxes, the form's *_complete value should be considered. If it's missing, then the checkbox should not be coalesced w/ false.

https://github.com/OuhscBbmc/REDCapR/blob/e267521f513790744b81bc01a0010ae4a22236b1/playgrounds/eav-playground-2.R#L117-L118

I think the tricky part would be to account for the project's & the form's value set of record_id, redcap_event_name, redcap_repeat_instrument, and redcap_repeat_instance. We're hoping to do this entirely with metadata-like calls, and not rely on the data itself.

januz commented 1 year ago

Hey @wibeasley, I finally got around to working on this. I think that I have figured out a way that takes into account the different project settings (longitudinal, repeating, combination of both, or neither). I wasn't able to do so completely independently of the data. For example, the repeat status of a project does not determine whether the redcap_repeat_... columns are included in the eav export but the requested data does. But I think that my solution could be used to develop a general export function that allows to export checkbox data in a more correct format.

On that note, I think that I previously didn't communicate the goal clearly enough (or this might have gotten mixed up because when I initiated this issue I was in addition talking about the _complete fields):

The goal is not to set all ___ fields resulting from checkbox items to NA if the form completion status is missing, i.e., the whole form has not been done by a participant
The goal is to set all ___ fields resulting from each checkbox item to NA if there has been no box checked for that specific item, i.e., independent of whether the form was completed or not

The standard API exports sets all ___ fields to 0/FALSE, independent of whether a box has been checked for that item or whether it was even seen by a participant. What I would like to achieve is that

if at least one checkbox has been selected, all other ___ fields for that item are set to 0/FALSE (standard behavior)
if none of the checkbox items has been selected/touched, all ___ fields for that item are set to NA

I feel like this is a more meaningful representation of the data that doesn't suggest one can assume that each of the ___ fields should be considered 0/FALSE for that participant but shows it as missing data.

I tried to push my changes (I create a new playground file and, based on one of your function, added a function that gets the repeating forms and event associations from the API) but I don't have the privilege to. I will send you an email with those two files and tokens for 4 different REDCap projects I set up for testing

neither longitudinal nor repeating
longitudinal
repeating
both longitudinal and repeating

I set up all projects in a way that they use the same data dictionary with a setup form with the record_id and three other forms one of which contains checkbox items. Each has data for 4 records

the first one has all items and forms completed
the second and third one have different forms from each other completed, half completed, or missing
- the half completed ones leave out responses for different checkbox items
the fourth one has only a record_id but doesn't have any other data

At the bottom of the playground script I use compareDF to show the difference between the standard export and the processed eav export. For example, this shows the difference for the project that is neither longitudinal nor repeating:

This shows the differences between the different export styles:

no difference for subject 1 who has filled in all items
subject 2 and 3 have NAs for checkbox items that were not filled in for the eav export and FALSE for the standard export
subject 4 does not have any data besides the record_id for the eav export but FALSEs all over for the standard export

I hope this is useful. It would be great if you could test on your end and let me know what you think. Thanks so much!

januz commented 1 year ago

@wibeasley Just an update as I've been using the above laid out approach and a variant of the code that I sent you to export data from our massive REDCap project (for example, I added some batching code that divides up the returned data into batches of around 1000 rows each, based on the requested participants events fields).

Overall, it works great but I've run into a few issues

I wasn't expecting that redcap_event_instruments() does not export all arms by default; I changed to using redcapAPI::exportMappings() for now and left an issue here to request changing the default behavior: https://github.com/OuhscBbmc/REDCapR/issues/482
I ran into a few cases where a form had previously been associated with an event (i.e., data for that form and event exists) but is no longer associated with the event for some reason; in these cases the approach of relying on getting the event-form mappings to assemble eav_possible doesn't work; we might need to change this or at least catch these issues and warn
Curiously, I had to learn that in just a few cases (out of a massive project with ~40,000 fields and 12,000 participants * 13 events), the EAV API method exports returned two values for the same participant/event. I don't know why these dual entries existed in the first place but I opted to delete the value that wasn't the current one (the one displayed in the interface and returned by the standard API method) in the database; the order of the duplicates was regular, so one could probably just ignore the older entry when assembling the output data frame

lz100 commented 8 months ago

Hi @wibeasley, @januz and I put together a working function for checkboxes. Since we are not familiar with your code base and internal functions, it will be better for you to take a look, and I hope it will be helpful. The idea is to take this function as your internal function and give an option to users to let them choose what to do with the checkbox in the redcap_read functions.

currently, it works like a one_shot function. The batch_size argument is not used, but I'm sure you know better than me how to create the batches with your internal functions.
the " EAV API method exports returned two values for the same participant/event." The issue mentioned above is due to some old records that have not been cleaned in the database. This happens very rarely and we don't currently have an example. In our observations, the last row of the participant/event is the lasted, so something like ds_eav |> arrange(desc(row_number()) |> distinct() should solve the problem.

OuhscBbmc / REDCapR

Export of checkbox items and `_complete` fields #420