eth-mds / ricu

🏥 ICU data with R 🏥
https://eth-mds.github.io/ricu/
GNU General Public License v3.0
37 stars 10 forks source link

Separate removal of NA and out-of-range values #10

Closed prockenschaub closed 2 years ago

prockenschaub commented 2 years ago

Problem

ricu currently provides a joint warning when out-of-range values or NA are removed during loading of a concept. For example, when loading temp (body temperature) in the eICU demo we obtain the following message:

library(ricu)
res <- load_concepts("temp", src = "eicu_demo")
#> -- Loading 1 concept -----------------------------------------------------------------------------------------------------------
#> * temp                                                                                                                        
#>   ( ) removed 1525347 (93.3%) of rows due to out of range (or `NA`) entries
#> --------------------------------------------------------------------------------------------------------------------------------

The current behaviour makes it hard to judge whether the rows were primarily removed because of missing values or because of out-of-range values. While missing values may indicate a sparsely filled matrix (as is the case here), many out-of-range values may instead suggest unit conversion errors. Currently, one needs to delve into concept definitions and source tables to identify which ones the cause.

Proposed solution

Separating these cases into distinct messages may make it clearer on a first glance why rows were excluded. While this adds an extra row to the output of a (verbose) load_concept call, I believe it would be worth it for clarity. The result for the above example is:

library(ricu)
res <- load_concepts("temp", src = "eicu_demo")
#> -- Loading 1 concept -----------------------------------------------------------------------------------------------------------
#> * temp                                                                                                                        
#>   ( ) removed 1522140 (93.1%) of rows due to `NA` values
#>   ( ) removed 3207 (0.2%) of rows due to out of range entries
#> --------------------------------------------------------------------------------------------------------------------------------

Note

This currently merges into main, as it seems to be the most up-to-date. Please let me know if you'd like me to merge into dev or another branch instead.

nbenn commented 2 years ago

Sure, sounds like a reasonable suggestion. There are a couple of CI failures, unrelated to your PR, I'll have to check out. This PR is still WIP? From the quick look I had, it doesn't seem this gives you the desired behavior, right?

prockenschaub commented 2 years ago

I had added a call to your function rm_na_val_var on line 314. As far as I understand it, this first removes all NAs and then also prints a message if any where removed. Then the boundaries are applied to this new x without missing data and another message is generated as per your original code if any rows were removed this way. Both message express the number of removed rows as a percentage of the original number of rows, now renamed to n_total.

This isn't the most beautiful solution and was mostly a proof of concept, can see if I can refactor it.