eth-mds / ricu

🏥 ICU data with R 🏥
https://eth-mds.github.io/ricu/
GNU General Public License v3.0

Best practice for col names when combining datasets #13

Open prockenschaub opened 2 years ago

prockenschaub commented 2 years ago

Problem

When combining databases (say MIMIC III and eICU), the names of the ID variables and the time variable depend on the order in which sources are passed to load_concepts. See the following reprex inspired by the quick start guide:

library(ricu)

src <- c("mimic_demo", "eicu_demo")

load_concepts("alb", src, verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `icustay_id`
#> # Units:      `alb` [g/dL]
#> # Index var:  `charttime` (1 hours)
#>       source     icustay_id charttime   alb
#>       <chr>           <int> <drtn>    <dbl>
#>     1 eicu_demo      141765  -2 hours   3.7
#>     2 eicu_demo      144815  -3 hours   4.2
#>     3 eicu_demo      144815   8 hours   3.6
#>     4 eicu_demo      145427  -6 hours   3.7
#>     5 eicu_demo      147307  -6 hours   3.5
#>     …
#> 6,653 mimic_demo     298685 130 hours   1.9
#> 6,654 mimic_demo     298685 154 hours   2
#> 6,655 mimic_demo     298685 203 hours   2
#> 6,656 mimic_demo     298685 272 hours   2.2
#> 6,657 mimic_demo     298685 299 hours   2.5

load_concepts("alb", rev(src), verbose = FALSE)
#> # A `ts_tbl`: 6,657 ✖ 4
#> # Id vars:    `source`, `patientunitstayid`
#> # Units:      `alb` [g/dL]
#> # Index var:  `labresultoffset` (1 hours)
#>       source     patientunitstayid labresultoffset   alb
#>       <chr>                  <int> <drtn>          <dbl>
#>     1 eicu_demo             141765  -2 hours         3.7
#>     2 eicu_demo             144815  -3 hours         4.2
#>     3 eicu_demo             144815   8 hours         3.6
#>     4 eicu_demo             145427  -6 hours         3.7
#>     5 eicu_demo             147307  -6 hours         3.5
#>     …
#> 6,653 mimic_demo            298685 130 hours         1.9
#> 6,654 mimic_demo            298685 154 hours         2
#> 6,655 mimic_demo            298685 203 hours         2
#> 6,656 mimic_demo            298685 272 hours         2.2
#> 6,657 mimic_demo            298685 299 hours         2.5
#> # … with 6,647 more rows

As you can see, although the data are exactly the same, the names of the ID and index columns depend on the order of src. This prevents me, for example, from simply appending the same concept loaded from two different databases:

library(dplyr)  # provides bind_rows()

bind_rows(
  load_concepts("alb", "mimic_demo", verbose = FALSE),
  load_concepts("alb", "eicu_demo", verbose = FALSE)
)

#> # A `ts_tbl`: 6,657 ✖ 5
#> # Id var:     `icustay_id`
#> # Index var:  `charttime` (1 hours)
#>       icustay_id charttime   alb patientunitstayid labresultoffset
#>            <int> <drtn>    <dbl>             <int> <drtn>
#>     1         NA  NA hours   3.4           3352333   2 hours
#>     2         NA  NA hours   3.3           3352333  11 hours
#>     3         NA  NA hours   3.1           3352333  36 hours
#>     4         NA  NA hours   3.4           3353113 -36 hours
#>     5         NA  NA hours   3.6           3353113  10 hours
#>     …
#> 6,653     201006   0 hours   2.4                NA  NA hours
#> 6,654     203766 -18 hours   2                  NA  NA hours
#> 6,655     203766   4 hours   1.7                NA  NA hours
#> 6,656     204132   7 hours   3.6                NA  NA hours
#> 6,657     204201   9 hours   2.3                NA  NA hours
#> # … with 6,647 more rows

Question

Am I missing something obvious here, and am I supposed to do something differently? I did find the helper functions id_vars and index_var, which let me recover what the names are, but this seems cumbersome. It also does not allow me to merge on a specific ID level only (e.g. admissions) without remembering what this column was called in the first database I passed to load_concepts.

What was the reasoning underlying this design choice and would it be more practical to rename them directly to patient, hadm, and icustay, as returned e.g. by as_id_cfg(mimic_demo)?
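To make the request concrete, here is a minimal sketch of the renaming step, written in base R on plain data frames rather than on `ts_tbl` objects. The helper `harmonise_meta_cols()` and the target names `icustay` and `time` are hypothetical and not part of ricu; the stand-in rows are copied from the output above, with times stored as plain numeric hours for simplicity. With real ricu tables, the current column names would presumably be looked up with id_vars() and index_var(), and renaming this way may not carry over the `ts_tbl` metadata.

```r
# Hypothetical helper (NOT part of ricu): rename one ID column and the time
# column of a plain data frame to source-independent names.
harmonise_meta_cols <- function(df, id_col, index_col, source,
                                new_id = "icustay", new_index = "time") {
  stopifnot(id_col %in% names(df), index_col %in% names(df))
  names(df)[names(df) == id_col] <- new_id
  names(df)[names(df) == index_col] <- new_index
  # Record the origin so IDs from different databases cannot be confused
  df$source <- source
  meta <- c("source", new_id, new_index)
  df[, c(meta, setdiff(names(df), meta))]
}

# Stand-ins for the two single-source results (rows taken from the output
# above; time is in hours, stored as numeric instead of difftime)
mimic <- data.frame(
  icustay_id = c(298685L, 298685L),
  charttime  = c(130, 154),
  alb        = c(1.9, 2.0)
)
eicu <- data.frame(
  patientunitstayid = c(141765L, 144815L),
  labresultoffset   = c(-2, -3),
  alb               = c(3.7, 4.2)
)

# After harmonising, the column names match and the tables can be stacked
combined <- rbind(
  harmonise_meta_cols(mimic, "icustay_id", "charttime", "mimic_demo"),
  harmonise_meta_cols(eicu, "patientunitstayid", "labresultoffset", "eicu_demo")
)
print(combined)
```

Once both tables share the same column names, appending them no longer produces the NA-padded columns shown earlier, and the result does not depend on the order in which the sources are processed.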

dplecko commented 11 months ago

This behavior is by design (currently). If I understand correctly, the suggestion would be to have:

In this way, whenever load_concepts is invoked, meta_vars values would not depend on the data source from which the data is loaded.

I discussed this with @nbenn, and I think this may be a good suggestion. We should perhaps allow for this in the next version of ricu.