eth-mds / ricu

🏥 ICU data with R 🏥
https://eth-mds.github.io/ricu/
GNU General Public License v3.0
33 stars 11 forks source link

Make data source definition more flexible to allow for dataset updates #37

Closed mlondschien closed 6 months ago

mlondschien commented 11 months ago

Related: https://github.com/eth-mds/ricu/issues/33, https://github.com/eth-mds/ricu/issues/28.

The links (and thus versions) of data sources are currently hardcoded in the data-sources.json. This makes ricu unnecessarily inflexible. Adding a new data source requires changing the ricu source code (e.g., https://github.com/eth-mds/ricu/pull/30) as does benefitting from the data source's updates (e.g., caregiver_id was added to miiv tables in v2.2, which is currently not accessible via ricu).

prockenschaub commented 11 months ago

I agree with you that this is urgently needed.

This will probably require a bit more thought and some more involved restructuring/refactoring. Would you have a proposal on how to go about this and/or resources to support the implementation of such a proposal?

mlondschien commented 11 months ago

Honestly, my R knowledge is too limited to be of technical help here.

Design-wise, I believe that updating existing data sources might be tricky if you want to keep backward compatibility.

nbenn commented 9 months ago

@mlondschien I believe we briefly discussed this offline, but just to re-iterate: what you claim above is not true.

Adding a new data source requires changing the ricu source code (e.g., https://github.com/eth-mds/ricu/pull/30).

Data source definitions (as are concept definitions) are user configurable. Maybe the docs at ?ricu::load_src_cfg can serve as starting point for this. We have used this in several places ourselves (and I'm sure @dplecko could share his experience here). Having external data sets and integrating them with the "core" ricu data sets is a central piece of functionality offered by ricu. If you would want to use a newer version of one of ricu's data sets you can use this alongside the older version or in its place.

mlondschien commented 9 months ago

Thanks @nbenn! Some guidance on how to do this, optimally an example, would be great.

nbenn commented 9 months ago
  1. check out docs starting with ?ricu::load_src_cfg
  2. create your new source config as JSON (maybe call it something like mimic2, make sure (via the class_prefix entry, objects derived from this config inherit from the respective "mimic" classes; copy the mimic config ricu ships and change where necessary
  3. walk through the steps of download_src(), import_src() and attach_src(); most of this should work out of the box via inheritance; maybe check out docs here too
  4. make sure load_difftime(), load_src(), etc. work, but I believe they should, again via inheritance and assuming that there will not be changes between versions that affect the logic here
  5. add concepts for the new data source (mostly a copy/paste exercise again) to a local concept dictionary

I don't know in what ways newer mimic versions differ from what we had ~3 years ago. I'm assuming it's mostly adding some table(s), adding/removing patients. Maybe they dropped CareVue patients? That would simplify concept specification as fewer items have to be considered, but again seems largely backwards compatible. From a distance, I'm hoping that these changes do not really require changing mimic-specific logic but can mostly be handled by config.

Maybe also check in with @prockenschaub as he's working on adding new mimic to ricu I believe.

prockenschaub commented 9 months ago

I created a (temporary) repository that collects the all the different databases configs, including a short description of how to load them.

@mlondschien maybe you can have a look and check that it works for you. Long-term, I think it would be nice for backward compatibility to include those earlier versions with ricu. Maybe we can also turn the README in my repo into a proper vignette?

@nbenn while I agree that what Malte wants should already be possible, I think we need to provid more guidance for the user. From my experience, the learning curve of ricu can be quite steep. I think shipping configs for each version of MIMIC IV could already go a long way of easing people into it and also highlighting what parts of the config need to be changed to achieve what. load_difftime() etc. thankfully have not been affected by version changes so far. What may change is concepts, though, so a systematic way to not copy paste but inherit and, where necessary, overwrite concepts for new database versions may make sense.

dplecko commented 6 months ago

I have discussed this with @nbenn. While I can see that having dataset versioning would be useful, this would create quite a lot of overhead in terms of re-structuring things. Therefore, we will not prioritize this currently, given the fact that other things have a better value/time invested ratio. Instead, we have opted for a different solution: we will try to be up-to-date with the "latest" version of each dataset.

For cases in which an older version of a dataset is needed, one may add this dataset locally. If you think examples for how this could be done would be helpful, please feel free to open a new issue. Adding local datasets is described to some degree here.