eth-mds / ricu

🏥 ICU data with R 🏥
https://eth-mds.github.io/ricu/
GNU General Public License v3.0
35 stars 10 forks source link

Allow levels to be missing for `fct_concept` #40

Open mlondschien opened 1 year ago

mlondschien commented 1 year ago

The docs state

' * fct_cncpt: In case of categorical concepts, such as sex, a set of

' factor levels can be specified, against which the loaded data is checked.

Currently, a set of factor levels needs to be specified. Sometimes this is not desirable. This PR allows not specifying levels in fct_cncpt.

nbenn commented 1 year ago

@mlondschien just had a quick look at this. Iiuc, what you're after is a concept class that allows for returning arbitrary strings. Sure, we could make the levels component optional. We could also add another concept, e.g. a chr_cncpt.

I guess the reason why we never felt a need for this is inter-source data harmonization: if you define a concept with categorical values, all data sources need to stick to the same "labels" otherwise you can't really do any "source-agnostic" down-steam work. For the concrete use-case ("caregiver"), is the set of levels very large? Or is there another reason for not specifying this ahead of time? Could this concept be implemented for other datasets as well?

Not super sure whether this actually works, as we never tried out this specifically. The heavy reliance on S3 was in part to allow for extensibility, so you might be able to define your own concept class (chr_cncpt), provide implementations for some ricu generics (load_concepts() being the obvious one) and you're good to go.

mlondschien commented 1 year ago

For the concrete use-case ("caregiver"), is the set of levels very large?

Yes, around ~10'000 levels.

Not super sure whether this actually works, as we never tried out this specifically.

I'm using this locally. It does work. Is there a downside to using levels here?

nbenn commented 1 year ago

@mlondschien I assume in that case you are not actually interested in the "values" themselves but more in a grouping this induces? If this is actually the case, why not simply return an "auto increment" ID (integer valued; in data.table you could use the .GRP "variable" for this)?

If you were interested in contributing such a concept to ricu, we'd have to think about how to do this across datasets. We'd have to somehow "namespace" the IDs, as the MIMIC caregiver no. 1 almost surely is distinct from the HiRID caregiver no. 1. But step by step.

mlondschien commented 1 year ago

@mlondschien I assume in that case you are not actually interested in the "values" themselves but more in a grouping this induces? If this is actually the case, why not simply return an "auto increment" ID (integer valued; in data.table you could use the .GRP "variable" for this)?

Yes, I one-hot encode them later. However, I prefer leaving them as is, as this simplifies downstream analysis when talking to collaborators. IIUC factors are nothing other than .GRP plus a dictionary under the hood, right?

If you were interested in contributing such a concept to ricu, we'd have to think about how to do this across datasets. We'd have to somehow "namespace" the IDs, as the MIMIC caregiver no. 1 almost surely is distinct from the HiRID caregiver no. 1. But step by step.

I currently simply treat source + caregiver as an identifier.

mlondschien commented 9 months ago

Any chance we could merge this? This would allow me to work with the main branch of ricu and not my personal fork.