alan-turing-institute / eider

eider: an R package for processing health records declaratively
https://alan-turing-institute.github.io/eider/

Date merging #45

Closed yongrenjie closed 7 months ago

yongrenjie commented 8 months ago

Not yet implemented, but this is a rough draft of what we have in mind.

The feature JSON shall be extended with a preprocess entry:

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "min": ["admission_date"],
        "max": ["discharge_date"]
    },
    ...
}

The meaning of this is that the value in the admission_date column will be replaced with the minimal (i.e. earliest) value of admission_date in all rows with the same value of cis_marker.

(Likewise with discharge_date but using the maximal, i.e. latest, value instead.)

The library shall then preprocess the input table accordingly before running the remainder of the transformation. Here is an example of the preprocessing step:

before preprocessing
--------------------
admission_date  discharge_date  cis_marker  episode_within_cis  something
2023-02-14      2023-02-17      100         1                   123
2023-02-17      2023-02-19      100         2                   456
2023-02-19      2023-02-20      100         3                   789

after preprocessing
-------------------
admission_date  discharge_date  cis_marker  episode_within_cis  something
2023-02-14      2023-02-20      100         1                   123
2023-02-14      2023-02-20      100         2                   456
2023-02-14      2023-02-20      100         3                   789
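
The before/after tables above can be reproduced with a short sketch. This is plain Python rather than the package's R, and the function name `preprocess_replace` and its parameters are hypothetical, not part of eider. It also assumes dates are ISO-formatted strings, so that string comparison matches chronological order.

```python
from collections import defaultdict


def preprocess_replace(rows, on, min_cols=(), max_cols=()):
    """Hypothetical helper: overwrite columns with their group-wise min / max."""
    # Group rows by the value of the `on` column (e.g. cis_marker).
    groups = defaultdict(list)
    for row in rows:
        groups[row[on]].append(row)

    result = []
    for row in rows:
        group = groups[row[on]]
        new_row = dict(row)  # copy, so the input table is left untouched
        for col in min_cols:
            new_row[col] = min(r[col] for r in group)
        for col in max_cols:
            new_row[col] = max(r[col] for r in group)
        result.append(new_row)
    return result


# The three rows from the "before preprocessing" table.
rows = [
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-17",
     "cis_marker": 100, "episode_within_cis": 1, "something": 123},
    {"admission_date": "2023-02-17", "discharge_date": "2023-02-19",
     "cis_marker": 100, "episode_within_cis": 2, "something": 456},
    {"admission_date": "2023-02-19", "discharge_date": "2023-02-20",
     "cis_marker": 100, "episode_within_cis": 3, "something": 789},
]

processed = preprocess_replace(
    rows,
    on="cis_marker",
    min_cols=["admission_date"],
    max_cols=["discharge_date"],
)
```

Only the listed columns are overwritten; every other column, and the number of rows, stays the same.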

At this point if we want to:

Generalisations

It should be straightforward to apply preprocess.min and preprocess.max to multiple columns.

We could potentially have entries such as preprocess.first and preprocess.last, which would replace the values in the given column with the value from the first (or, respectively, last) row having the same cis_marker.
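
As a rough illustration of what first/last might mean, here is a sketch in plain Python. The function `preprocess_first_last` and its parameters are hypothetical, and it assumes that "first" and "last" refer to the order in which rows appear in the input table, which the proposal does not specify.

```python
def preprocess_first_last(rows, on, first_cols=(), last_cols=()):
    """Hypothetical helper: overwrite columns with the first / last value per group."""
    first_seen = {}
    last_seen = {}
    for row in rows:
        first_seen.setdefault(row[on], row)  # keeps the first row of each group
        last_seen[row[on]] = row             # ends up holding the last row of each group

    result = []
    for row in rows:
        new_row = dict(row)
        for col in first_cols:
            new_row[col] = first_seen[row[on]][col]
        for col in last_cols:
            new_row[col] = last_seen[row[on]][col]
        result.append(new_row)
    return result


# Illustrative rows, reusing cis_marker and something from the example table.
example_rows = [
    {"cis_marker": 100, "episode_within_cis": 1, "something": 123},
    {"cis_marker": 100, "episode_within_cis": 2, "something": 456},
    {"cis_marker": 100, "episode_within_cis": 3, "something": 789},
]

with_first = preprocess_first_last(example_rows, on="cis_marker", first_cols=["something"])
with_last = preprocess_first_last(example_rows, on="cis_marker", last_cols=["something"])
```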

Note that the above assumes that the column to be merged on (i.e. preprocess.on) is the same throughout. In principle one may want to have multiple preprocessing steps with different on columns, but the use case for this is not clear.

simonrnss commented 7 months ago

One possible generalisation: in this data we do have the handy episode_within_cis marker, but in general we may not. So, could we have the capability to filter more generally (e.g. on min date)? It might get tricky, as there would be some edge cases around how to handle ties...

If it's just equality filtering then that's fine -- the onus would be on the user to populate a column that allowed equal filtering prior to processing.

yongrenjie commented 7 months ago

Cheers @simonrnss. Hmm this is interesting. There's a fair bit of information to encode here. Maybe we could preprocess to retain only

Maybe for the original case (where we overwrite admission_date with its minimum value and discharge_date with its maximum value) we could write

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "replace_with_min": ["admission_date"],
        "replace_with_max": ["discharge_date"]
    },
    ...
}

and for this case (where we want to only keep the first row of each stay) we can do

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}

and in code this would be

helendduncan commented 7 months ago

With the SMR04 data, sorting can be done without the episode_within_cis column by using the second spec from the example above:

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}

Here the rows are ordered primarily by admission date, with discharge date used as a tie-breaker when admission dates are equal.

Any other preprocessing will be the responsibility of the user.
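
The retain_min behaviour discussed above could look roughly like the following sketch, written in plain Python rather than R. The function `preprocess_retain_min` is hypothetical, and the sample rows are made up to show a tie on admission date. It assumes ISO-formatted date strings, and that when two rows tie on every listed column the one appearing first in the input is kept; the thread does not settle that last point.

```python
def preprocess_retain_min(rows, on, by):
    """Hypothetical helper: keep one row per group, minimal on the `by` columns."""
    best = {}  # group value -> (sort key, row)
    for row in rows:
        # Tuples compare element by element, so the first column in `by`
        # dominates and later columns only break ties.
        key = tuple(row[col] for col in by)
        current = best.get(row[on])
        if current is None or key < current[0]:
            best[row[on]] = (key, row)
    # Dicts preserve insertion order, so groups come out in order of first appearance.
    return [row for _, row in best.values()]


# Made-up rows: two episodes in stay 100 share the same admission date.
sample_rows = [
    {"admission_date": "2023-02-17", "discharge_date": "2023-02-19",
     "cis_marker": 100, "something": 456},
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-17",
     "cis_marker": 100, "something": 123},
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-15",
     "cis_marker": 100, "something": 789},
    {"admission_date": "2023-03-01", "discharge_date": "2023-03-02",
     "cis_marker": 200, "something": 321},
]

kept = preprocess_retain_min(
    sample_rows,
    on="cis_marker",
    by=["admission_date", "discharge_date"],
)
```

For stay 100, the two rows admitted on 2023-02-14 tie on admission date, so the one with the earlier discharge date is retained.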