Date merging - Githubissues

yongrenjie commented 8 months ago

Not yet implemented, but this is a rough draft of what we have in mind.

The feature JSON shall be extended with a preprocess entry:

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "min": ["admission_date"],
        "max": ["discharge_date"]
    },
    ...
}

The meaning of this is that the value in the admission_date column will be replaced with the minimal (i.e. earliest) value of admission_date in all rows with the same value of cis_marker.

(Likewise with discharge_date but using the maximal, i.e. latest, value instead.)

The library shall then preprocess the input table in this way before running the remainder of the transformation. Here is an example of the preprocessing step:

before preprocessing
--------------------
admission_date  discharge_date  cis_marker  episode_within_cis  something
2023-02-14      2023-02-17      100         1                   123
2023-02-17      2023-02-19      100         2                   456
2023-02-19      2023-02-20      100         3                   789

after preprocessing
-------------------
admission_date  discharge_date  cis_marker  episode_within_cis  something
2023-02-14      2023-02-20      100         1                   123
2023-02-14      2023-02-20      100         2                   456
2023-02-14      2023-02-20      100         3                   789
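For illustration, here is a minimal sketch of the min/max step in plain Python. It assumes rows are dicts and dates are ISO-8601 strings (which sort chronologically as text). The function name preprocess_min_max and its parameters are hypothetical and are not part of the library.

```python
from collections import defaultdict


def preprocess_min_max(rows, on, min_cols=(), max_cols=()):
    """Hypothetical helper: return copies of `rows` in which each column in
    `min_cols` / `max_cols` holds the group-wise min / max over `on`."""
    # Collect the rows belonging to each value of the `on` column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[on]].append(row)

    result = []
    for row in rows:
        group = groups[row[on]]
        new_row = dict(row)  # copy, so the input table is left untouched
        for col in min_cols:
            new_row[col] = min(r[col] for r in group)
        for col in max_cols:
            new_row[col] = max(r[col] for r in group)
        result.append(new_row)
    return result


# The "before preprocessing" table from the example above.
rows = [
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-17",
     "cis_marker": 100, "episode_within_cis": 1, "something": 123},
    {"admission_date": "2023-02-17", "discharge_date": "2023-02-19",
     "cis_marker": 100, "episode_within_cis": 2, "something": 456},
    {"admission_date": "2023-02-19", "discharge_date": "2023-02-20",
     "cis_marker": 100, "episode_within_cis": 3, "something": 789},
]

merged = preprocess_min_max(
    rows,
    on="cis_marker",
    min_cols=["admission_date"],
    max_cols=["discharge_date"],
)
```

Only the listed date columns change; episode_within_cis and something keep their per-row values.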

At this point, if we want to:

- count the number of stays satisfying some criterion:
  we can filter and then perform NUNIQUE on cis_marker
- perform some other transformation on the first episode of each stay:
  we can filter for episode_within_cis == 1 and then perform the desired transformation
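Both follow-up operations can be sketched in plain Python on the preprocessed table. The criterion used here (a stay of at least 5 days) and the helper stay_length_days are made up for illustration.

```python
from datetime import date

# The "after preprocessing" table from the example above.
rows = [
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-20",
     "cis_marker": 100, "episode_within_cis": 1, "something": 123},
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-20",
     "cis_marker": 100, "episode_within_cis": 2, "something": 456},
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-20",
     "cis_marker": 100, "episode_within_cis": 3, "something": 789},
]


def stay_length_days(row):
    """Hypothetical helper: length of the merged stay in days."""
    start = date.fromisoformat(row["admission_date"])
    end = date.fromisoformat(row["discharge_date"])
    return (end - start).days


# Count stays satisfying a criterion: filter, then count distinct cis_marker
# values (the equivalent of NUNIQUE on cis_marker).
matching = [r for r in rows if stay_length_days(r) >= 5]
n_stays = len({r["cis_marker"] for r in matching})

# Work on the first episode of each stay: filter on episode_within_cis == 1.
first_episodes = [r for r in rows if r["episode_within_cis"] == 1]
```

Counting distinct cis_marker values rather than rows is what stops a three-episode stay from being counted three times.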

Generalisations

It should be straightforward to apply preprocess.min and preprocess.max to multiple columns.

We could potentially have entries such as preprocess.first and preprocess.last, which would replace the values in the given column with the value from the first (or, respectively, last) row having the same cis_marker.
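A possible reading of first and last, sketched in plain Python. This assumes "first" and "last" refer to the order in which rows appear in the input table, which the proposal does not specify; the function name and parameters are hypothetical.

```python
def replace_with_first_last(rows, on, first_cols=(), last_cols=()):
    """Hypothetical helper: overwrite columns with the value taken from the
    first / last row (in input order) that shares the same `on` value."""
    first_seen, last_seen = {}, {}
    for row in rows:
        first_seen.setdefault(row[on], row)  # keep the earliest row per group
        last_seen[row[on]] = row             # keep overwriting: latest row wins

    result = []
    for row in rows:
        new_row = dict(row)
        for col in first_cols:
            new_row[col] = first_seen[row[on]][col]
        for col in last_cols:
            new_row[col] = last_seen[row[on]][col]
        result.append(new_row)
    return result


rows = [
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-17",
     "cis_marker": 100, "something": 123},
    {"admission_date": "2023-02-17", "discharge_date": "2023-02-19",
     "cis_marker": 100, "something": 456},
    {"admission_date": "2023-02-19", "discharge_date": "2023-02-20",
     "cis_marker": 100, "something": 789},
]

out = replace_with_first_last(
    rows, on="cis_marker",
    first_cols=["something"], last_cols=["discharge_date"],
)
```

Unlike min and max, the result depends on row order, so the input would need to be sorted beforehand for it to be meaningful.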

Note that the above assumes that the column to be merged on (i.e. preprocess.on) is the same throughout. In principle one may want to have multiple preprocessing steps with different on columns, but the use case for this is not clear.

simonrnss commented 7 months ago

One possible generalisation: in this data we do have the handy episode_within_cis marker, but in general we may not. So could we have the capability to filter more generally (e.g. on the minimum date)? It might get tricky, as there would be some edge cases around how to handle ties...

If it's just equality filtering then that's fine: the onus would be on the user to populate a column that allows equality filtering prior to processing.

yongrenjie commented 7 months ago

Cheers @simonrnss. Hmm, this is interesting. There's a fair bit of information to encode here. Maybe we could preprocess to retain only the entry with the smallest value of admission_date among all entries with the same value of cis_marker.

Maybe for the original case (where we overwrite admission_date with its minimum and discharge_date with its maximum) we could write

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "replace_with_min": ["admission_date"],
        "replace_with_max": ["discharge_date"]
    },
    ...
}

and for this case (where we want to only keep the first row of each stay) we can do

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}

and in code this would be

1. group by cis_marker
2. sort each group by admission_date and then discharge_date (not sure what happens with further tiebreaks, but maybe at this point it's the user's responsibility to provide a new column)
3. retain only the first row
4. ungroup
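These steps can be sketched in plain Python as follows. Rather than sorting each group, this version keeps a running minimum per group, which selects the same row. The function name retain_min and its parameters are hypothetical, and the tie rule (keep the row seen first when all sort columns are equal) is an assumption, since the thread leaves further tiebreaks open.

```python
def retain_min(rows, on, sort_cols):
    """Hypothetical helper: keep one row per `on` value, namely the row that
    is smallest when compared on `sort_cols` in order."""
    best = {}  # group value -> (sort key, row); dicts preserve insertion order
    for row in rows:
        group = row[on]
        key = tuple(row[col] for col in sort_cols)
        # Strict '<' means that on a full tie the earlier row is kept.
        if group not in best or key < best[group][0]:
            best[group] = (key, row)
    return [row for _, row in best.values()]


rows = [
    {"admission_date": "2023-02-17", "discharge_date": "2023-02-19",
     "cis_marker": 100, "something": 456},
    {"admission_date": "2023-02-14", "discharge_date": "2023-02-17",
     "cis_marker": 100, "something": 123},
    # Second stay: admission dates tie, so discharge_date decides.
    {"admission_date": "2023-03-01", "discharge_date": "2023-03-05",
     "cis_marker": 200, "something": 111},
    {"admission_date": "2023-03-01", "discharge_date": "2023-03-02",
     "cis_marker": 200, "something": 222},
]

kept = retain_min(rows, on="cis_marker",
                  sort_cols=["admission_date", "discharge_date"])
```

Comparing tuples gives the "admission_date first, then discharge_date" ordering directly, because Python compares tuples element by element.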

helendduncan commented 7 months ago

With the SMR04 data, sorting can be done without the episode_within_cis column by using the second spec from the example above:

{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}

Here the ordering is dictated primarily by admission date, with discharge date used to break ties in the former.

Any other preprocessing will be the responsibility of the user.

alan-turing-institute / eider

Date merging #45
