Closed yongrenjie closed 7 months ago
One possible generalisation -- in this data we do have the handy episode_within_cis marker, but in general we may not. So, could we have the capability to filter more generally (e.g. by min date)? It might get tricky, as there would be some edge cases around how to handle ties...
If it's just equality filtering then that's fine -- the onus would be on the user to populate a column that allows equality filtering prior to processing.
Cheers @simonrnss. Hmm this is interesting. There's a fair bit of information to encode here. Maybe we could preprocess to retain only
Maybe for the original case (where we overwrite each row's values with the min admission date and max discharge date) we could have
{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "replace_with_min": ["admission_date"],
        "replace_with_max": ["discharge_date"]
    },
    ...
}
and for this case (where we want to only keep the first row of each stay) we can do
{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}
and in code this would be
With the SMR04 data, sorting can occur without the episode_in_cis column by using the second spec from the example above:
{
    "transformation_type": "...",
    "output_feature_name": "...",
    "preprocess": {
        "on": "cis_marker",
        "retain_min": ["admission_date", "discharge_date"]
    },
    ...
}
Here rows are ordered primarily by admission date, with discharge date used as the tie-breaker when admission dates are equal.
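As a rough illustration of the ordering described above, here is a minimal Python sketch of what a retain_min step could do. Nothing here is the library's actual implementation: the function name `retain_min`, the `row_id` column and the sample dates are hypothetical, and keeping the earlier input row when every listed column ties is an assumption about how ties would be resolved.

```python
from datetime import date


def retain_min(rows, on, columns):
    """Keep one row per `on` group: the smallest when compared on `columns` in order.

    Hypothetical helper; not part of any existing library.
    """
    best = {}
    for row in rows:
        group = row[on]
        # Tuples compare element by element, so later columns only break ties.
        sort_key = tuple(row[col] for col in columns)
        # Strict "<" keeps the earlier input row if all columns tie (assumption).
        if group not in best or sort_key < best[group][0]:
            best[group] = (sort_key, row)
    return [row for _, row in best.values()]


# Made-up sample data; row_id is only there to make the result easy to read.
rows = [
    {"row_id": 1, "cis_marker": "A", "admission_date": date(2021, 5, 1), "discharge_date": date(2021, 5, 7)},
    {"row_id": 2, "cis_marker": "A", "admission_date": date(2021, 5, 1), "discharge_date": date(2021, 5, 3)},
    {"row_id": 3, "cis_marker": "B", "admission_date": date(2021, 6, 2), "discharge_date": date(2021, 6, 9)},
]

first_rows = retain_min(rows, on="cis_marker", columns=["admission_date", "discharge_date"])
for row in first_rows:
    print(row["row_id"], row["cis_marker"], row["admission_date"], row["discharge_date"])
```

In stay A both rows share an admission date, so the discharge date decides which row is kept.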
Any other pre-processing will be the responsibility of the user.
Not yet implemented, but this is a rough draft of what we have in mind.
The feature JSON shall be extended with a preprocess entry:
The meaning of this is that the value in the admission_date column will be replaced with the minimal (i.e. earliest) value of admission_date in all rows with the same value of cis_marker. (Likewise with discharge_date but using the maximal, i.e. latest, value instead.)
The library shall then preprocess the input table as such before running the remainder of the transformation. Here is an example of the preprocessing step:
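To make the intended behaviour concrete, here is a small Python sketch of the min/max replacement on a toy table. It is only an illustration under stated assumptions: the function name `broadcast_min_max` and the sample rows are made up, the parameters mirror the preprocess.on, preprocess.min and preprocess.max entries, and plain dictionaries stand in for whatever table type the library uses.

```python
from datetime import date


def broadcast_min_max(rows, on, min_col, max_col):
    """Return copies of `rows` where, within each `on` group, `min_col` holds the
    group's smallest value and `max_col` holds the group's largest value.

    Hypothetical helper; not part of any existing library.
    """
    earliest, latest = {}, {}
    for row in rows:
        group = row[on]
        earliest[group] = min(earliest.get(group, row[min_col]), row[min_col])
        latest[group] = max(latest.get(group, row[max_col]), row[max_col])
    # Build new dictionaries so the input table is left untouched.
    return [
        {**row, min_col: earliest[row[on]], max_col: latest[row[on]]}
        for row in rows
    ]


# Made-up sample data: stay 1 has two episodes, stay 2 has one.
rows = [
    {"cis_marker": 1, "episode_within_cis": 1, "admission_date": date(2020, 1, 1), "discharge_date": date(2020, 1, 5)},
    {"cis_marker": 1, "episode_within_cis": 2, "admission_date": date(2020, 1, 5), "discharge_date": date(2020, 1, 9)},
    {"cis_marker": 2, "episode_within_cis": 1, "admission_date": date(2020, 3, 2), "discharge_date": date(2020, 3, 4)},
]

processed = broadcast_min_max(
    rows, on="cis_marker", min_col="admission_date", max_col="discharge_date"
)
for row in processed:
    print(row["cis_marker"], row["episode_within_cis"], row["admission_date"], row["discharge_date"])
```

After this step every episode of stay 1 carries the dates of the whole stay, so a date filter applied to any one row acts on the stay rather than on the individual episode.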
At this point if we want to:
we can filter and then perform NUNIQUE on cis_marker
we can filter for episode_within_cis == 1 and then perform the desired transformation
Generalisations
It should be straightforward to apply preprocess.min and preprocess.max to multiple columns.
We could potentially have entries such as preprocess.first and preprocess.last, which would replace the values in the given column with the value of the first (or, respectively, last) row having the same cis_marker.
Note that the above assumes that the column to be merged on (i.e. preprocess.on) is the same throughout. In principle one may want to have multiple preprocessing steps with different on columns, but the use case for this is not clear.
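For what it's worth, the first/last idea could look something like the Python sketch below. This is speculative: the function name `broadcast_first_last` and the `admission_type` and `discharge_type` columns are invented for illustration, and "first" and "last" are taken to mean input row order within each group, which the proposal does not specify.

```python
def broadcast_first_last(rows, on, first_col=None, last_col=None):
    """Return copies of `rows` where `first_col` takes the value from the first row
    of each `on` group and `last_col` takes the value from the last row.

    Hypothetical helper; "first" and "last" follow input order (assumption).
    """
    first_seen, last_seen = {}, {}
    for row in rows:
        group = row[on]
        first_seen.setdefault(group, row)  # only the first row per group is stored
        last_seen[group] = row             # overwritten until the last row remains
    result = []
    for row in rows:
        new_row = dict(row)
        if first_col is not None:
            new_row[first_col] = first_seen[row[on]][first_col]
        if last_col is not None:
            new_row[last_col] = last_seen[row[on]][last_col]
        result.append(new_row)
    return result


# Made-up sample data with hypothetical column names.
rows = [
    {"cis_marker": 1, "admission_type": "emergency", "discharge_type": "transfer"},
    {"cis_marker": 1, "admission_type": "transfer", "discharge_type": "home"},
    {"cis_marker": 2, "admission_type": "routine", "discharge_type": "home"},
]

merged = broadcast_first_last(
    rows, on="cis_marker", first_col="admission_type", last_col="discharge_type"
)
for row in merged:
    print(row["cis_marker"], row["admission_type"], row["discharge_type"])
```

Because the result depends on row order, a real implementation would presumably need the table to be sorted (or a sort column to be named) before this step.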