We assume that all input tables have been sufficiently 'cleaned up' i.e. data types are appropriate, etc. The cleaning of raw data is out of scope for this issue.
Features are specified as JSON in a declarative manner, i.e. the specification contains all the information required to calculate the value of a feature, such as (but not limited to) the source table, column name, lookback period, and the type of transformation to be performed.
More human-readable alternatives to JSON could be considered.
One thing to consider is whether each feature should be its own file, or whether they should be combined into one file. So far the discussion has centred on the latter, but I could see the former being of use.
We aim to make the JSON schema as complete as possible to minimise the need for changes further down the line. We can draw on: (1) the example JSON file Simon provided; and (2) the spreadsheet containing the list of features.
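As a strawman, a single feature entry might look something like the following. All field names here are illustrative only, not a proposal for the final schema (which should be reconciled with Simon's example file and the feature spreadsheet):

```json
{
  "feature_column_name": "n_emergency_admissions_1y",
  "source_table": "admissions",
  "transformation_type": "count",
  "lookback_days": 365,
  "conditions": [
    { "column": "admission_type", "value": "emergency" }
  ]
}
```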
Each feature should be associated with some kind of 'transformation_type', which could be, for example: 'count' to count the number of occurrences of some event, 'sum' to add up all values, and so on.
The feature must also contain a way to specify a subset of rows/columns over which these transformations are to be performed, which typically looks like 'all rows containing value V in column C'.
This must generalise to multiple values of V and C, and allow combining conditions with OR and AND statements (i.e. contains V1 in C1 OR V2 in C2, vs. contains V1 in C1 AND V2 in C2). Negation (i.e. does NOT contain V1 in C1) is probably also worth including.
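One possible (hypothetical) way to encode arbitrary AND/OR/NOT combinations is a nested boolean structure, where `"and"` and `"or"` take lists of sub-conditions and `"not"` takes a single one:

```json
{
  "or": [
    { "column": "C1", "value": "V1" },
    {
      "and": [
        { "column": "C2", "value": "V2" },
        { "not": { "column": "C3", "value": "V3" } }
      ]
    }
  ]
}
```

Nesting makes the grammar fully general, at the cost of some verbosity; a flat list with an implicit AND could be offered as a shorthand for the common case.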
The JSON files will be parsed into an appropriate R data structure to be determined.
Each feature is to be one object, which we tentatively call an RFeature (the equivalent in Python would almost certainly be a class).
This object should probably contain in its attributes entirely the same information as the individual features in the JSON file.
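A minimal sketch of the parsing step, assuming `jsonlite` and a single file containing a list of features (the constructor name and the S3-class approach are assumptions, not decisions):

```r
library(jsonlite)

# Hypothetical constructor: wrap one parsed feature in an S3 class,
# keeping all fields from the JSON as-is
new_rfeature <- function(parsed) {
  structure(parsed, class = "RFeature")
}

# Read the feature file without simplification, so each feature
# stays a named list rather than being collapsed into a data frame
parsed_features <- fromJSON("features.json", simplifyDataFrame = FALSE)
features <- lapply(parsed_features, new_rfeature)
```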
The library will have a family of functions, one per transformation_type, which take the raw data and the RFeature and produce a table containing the ids and the calculated feature for each id. We also generalise over the name of the column containing patient IDs. For example:
```r
library(dplyr)

feature_count <- function(input_data, id_column_name, count_feature) {
  # Get the requisite info from the feature object
  feature_name <- count_feature$feature_column_name
  conditions <- count_feature$conditions
  # ...
  # Then perform the transformation. This would be
  # appropriate for a 'counting' transformation.
  # (Assumes `conditions` is a list of quoted filter expressions,
  # spliced in with !!!; the exact representation is still open.)
  input_data %>%
    filter(!!!conditions) %>%
    group_by(.data[[id_column_name]]) %>%
    summarise("{feature_name}" := n())
}
```
However, at this point we do not yet know the full structure of the RFeature object, so writing the full function above depends on steps 1 through 3 being completed. To get around this bottleneck, we can start by writing functions that take the requisite data as function parameters instead:
```r
feature_count2 <- function(input_data, id_column_name, feature_name, conditions, ...) {
  # Perform the transformation directly using the function parameters
  input_data %>%
    filter(!!!conditions) %>%
    group_by(.data[[id_column_name]]) %>%
    summarise("{feature_name}" := n())
}
```
The idea is that refactoring feature_count2 into feature_count should be fairly straightforward once we have an idea of what the data structure looks like.
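For instance, the refactor could amount to `feature_count` becoming a thin wrapper that unpacks the RFeature and delegates to `feature_count2` (a sketch only; the field names depend on the final schema):

```r
feature_count <- function(input_data, id_column_name, count_feature) {
  feature_count2(
    input_data,
    id_column_name,
    feature_name = count_feature$feature_column_name,
    conditions   = count_feature$conditions
  )
}
```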
Finally, the library must provide a way to join the resulting feature tables. It must also take care not to read in the same data source multiple times. (So it is likely that reading in the tables will be handled in a separate function altogether.) However, this is probably a discussion for a later date.
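If each per-feature function returns a table of IDs plus one feature column, the join could be as simple as a fold over full joins (a sketch assuming `dplyr` and `purrr`; the function name is hypothetical):

```r
library(dplyr)
library(purrr)

# feature_tables: a list of tibbles, each containing the ID column
# plus one calculated feature column
combine_features <- function(feature_tables, id_column_name) {
  reduce(feature_tables, full_join, by = id_column_name)
}
```

A full join keeps IDs that appear in some feature tables but not others; whether missing features should then be NA or filled with a default (e.g. 0 for counts) is worth deciding explicitly.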
@helendduncan to work on the spec (steps 1-2)
@yongrenjie to work on the R functions (step 4)