Support non-matching Key values and Join Type in Match Datasets

ASL-rmarshall commented 6 months ago

Two changes are implemented to support the joining of datasets by non-matching keys:

The Keys attribute of Match Datasets now accepts a list of either single variable variable names or pairs of variable names specified using "Left" and "Right" parameters.
- The variable name for the "Left" parameter is a variable from the primary rule dataset (the "left" dataset in the join)
- The variable name for the "Right" parameter is a variable from the "other" dataset being join, whose name is specified in the Name attribute (the "right" dataset in the join).
Keys values containing only a single variable name are implemented as before - i.e., as the name of the key variable in both datasets. The list of Keys may contain a mix of single variable names and Left/Right variable names.
A new, optional parameter, Join Type, is supported for Match Dataset for "standard" joins (i.e., joins that do not involved either RELREC or a relationship dataset). Currently, only two values are allowed:
- "inner" (the default): the same as previous functionality, where the combined dataset only contains rows where key values match in both datasets.
- "left": performs a "left" join, keeping all records from the primary rule dataset, appending only the rows from the "other" dataset that have matching key values.

Example use: merge the StudyDesign dataset with the STUDYEPOCH dataset on StudyDesign.id = STUDYEPOCH.parent_id, keeping all records from the StudyDesign dataset:

Match Datasets:
  - Join Type: left
    Keys:
      - Left: id
        Right: parent_id
    Name: STUDYEPOCH
...
Scope:
  Entities:
    Include:
      - StudyDesign

Code Changes:

Added join_types.py containing a specification of a new JoinTypes enum to constrain the supported join types.
Updated parse_datasets in rule.py to:
- Convert Left and Right key names to lower case when populating rule.datasets.match_key.
- Conditionally assign the value of Join Type to rule.datasets.join_type.
Updated utils.py to add a new method:
- get_sided_match_keys: returns a list of match keys which contains either any single variable name passed as a string, or the value of the specified "left"/"right" dictionary attribute.
Updated dataset_preprocessor.py to import the new enum and methods, and to update _merge_datasets to:
- Use the new get_sided_match_keys method in the population of left_dataset_match_keys and right_dataset_match_keys.
- Pass join_type into merge_sdtm_datasets (only) as a new argument, defaulting to "inner" if join_type was not included in rule.datasets.
Updated data_processor.py to import the new enum and update merge_sdtm_datasets to:
- Accept join_type as an argument and use it to:
  - Populate the how argument of the pandas merge method to specify how the datasets are merged
  - Populate the indicator argument of the pandas merge method to determine whether a merge indicator column should be included in the merge dataset (it's only included if join_type is not "inner"). When included, the merge indicator column is named "merge" (e.g., "_merge_AE") because a unique name is required for each merge.
- If join_type is "left" and the resultant dataset contains rows that were only present in the "left" dataset (i.e., "left_only" appears as a value in the "_merge" (merge indicator) column), replace NaN with None in:
  - Cells in rows where "_merge" = "left_only" and any non-numeric column (except "_merge") that appears in the result dataset but not the parent dataset.
  - Any numeric column.
  This is to prevent errors when converting merged dataset contents to JSON and to allow missing values to be correctly interpreted by the check operators such as empty.

Schema Changes:

The CORE-base.json schema file is updated to:

Add a new LeftRightKeys definition, include Left and Right properties which reference VariableName and are both required.
Change Keys to allow either VariableName or LeftRightKeys.

Unit Test Changes:

The following unit test files are updated:

tests/unit/test_rule.py: added parameter for test_parse_datasets method.
tests/unit/test_utilities/test_data_processor.py: added test_merge_datasets_on_left_join method.
tests/unit/test_dataset_preprocessor.py: added test_preprocess_left_join method.

gerrycampion commented 6 months ago

Instead of parsing free text, I think it would be better for keys to look like: Keys:

left: id right: parent_id
...

This means keys can be a list of strings (for matching key names) or a list of objects for joins like this.

ASL-rmarshall commented 6 months ago

Instead of parsing free text, I think it would be better for keys to look like: Keys:

left: id right: parent_id

...

This means keys can be a list of strings (for matching key names) or a list of objects for joins like this.

@gerrycampion I've implemented this change - thanks. I implemented it so that the list can contain either strings or dicts, or both, for example:

    Keys:
      - STUDYID
      - left: USUBJID
        right: RSUBJID

would match on left.STUDYID = right.STUDYID and left.USUBJID = right.RSUBJID

ASL-rmarshall commented 6 months ago

@gerrycampion I have a few questions, both about the way I've implemented this and about what's needed to complete the pull request:

Is it OK for the left and right parameters/keys in Keys to be in lower case - or should the be Left and Right to align with text casing in other Match Datasets parameters? At the moment they get passed directly into rule.dataset.match_key without any modification or validation. As other Match Datasets parameters are converted to lower case when passed into rule.dataset.match_key, Left and Right would probably also need to be converted to lower case for consistency.

Agreed that it would be better to maintain consistency, so left and right will be changed to Left and Right.

The expected values for Join Key are currently "inner" or "left" in lower case. Is this OK, or would upper case values better to align with (some of the) other enum values used in the rules schema? The lower case values are passed straight into the how parameter of the pandas merge method, so changing to upper case would require additional processing.

As these values are being passed straight into the merge method, it is OK to leave them in lower case.

What other changes are needed to complete this pull request? I'm assuming that the following changes would also be needed:

Update resources\schema\CORE-base.json to modify the requirements for Keys and add the requirements for Join Type

Yes - the schema should be updated.

Update the documentation for Match Datasets to include the updated/new specification for Keys and Join Type? Or would this be a separate PR in the conformance-rules-editor repo.

Yes - the Match Datasets documentation should be updated. This will be a separate PR for the conformance-rules-editor repo, with the new PR also linked to the same issue. We agree to update the documentation in place instead of moving it into the cdisc-rules-engine repo (as this would still require an update to the MD file in the conformance-rules-editor repo)

Add unit tests. I've already drafted an update to test_rule.py to add a new parameter (with non-matching keys and join type) for test_parse_datasets, but I'm a bit unsure about what else would be needed. For example

Should there be a separate test for the new get_sided_match_keys method?

What would you recommend for testing left join functionality?

There are no specific guidelines - tests should be adequate to demonstrate correct functionality.

Are these assumptions correct and is there anything else?

Assumptions are correct - there is nothing else that needs to be updated.

cdisc-org / cdisc-rules-engine