cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License
46 stars 12 forks source link

Support non-matching Key values and Join Type in Match Datasets #621

Closed ASL-rmarshall closed 5 months ago

ASL-rmarshall commented 6 months ago

Two changes are implemented to support the joining of datasets by non-matching keys:

  1. The Keys attribute of Match Datasets now accepts a list of either single variable variable names or pairs of variable names specified using "Left" and "Right" parameters.

    • The variable name for the "Left" parameter is a variable from the primary rule dataset (the "left" dataset in the join)
    • The variable name for the "Right" parameter is a variable from the "other" dataset being join, whose name is specified in the Name attribute (the "right" dataset in the join).

    Keys values containing only a single variable name are implemented as before - i.e., as the name of the key variable in both datasets. The list of Keys may contain a mix of single variable names and Left/Right variable names.

  2. A new, optional parameter, Join Type, is supported for Match Dataset for "standard" joins (i.e., joins that do not involved either RELREC or a relationship dataset). Currently, only two values are allowed:
    • "inner" (the default): the same as previous functionality, where the combined dataset only contains rows where key values match in both datasets.
    • "left": performs a "left" join, keeping all records from the primary rule dataset, appending only the rows from the "other" dataset that have matching key values.

Example use: merge the StudyDesign dataset with the STUDYEPOCH dataset on StudyDesign.id = STUDYEPOCH.parent_id, keeping all records from the StudyDesign dataset:

Match Datasets:
  - Join Type: left
    Keys:
      - Left: id
        Right: parent_id
    Name: STUDYEPOCH
...
Scope:
  Entities:
    Include:
      - StudyDesign

Code Changes:

Schema Changes:

The CORE-base.json schema file is updated to:

Unit Test Changes:

The following unit test files are updated:

gerrycampion commented 6 months ago

Instead of parsing free text, I think it would be better for keys to look like: Keys:

This means keys can be a list of strings (for matching key names) or a list of objects for joins like this.

ASL-rmarshall commented 6 months ago

Instead of parsing free text, I think it would be better for keys to look like: Keys:

  • left: id right: parent_id
  • ...

This means keys can be a list of strings (for matching key names) or a list of objects for joins like this.

@gerrycampion I've implemented this change - thanks. I implemented it so that the list can contain either strings or dicts, or both, for example:

    Keys:
      - STUDYID
      - left: USUBJID
        right: RSUBJID

would match on left.STUDYID = right.STUDYID and left.USUBJID = right.RSUBJID

ASL-rmarshall commented 6 months ago

@gerrycampion I have a few questions, both about the way I've implemented this and about what's needed to complete the pull request:

  • Is it OK for the left and right parameters/keys in Keys to be in lower case - or should the be Left and Right to align with text casing in other Match Datasets parameters? At the moment they get passed directly into rule.dataset.match_key without any modification or validation. As other Match Datasets parameters are converted to lower case when passed into rule.dataset.match_key, Left and Right would probably also need to be converted to lower case for consistency.

Agreed that it would be better to maintain consistency, so left and right will be changed to Left and Right.

  • The expected values for Join Key are currently "inner" or "left" in lower case. Is this OK, or would upper case values better to align with (some of the) other enum values used in the rules schema? The lower case values are passed straight into the how parameter of the pandas merge method, so changing to upper case would require additional processing.

As these values are being passed straight into the merge method, it is OK to leave them in lower case.

  • What other changes are needed to complete this pull request? I'm assuming that the following changes would also be needed:
    • Update resources\schema\CORE-base.json to modify the requirements for Keys and add the requirements for Join Type

Yes - the schema should be updated.

  • Update the documentation for Match Datasets to include the updated/new specification for Keys and Join Type? Or would this be a separate PR in the conformance-rules-editor repo.

Yes - the Match Datasets documentation should be updated. This will be a separate PR for the conformance-rules-editor repo, with the new PR also linked to the same issue. We agree to update the documentation in place instead of moving it into the cdisc-rules-engine repo (as this would still require an update to the MD file in the conformance-rules-editor repo)

  • Add unit tests. I've already drafted an update to test_rule.py to add a new parameter (with non-matching keys and join type) for test_parse_datasets, but I'm a bit unsure about what else would be needed. For example
    • Should there be a separate test for the new get_sided_match_keys method?
    • What would you recommend for testing left join functionality?

There are no specific guidelines - tests should be adequate to demonstrate correct functionality.

Are these assumptions correct and is there anything else?

Assumptions are correct - there is nothing else that needs to be updated.