Closed ASL-rmarshall closed 5 months ago
Instead of parsing free text, I think it would be better for keys to look like:
    Keys:
      - left: id
        right: parent_id
      - ...
This means keys can be a list of strings (for matching key names) or a list of objects for joins like this.
@gerrycampion I've implemented this change - thanks. I implemented it so that the list can contain either strings or dicts, or both. For example:
    Keys:
      - STUDYID
      - left: USUBJID
        right: RSUBJID
would match on left.STUDYID = right.STUDYID and left.USUBJID = right.RSUBJID.
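The matching rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `keys_for_side` is a hypothetical helper name, and the sketch is not necessarily how the engine resolves the keys internally.

```python
from typing import Dict, List, Union

# A key is either a plain variable name or a {"left": ..., "right": ...} dict.
Key = Union[str, Dict[str, str]]


def keys_for_side(keys: List[Key], side: str) -> List[str]:
    """Return the key variable names that apply to one side of the join.

    A plain string is used unchanged for both sides; a dict contributes
    the value stored under `side` ("left" or "right").
    """
    return [key if isinstance(key, str) else key[side] for key in keys]


keys = ["STUDYID", {"left": "USUBJID", "right": "RSUBJID"}]
left_on = keys_for_side(keys, "left")
right_on = keys_for_side(keys, "right")
```

With the example list, `left_on` pairs positionally with `right_on`, which gives the two equality conditions quoted above.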
@gerrycampion I have a few questions, both about the way I've implemented this and about what's needed to complete the pull request:
- Is it OK for the `left` and `right` parameters/keys in `Keys` to be in lower case - or should they be `Left` and `Right` to align with text casing in other `Match Datasets` parameters? At the moment they get passed directly into `rule.dataset.match_key` without any modification or validation. As other `Match Datasets` parameters are converted to lower case when passed into `rule.dataset.match_key`, `Left` and `Right` would probably also need to be converted to lower case for consistency.
Agreed that it would be better to maintain consistency, so `left` and `right` will be changed to `Left` and `Right`.
- The expected values for `Join Type` are currently "inner" or "left" in lower case. Is this OK, or would upper case values be better to align with (some of the) other enum values used in the rules schema? The lower case values are passed straight into the `how` parameter of the `pandas` `merge` method, so changing to upper case would require additional processing.
As these values are being passed straight into the `merge` method, it is OK to leave them in lower case.
- What other changes are needed to complete this pull request? I'm assuming that the following changes would also be needed:
  - Update `resources\schema\CORE-base.json` to modify the requirements for `Keys` and add the requirements for `Join Type`.
Yes - the schema should be updated.
  - Update the documentation for `Match Datasets` to include the updated/new specification for `Keys` and `Join Type`? Or would this be a separate PR in the `conformance-rules-editor` repo?
Yes - the `Match Datasets` documentation should be updated. This will be a separate PR for the `conformance-rules-editor` repo, with the new PR also linked to the same issue. We agree to update the documentation in place instead of moving it into the `cdisc-rules-engine` repo (as this would still require an update to the MD file in the `conformance-rules-editor` repo).
  - Add unit tests. I've already drafted an update to `test_rule.py` to add a new parameter (with non-matching keys and join type) for `test_parse_datasets`, but I'm a bit unsure about what else would be needed. For example:
    - Should there be a separate test for the new `get_sided_match_keys` method?
    - What would you recommend for testing left join functionality?
There are no specific guidelines - tests should be adequate to demonstrate correct functionality.
Are these assumptions correct and is there anything else?
Assumptions are correct - there is nothing else that needs to be updated.
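Because the `Join Type` values are handed straight to the `how` parameter, one way to constrain them without any case conversion is an enum whose values are the lower-case strings themselves. The following is a minimal sketch under assumptions: the member names and the `resolve_join_type` helper are hypothetical and are not taken from the pull request.

```python
from enum import Enum
from typing import Optional


class JoinTypes(Enum):
    # Assumed members; the values are the lower-case strings that
    # pandas accepts for the `how` argument of `merge`.
    INNER = "inner"
    LEFT = "left"


def resolve_join_type(raw: Optional[str]) -> str:
    """Validate a rule's join type and return the string to use as `how`.

    A missing value falls back to an inner join; an unsupported value
    raises ValueError because it is not a member of the enum.
    """
    if raw is None:
        return JoinTypes.INNER.value
    return JoinTypes(raw).value
```

Keeping the enum values identical to the strings pandas expects means no mapping step is needed between the rule definition and the merge call.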
Two changes are implemented to support the joining of datasets by non-matching keys:
- The `Keys` attribute of `Match Datasets` now accepts a list of either single variable names or pairs of variable names specified using "Left" and "Right" parameters. `Left` names the key variable in the "left" dataset and `Right` names the key variable in the dataset given by the `Name` attribute (the "right" dataset in the join). `Keys` values containing only a single variable name are implemented as before - i.e., as the name of the key variable in both datasets. The list of `Keys` may contain a mix of single variable names and `Left`/`Right` variable names.
- A new attribute, `Join Type`, is supported for `Match Datasets` for "standard" joins (i.e., joins that do not involve either RELREC or a relationship dataset). Currently, only two values are allowed: "inner" and "left".
Example use: merge the StudyDesign dataset with the STUDYEPOCH dataset on `StudyDesign.id = STUDYEPOCH.parent_id`, keeping all records from the StudyDesign dataset (i.e., a "left" join).
Code Changes:
- Added `join_types.py`, containing a specification of a new `JoinTypes` enum to constrain the supported join types.
- Updated `parse_datasets` in `rule.py` to:
  - Convert `Left` and `Right` key names to lower case when populating `rule.datasets.match_key`.
  - Pass `Join Type` to `rule.datasets.join_type`.
- Updated `utils.py` to add a new method:
  - `get_sided_match_keys`: returns a list of match keys in which each entry is either a single variable name passed as a string or the value of the specified "left"/"right" dictionary attribute.
- Updated `dataset_preprocessor.py` to import the new enum and methods, and to update `_merge_datasets` to:
  - Use the `get_sided_match_keys` method in the population of `left_dataset_match_keys` and `right_dataset_match_keys`.
  - Pass `join_type` into `merge_sdtm_datasets` (only) as a new argument, defaulting to "inner" if `join_type` was not included in `rule.datasets`.
- Updated `data_processor.py` to import the new enum and update `merge_sdtm_datasets` to:
  - Accept `join_type` as an argument and use it to:
    - Set the `how` argument of the `pandas` `merge` method to specify how the datasets are merged.
    - Set the `indicator` argument of the `pandas` `merge` method to determine whether a merge indicator column should be included in the merged dataset (it's only included if `join_type` is not "inner"). When included, the merge indicator column is named "_merge".
  - If `join_type` is "left" and the resultant dataset contains rows that were only present in the "left" dataset (i.e., "left_only" appears as a value in the "_merge" (merge indicator) column), replace `NaN` with `None` in: [...]
  This is to prevent errors when converting merged dataset contents to JSON and to allow missing values to be correctly interpreted by the check operators such as `empty`.
Schema Changes:
The CORE-base.json schema file is updated to:
- Add a `LeftRightKeys` definition, including `Left` and `Right` properties which reference `VariableName` and are both required.
- Update `Keys` to allow either `VariableName` or `LeftRightKeys`.
Unit Test Changes:
The following unit test files are updated:
- `tests/unit/test_rule.py`: added parameter for `test_parse_datasets` method.
- `tests/unit/test_utilities/test_data_processor.py`: added `test_merge_datasets_on_left_join` method.
- `tests/unit/test_dataset_preprocessor.py`: added `test_preprocess_left_join` method.
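The left-join behaviour described under Code Changes can be illustrated with a small standalone pandas sketch. It is not the engine's `merge_sdtm_datasets`: the function name `merge_with_join_type`, the sample data, and the choice to replace `NaN` only in columns that come from the right dataset are all assumptions made for the example.

```python
import pandas as pd


def merge_with_join_type(left, right, left_keys, right_keys, join_type="inner"):
    """Merge two DataFrames on differently named keys (illustrative only)."""
    merged = left.merge(
        right,
        how=join_type,
        left_on=left_keys,
        right_on=right_keys,
        # Add the "_merge" indicator column only for non-inner joins.
        indicator=join_type != "inner",
    )
    if join_type == "left" and (merged["_merge"] == "left_only").any():
        # Unmatched left rows hold NaN in the right-hand columns; swap in
        # None so the values serialise to JSON null. Columns renamed by
        # pandas suffixes are skipped in this sketch.
        for column in right.columns:
            if column in merged.columns:
                as_object = merged[column].astype(object)
                merged[column] = as_object.where(merged[column].notna(), None)
    return merged


# Hypothetical sample data: one StudyDesign record has no matching epoch.
study_design = pd.DataFrame({"id": ["SD1", "SD2"], "name": ["Design 1", "Design 2"]})
study_epoch = pd.DataFrame({"parent_id": ["SD1"], "epoch": ["SCREENING"]})

left_joined = merge_with_join_type(study_design, study_epoch, ["id"], ["parent_id"], "left")
inner_joined = merge_with_join_type(study_design, study_epoch, ["id"], ["parent_id"])
```

Here `left_joined` keeps both StudyDesign rows, with the unmatched row marked "left_only" in the indicator column, while `inner_joined` keeps only the matching row and has no indicator column.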