cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License
46 stars 12 forks source link

Rule blocked: CORERULES-9359 - check for no value by group #723

Closed ASL-rmarshall closed 1 week ago

ASL-rmarshall commented 3 months ago

Links to related JIRA Tickets

Rule Information

Describe the problem Background: The intention is to report a single error for each ScheduleTimeline (i.e., each record from the ScheduleTimeline dataset) where the timelineExitId column does not contain a value for any of the records in the ScheduledActivityInstance dataset that represent "children" of the ScheduleTimeline (i.e., where ScheduledActivityInstance.parent_id = ScheduleTimeline.id). The distinct operation can be used to generate a unique list of ScheduledActivityInstance.timelineExitId values grouped by parent_id, and the new functionality proposed for #706 allows the parent_id group key to be renamed to id so that the resultant lists can be merged onto the ScheduleTimeline dataset.

Problem: The problem occurs when trying to test if the result of the distinct operation contains no values. If there is no value in the timelineExitId column for any of the records in a group, the distinct operation returns a set containing None (i.e., {None}) and the empty operator does not recognize this as being empty (because it only looks for an empty string or None).

Describe the solution There are several ways in which this could be solved:

Proposed rule logic (based on updating the empty operator). See DDF00037 rule in the dev environment.

Check:
  all:
   - name: $timeline_exits
     operator: empty
Operations:
  - domain: SCHEDULEDACTIVITYINSTANCE
    group:
      - parent_id
    group_aliases:
      - id
    id: $timeline_exits
    name: timelineExitId
    operator: distinct
Scope:
  Entities:
    Include:
      - ScheduleTimeline
Sensitivity: Record

Note that for a full implementation that would work for converted USDM JSON data (i.e., running the rules via the CLI), additional changes are needed so that domain: SCHEDULEDACTIVITYINSTANCE in the Operations section can be replaced with a correctly interpreted reference to the ScheduledActivityInstance "dataset". This is (or will be) addressed under a separate issue.

Test data files

SFJohnson24 commented 3 months ago

@ASL-rmarshall I added the changes to the empty operator and updated the test for empty() and sent this ticket to review. There is no logic in Engine that I am aware of where we use {None} as a meaningful placeholder and would want to consider it as anything but empty, but there may be an edge case outside my domain knowledge that others might consider.

ASL-rmarshall commented 3 months ago

The update resolved the issue of {None} not being recognized as empty - thanks. As mentioned above, the original intent for the rule was to rely on the updates for #706 to allow the results of the distinct operation to be merged directly into the ScheduleTimeline dataset. I tried to work around this by merging to specific records in the ScheduledActivityInstance/ScheduledDecisionInstance datasets, but ran into another issue related to #705 (a NaN resulted from merging grouped results to a group without values) and the NaN was not recognized as being empty by the empty operator. Implementing the changes for #705 should unblock this rule. However, it would be worth considering updating the empty operator to recognize NaN (np.nan) as being empty as a more generic and robust fix:

results = np.where(self.value[target].isin(["", None, {None}, np.nan]), True, False)
                                                            ^^^^^^^^