cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License
46 stars 12 forks source link

Rule blocked: CORERULES-9552 - checking values against Library CT for DDF #794

Open ASL-rmarshall opened 1 month ago

ASL-rmarshall commented 1 month ago

JSON containing the full request:

Links to related JIRA Tickets

Rule Information

Describe the bug I was looking for an existing operator that I could use to check if codes/decodes used in USDM data are valid when compared to a codelist included an a CT package extracted from the Library. Looking through the Rules Editor Reference Guide, I found the does_not_use_valid_codelist_terms operator, which looked like it might provide some of the required functionality. Although this operator is listed in the Define XML section of the Reference Guide (implying that it might only work for Define XML rule types, which wouldn't be appropriate for USDM because USDM doesn't use Define XML), I found lots of rules in the Rules Editor where this operator is used with a Rule Type of "Record Data" (e.g., CG0001, CG0555-CG0560, AD0039, SEND40, FDAB017). I therefore tried to use the operator, but it failed - none of the invalid values included in the negative test data were reported, and there was no other message indicating what this issue was.

By looking at the engine code and doing some debugging, I found that there appear to be several issues:

  1. There are no details in the Reference Guide explaining when the uses_valid_codelist_terms and does_not_use_valid_codelist_terms operators may be used and how the operator parameters should be populated.
  2. The checking functionality (done by the valid_terms method in dataframe_operators.py relies on DataframeType.codelist_term_maps which is only populated when Rule Type is "Define Item Metadata Check". These operators will therefore never work for rules with a Rule Type of "Record Data".
  3. The wording of the operator names suggests that :

    • name should be the name of a variable.
    • value should be a reference to a codelist.

    However, when the rule operator parameters are passed into the valid_terms method by the uses_valid_codelist_terms operator:

    • the name parameter ("target") is used to identify dataset column(s) and is passed in as codelist
    • the value parameter ("comparator") is used to identify dataset column(s) passed in as terms_list

    If name and value are specified, respectively, as a variable/column name and a codelist name/identifier in a "Record Data" rule, the call to valid_terms crashes (without reporting an error) because there is no column in the dataset which has the same name as the specified codelist.

    Looking at the test_does_not_use_valid_codelist_terms method in test_codelist_checks.py, it appears that the expectation might be that the name and value attributes used with these operators should be derived using operations that extract information from Define XML. If this is the case, the Reference Guide should be updated to indicate which operations should be used (with examples).

  4. The format of the CT cache files does not support proper checking for valid values:
    • It looks like codelists are indexed by C-code, not codelist short name. This means that any check that specifies a codelist by short name ("submission value" for the codelist), will not work.
    • The allowed_terms attribute for each codelist contains terms that are not valid - it contains submission values (which are valid), preferred terms (which are not valid as data values), and synonyms (which are also not valid as data values).
    • The C-codes for the terms in the codelist are not included, meaning that it's not possible to check for valid C-codes (which would be needed both for USDM and for TSVALCD in the SDTM TS dataset).

Error returned from Rule Engine (None reported)

Expected behavior It looks like the uses_valid_codelist_terms and does_not_use_valid_codelist_terms operators are not designed to work in the way that they're being used in lots of checks. The Reference Guide should therefore be updated so that it's clear when and how these operators should and should not be used.

There is a need for operators that can be used to check for use (or non-use) of valid CT values in a variable independently of Define XML (i.e., by referencing cached Library CT directly). These operators should be able to check both for valid "submission values" and for valid C-codes for the terms in the specified codelist.

The cached CT files need to be updated so that:

SFJohnson24 commented 1 week ago

I spoke with Gerry today re: this ticket and how engine handles CT currently. To integrate this and make it fairly seemless across different standards--it will require a bit more work than initially intended. I will be creating a few tickets for this to split it up into smaller PBIs with the goal of creating a function that can handle CT checks for both DDF and SDTM