cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License
45 stars 12 forks source link

Rule Type "Dataset Metadata Check against Define XML": define_* variables retrieved incorrectly when dataset_name not found in Define-XML document #633

Open mhungria opened 4 months ago

mhungria commented 4 months ago

define_* variables retrieved incorrectly when dataset_name not found in Define-XML document. It appears that the first available ItemGroupDef that satisfies the request conditions (if any) is used to return the metadata requested.

Also, dataset_name is changed to "None" in those cases. I assume it is a BUG and the the define_dataset_name is the one that should be returned as "None".

See test for CDISC.SDTMIG.CG0320

mhungria commented 4 months ago

See test data and Define-XML document under https://cdisc.sharepoint.com/:f:/r/sites/CORERules/Shared%20Documents/unitTesting/SDTMIG/CG0320/Rule_cg0320x_test?csf=1&web=1&e=5ZwP4m

See Results in the attached Rule.zip file: Rule.zip

SFJohnson24 commented 2 weeks ago

we discussed this ticket on 7/9/24 rule meeting and it was determined we should reopen it. Marcelina showed a case where the dataset's domain did not match the define.XML domain. The define.xml matched all conditions for the rule to produce a negative result but the dataset was skipped when a test was run. Currently engine is determining when to validate based on the dataset's domain: https://github.com/cdisc-org/cdisc-rules-engine/blob/7c3859875c77349390726391a067b07aa7590e85/scripts/run_validation.py#L79 The define should be the source of truth and should be referenced over the dataset domain in these types of validations.

SFJohnson24 commented 1 week ago

currently


    def build(self):
        """
        Returns a long dataset where each value in each row of the original dataset is
        a row in the new dataset.
        The define xml variable metadata corresponding to each row's variable is
        attached to each row.
        Columns available in the dataset are:

        dataset_size - File size
        dataset_location - Path to file
        dataset_name - Name of the dataset
        dataset_label - Label for the dataset

        Columns from Define XML:
        define_dataset_name - dataset name from define_xml
        define_dataset_label - dataset label from define
        define_dataset_location - dataset location from define
        define_dataset_class - dataset class
        define_dataset_structure - dataset structure
        define_dataset_is_non_standard - whether a dataset is a standard ``` 
There is a need for dataset_domain, define_dataset_domain for each dataset for these checks which should be included in these checks.  This should be added to this dataset builder.  The dataset builder also needs a build_split_dataset() functionality constructed for it
mhungria commented 1 week ago

I would adjust the statement “The define should be the source of truth and should be referenced over the dataset domain in these types of validations.” to “If a Define-XML is indicated and there is a conflict between Define-XML dataset metadata and the actual data’s dataset metadata, the Define-XML metadata should be referenced over the dataset metadata in these types of validations.,”

SFJohnson24 commented 1 week ago

so the acceptance criteria for this ticket is:

drewcdisc commented 1 week ago

In today's Daily Scrum we decided to table this PBI for the moment (moved back into our Sprint backlog) until we can decide if @SFJohnson24's 3rd bullet point above should be split off into its own PBI. @nickdedonder will gather the important contributors together so he can make that final call.