cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License

CLI validate no longer works for USDM JSON data #840

Open ASL-rmarshall opened 1 month ago

ASL-rmarshall commented 1 month ago

The updated engine does not convert a USDM JSON file to datasets for validation. I have tracked the issue to the following:

  1. run_validation.py (line 40): the call to get_data_service must pass args.dataset_paths so that the USDM data service is correctly assigned. dataset_paths is passed into USDMDataService.is_USDM_data, which checks the first file to determine whether it is a USDM JSON file; if it is, the USDM data service is assigned.
    data_service = DataServiceFactory(
        config,
        shared_cache,
        max_dataset_size=max_dataset_size,
        standard=standard,
        standard_version=standard_version,
        library_metadata=library_metadata,
    ).get_data_service(args.dataset_paths)
                       ^^^^^^^^^^^^^^^^^^
  2. script_utils.py get_datasets: this function does not work with the way the USDM data service currently converts a USDM JSON file's contents into separate datasets. For USDM, a single JSON file is passed in dataset_paths and the USDM data service splits it into separate datasets, whereas this version of get_datasets expects a separate entry in dataset_paths for each dataset. A more systematic redesign may be needed to handle multiple datasets contained in a single file but, in the meantime, inserting the following 2 lines at the very beginning of get_datasets (i.e., line 287) fixes the problem by using the USDM data service's own get_datasets function instead (a sketch of how fixes 1 and 2 fit together follows below):
    if data_service.standard == "usdm":
        return data_service.get_datasets()
  3. usdm_data_service.py get_raw_dataset_metadata (line 122): this method currently returns DatasetMetadata.records as a string, which causes a datatype conflict when dataset lengths are summed (a short demonstration follows below). Returning a number instead fixes the problem, i.e., change:
            records=f"{len(dataset)}",

    to:

            records=len(dataset),

Applying these suggested changes allows the USDM data service to convert a USDM JSON file to datasets for validation.
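For context, here is a rough, self-contained sketch of how fixes 1 and 2 fit together. Only the names quoted above (get_data_service, USDMDataService.is_USDM_data, data_service.standard, get_datasets) come from the engine; everything else is a hypothetical stand-in, not the engine's actual code:

    # Hedged sketch, not the engine's implementation. The stand-in class below
    # mimics only the behaviour the two fixes rely on.

    class FakeUSDMDataService:
        """Stand-in for USDMDataService: one USDM JSON file expands to many datasets."""

        standard = "usdm"

        @staticmethod
        def is_USDM_data(dataset_paths):
            # Fix 1: this check must receive the CLI's dataset_paths so it can
            # inspect the first file and detect a USDM JSON payload.
            return bool(dataset_paths) and dataset_paths[0].endswith(".json")

        def get_datasets(self):
            # The real service parses the single USDM JSON file into separate
            # datasets; placeholders are returned here for illustration.
            return [{"name": "StudyVersion"}, {"name": "StudyEpoch"}]


    def get_datasets(data_service, dataset_paths):
        # Fix 2: delegate to the USDM data service, which already knows how to
        # split the single JSON file into per-entity datasets.
        if data_service.standard == "usdm":
            return data_service.get_datasets()
        # Existing behaviour (sketched): one dataset per entry in dataset_paths.
        return [{"name": path} for path in dataset_paths]


    if __name__ == "__main__":
        paths = ["usdm_study.json"]
        if FakeUSDMDataService.is_USDM_data(paths):
            print(get_datasets(FakeUSDMDataService(), paths))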
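And a minimal demonstration of the datatype conflict behind fix 3 (the dataclass below is a stand-in, not the engine's DatasetMetadata):

    from dataclasses import dataclass

    @dataclass
    class DatasetMetadataStub:  # stand-in for DatasetMetadata
        name: str
        records: int  # fix 3: keep this numeric rather than a formatted string

    metadata = [
        DatasetMetadataStub(name="StudyVersion", records=1),
        DatasetMetadataStub(name="StudyEpoch", records=4),
    ]

    # Summing numeric record counts works as expected:
    print(sum(md.records for md in metadata))  # 5

    # With records=f"{len(dataset)}", the same sum raises:
    # TypeError: unsupported operand type(s) for +: 'int' and 'str'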

ASL-rmarshall commented 1 month ago

Note that the newly created USDMDataService.to_parquet function will not work as currently written; it also seems to assume that there will be a single USDM JSON file for each dataset, which is not the case. A rough sketch of the multi-dataset-per-file shape it would need to handle is below.
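This is an illustrative sketch only, not the engine's to_parquet. It assumes pandas with a parquet engine (e.g. pyarrow) is available, and the splitting rule is a hypothetical placeholder for whatever the USDM data service actually does when it derives datasets from one JSON document:

    import json
    import pandas as pd

    def split_usdm_into_datasets(usdm: dict) -> dict:
        # Hypothetical splitting rule for illustration only: treat every
        # top-level list of objects in the JSON document as its own dataset.
        return {
            key: value
            for key, value in usdm.items()
            if isinstance(value, list) and all(isinstance(item, dict) for item in value)
        }

    def usdm_json_to_parquet(json_path: str, output_dir: str) -> list:
        # Write one parquet file per dataset derived from a single USDM JSON file.
        with open(json_path) as f:
            usdm = json.load(f)
        written = []
        for name, records in split_usdm_into_datasets(usdm).items():
            out_path = f"{output_dir}/{name}.parquet"
            pd.DataFrame(records).to_parquet(out_path)
            written.append(out_path)
        return written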

cpshadle commented 1 month ago

The parquet part of this ticket will go to another issue.