CLI validate no longer works for USDM JSON data

ASL-rmarshall commented 1 month ago

The updated engine does not convert a USDM JSON file to datasets for validation. I have tracked the issue to the following:

run_validation.py (line 40): the call to get_data_service must pass args.dataset_paths for the USDM data service to be correctly assigned (dataset_paths is passed into the USDMDataService.is_USDM_data which checks the first file to see if it's a USDM JSON file, in which case the USDM data service is assigned).
```
data_service = DataServiceFactory(
    config,
    shared_cache,
    max_dataset_size=max_dataset_size,
    standard=standard,
    standard_version=standard_version,
    library_metadata=library_metadata,
).get_data_service(args.dataset_paths)
                   ^^^^^^^^^^^^^^^^^^
```
script_utils.py get_datasets: this does not work for the way in which the USDM data service currently converts a USDM JSON file's contents into separate datasets. For USDM, we currently pass a single JSON file in dataset_paths, and the USDM data service converts this to separate datasets. This version of get_datasets is expecting a separate entry in dataset_paths for each dataset. It may be that a more systematic redesign is needed to handle multiple datasets contained in a single file but, in the meantime, inserting the following 2 lines right at the beginning of get_datasets (i.e., line 287) will fix the problem (by using the USDM data service's get_datasets function instead):
```
if data_service.standard == "usdm":
    return data_service.get_datasets()
```
usdm_data_service.py get_raw_dataset_metadata (line 122): this method currently returns DatasetMetadata.records as a string, which causes a datatype conflict when trying to sum dataset lengths. Returning a number instead fixes the problem, i.e., change:
```
        records=f"{len(dataset)}",
```
to:
```
        records=len(dataset),
```

Applying these suggested changes allows the USDM data service to convert a USDM JSON file to datasets for validation.

ASL-rmarshall commented 1 month ago

Note that the newly created USDMDataService.to_parquet function will not work as currently written - it also seems to assume that there will be a single USDM JSON file for each dataset, which is not the case.

cpshadle commented 1 month ago

the parquet part of this ticket will go to another issue.

cdisc-org / cdisc-rules-engine

CLI validate no longer works for USDM JSON data #840