The updated engine does not convert a USDM JSON file to datasets for validation. I have tracked the issue to the following:
`run_validation.py` (line 40): the call to `get_data_service` must pass `args.dataset_paths` for the USDM data service to be assigned correctly. `dataset_paths` is passed into `USDMDataService.is_USDM_data`, which checks whether the first file is a USDM JSON file; if so, the USDM data service is assigned.
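The selection logic described above can be sketched as follows. This is an illustrative reconstruction, not the engine's actual code: the function names mirror the report, but the JSON heuristic (checking for a top-level `"study"` key) and the return values are assumptions.

```python
import json
import os


def is_usdm_data(dataset_paths):
    """Peek at the first file; treat it as USDM if it is a JSON file
    that parses and carries a top-level 'study' key (illustrative
    heuristic -- the real check may differ)."""
    if not dataset_paths:
        return False
    first = dataset_paths[0]
    if os.path.splitext(first)[1].lower() != ".json":
        return False
    try:
        with open(first) as f:
            content = json.load(f)
    except (ValueError, OSError):
        return False
    return isinstance(content, dict) and "study" in content


def get_data_service(args):
    # The fix: forward args.dataset_paths so the USDM check can run.
    if is_usdm_data(args.dataset_paths):
        return "usdm-data-service"  # stand-in for USDMDataService(...)
    return "default-data-service"
```

Without `dataset_paths` being forwarded, the check never sees the JSON file and the default data service is chosen, which is the failure described above.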
`script_utils.py` `get_datasets`: this does not work with the way the USDM data service currently converts a USDM JSON file's contents into separate datasets. For USDM, we currently pass a single JSON file in `dataset_paths`, and the USDM data service converts it into separate datasets, whereas this version of `get_datasets` expects a separate entry in `dataset_paths` for each dataset. A more systematic redesign may be needed to handle multiple datasets contained in a single file but, in the meantime, inserting the following two lines at the very beginning of `get_datasets` (i.e., line 287) fixes the problem by using the USDM data service's own `get_datasets` function instead:

```python
if data_service.standard == "usdm":
    return data_service.get_datasets()
```
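In context, the workaround would sit at the top of a `get_datasets` shaped roughly like this. The early return and the `standard` attribute follow the report; the original per-path behaviour below it is a simplified sketch, not the actual `script_utils.py` body.

```python
def get_datasets(data_service, dataset_paths):
    # Inserted early return: a single USDM JSON file expands into
    # several datasets, so let the USDM service do the splitting.
    if data_service.standard == "usdm":
        return data_service.get_datasets()
    # Original behaviour (sketched): one dataset per path entry.
    return [data_service.get_dataset(path) for path in dataset_paths]
```

This keeps the one-entry-per-dataset path untouched for non-USDM standards while routing the single-file USDM case to the service that knows how to split it.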
`usdm_data_service.py` `get_raw_dataset_metadata` (line 122): this method currently returns `DatasetMetadata.records` as a string, which causes a datatype conflict when summing dataset lengths. Returning a number instead fixes the problem, i.e., change:

```python
records=f"{len(dataset)}",
```

to:

```python
records=len(dataset),
```
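A minimal reproduction of the datatype conflict, using plain dicts in place of `DatasetMetadata`: summing record counts across datasets fails when `records` is a string and works once it is numeric.

```python
# Broken: records stored as strings, as in the current code.
datasets = [{"records": f"{len([1, 2, 3])}"},
            {"records": f"{len([4, 5])}"}]
try:
    total = sum(d["records"] for d in datasets)  # 0 + "3" -> TypeError
except TypeError:
    total = None

# Fixed: records stored as numbers, so the sum succeeds.
fixed = [{"records": len([1, 2, 3])},
         {"records": len([4, 5])}]
fixed_total = sum(d["records"] for d in fixed)  # 3 + 2 = 5
```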
Applying these suggested changes allows the USDM data service to convert a USDM JSON file to datasets for validation.
Note that the newly created `USDMDataService.to_parquet` function will not work as currently written: it also seems to assume that there will be a single USDM JSON file for each dataset, which is not the case.
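One possible direction for that fix, sketched here with hypothetical names and a stubbed writer: expand the single USDM JSON into its datasets first, then emit one Parquet file per dataset. A real implementation would replace the stub with `pandas.DataFrame.to_parquet` or `pyarrow.parquet.write_table`; this sketch only shows the per-dataset loop.

```python
import os


def to_parquet(usdm_datasets, output_dir, write=None):
    """usdm_datasets: mapping of dataset name -> rows, as produced by
    the service's dataset-splitting step (hypothetical shape).
    Returns the paths of the files it writes, one per dataset."""
    write = write or (lambda path, rows: None)  # stub Parquet writer
    paths = []
    for name, rows in usdm_datasets.items():
        path = os.path.join(output_dir, f"{name}.parquet")
        write(path, rows)
        paths.append(path)
    return paths
```

The point is simply that the unit of output is a dataset extracted from the one JSON file, not the input file itself.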