Note that the newly created `USDMDataService.to_parquet` function will not work as currently written; it also seems to assume that there will be a single USDM JSON file for each dataset, which is not the case. The parquet part of this ticket will go to another issue.
The updated engine does not convert a USDM JSON file to datasets for validation. I have tracked the issue to the following:
`get_data_service` must pass `args.dataset_paths` for the USDM data service to be correctly assigned (`dataset_paths` is passed into `USDMDataService.is_USDM_data`, which checks the first file to see whether it is a USDM JSON file; if it is, the USDM data service is assigned).
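For context, the kind of check described above might look roughly like the following. This is a self-contained sketch, not the engine's actual `is_USDM_data` implementation; in particular, the assumption that a USDM export carries a top-level `"study"` object is mine.

```python
import json
from pathlib import Path
from typing import List


def looks_like_usdm_json(dataset_paths: List[str]) -> bool:
    """Rough stand-in for USDMDataService.is_USDM_data: inspect only the
    first path and decide whether it points to a USDM JSON export."""
    if not dataset_paths:
        return False
    first = Path(dataset_paths[0])
    if first.suffix.lower() != ".json":
        return False
    try:
        content = json.loads(first.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    # Assumption: a USDM export has a top-level "study" object.
    return isinstance(content, dict) and "study" in content
```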
`get_datasets`: this does not work with the way the USDM data service currently converts a USDM JSON file's contents into separate datasets. For USDM, we currently pass a single JSON file in `dataset_paths`, and the USDM data service converts it into separate datasets, whereas this version of `get_datasets` expects a separate entry in `dataset_paths` for each dataset. A more systematic redesign may be needed to handle multiple datasets contained in a single file, but in the meantime, inserting two lines right at the beginning of `get_datasets` (i.e., line 287) fixes the problem by using the USDM data service's `get_datasets` function instead (sketched below).
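The two lines themselves did not survive in this copy of the report. A minimal sketch of the idea, assuming `get_datasets` already holds the selected service in a local `data_service` variable and that `USDMDataService.get_datasets` can be called without extra arguments:

```python
# Sketch only, placed at the top of get_datasets (around line 287): if the
# USDM data service was selected, let it expand the single USDM JSON file
# into separate datasets instead of iterating over dataset_paths.
if isinstance(data_service, USDMDataService):
    return data_service.get_datasets()
```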
`get_raw_dataset_metadata` (line 122): this method currently returns `DatasetMetadata.records` as a string, which causes a datatype conflict when trying to sum dataset lengths. Returning a number instead fixes the problem, i.e., change the value assigned to `records` from a string to a number.
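The exact before/after lines are not reproduced here. As a self-contained illustration of why the numeric value matters (only the `records` field name comes from the issue; the class shape and sample data are made up):

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    """Toy stand-in for the engine's dataset metadata."""
    name: str
    records: int  # the fix: keep this numeric instead of str(...)


datasets = [
    DatasetMetadata(name="AE", records=10),
    DatasetMetadata(name="DM", records=3),
]

# Summing dataset lengths works with numeric records; with the pre-fix
# string values, sum() raises a TypeError.
total_records = sum(d.records for d in datasets)
print(total_records)  # 13
```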
Applying these suggested changes allows the USDM data service to convert a USDM JSON file to datasets for validation.