Closed gerrycampion closed 6 months ago
@gerrycampion While trying to get the CLI to work for the DDF rules, I found that there were a few things that didn't work quite right for the USDM data service and its interactions with other components. These were things that I didn't really look at or during my initial review (mainly because I didn't know what they were for). I have applied some "quick-and-dirty" fixes in my local clone to get the CLI to work more-or-less as expected (see below), and I think it would be good for these to be pushed here. What's the best way to do this - restore this branch or create a new one?
Updates I've made include:
USDMDataService.get_datasets
to return a list of datasets in a format that corresponds with the get_datasets
output format for other data services. To support this change, I also did a bit of reorganization in usdm_data_service.py
, for example:
dataset_content_index
to differentiate it from the list of dataset-level metadata that's expected to be created by get_datasets
.dataset_content_index
to the class level so that it's only done once.<dataset_path>\<domain>.json
(e.g., CDISC_Pilot_Study_dirty1.json\Study.json
). This was necessary because the engine looks like it's been designed with the expectation that each dataset will be in a separate file (even for Dataset-JSON), and it uses file name as dataset name. When multiple datasets are in a single file (as for USDM) this causes problems because dataset name (i.e. file name) is used in the cache key. I ended up using this format because:
os.path.join
means that the original file name is listed as the "Origin" in the Excel reportget_dataset
to accept the spoofed unique file name instead of the domain/entity name as dataset_name
.BaseDataService._replace_nans_in_numeric_cols_with_none
to get_dataset
to convert NaN
to None
in numeric columns (so that they're correctly picked up by operators such as empty
).USDM.yml
to include:
__get_entity_name
to use the parent entity to decide which of the multiple potential mappings is correct, which required changing its arguments and invocations to accept/pass node
objects so that these can be navigated to find the parent.validate_single_rule
in run_validation.py
to pass dataset_paths
to RulesEngine
and RulesEngine.__init__
in rules_engine.py
to accept it and pass it to DataServiceFactory.get_data_service
(instead of DataServiceFactory.get_service
), so that it can be used by USDMDataService.is_USDM_data
to determine whether the JSON file contains USDM data (I also tweaked is_USDM_data
so that it doesn't fall over if no dataset_paths
is provided).DataServiceFactory.__init__
in data_service_factory.py
to rely on standard == "ddf"
instead of standard == "usdm"
because the standard has to align with the Standards.Name
used in the rules editor.get_library_metadata_from_cache
in script_utils.py
not to fall over if there's no standard metadata available in the library cache.test_usdm_data.py
to align with these changes.FYI @nickdedonder @drewcdisc @RamilCDISC
All the changes described by @ASL-rmarshall are pretty good. These will allow me to validate if usdm data is being properly converted without any additional test resources. Also covering the cases so code does not fall over will improve the code quality. Looking forward to the changed code. Once we have it I can conclude the validation for this issue 35.
@nickdedonder @drewcdisc
@gerrycampion @RamilCDISC
I have restored the 35-support-usdm-data-source-in-engine
branch and have drafted PR #631 with the additional changes described above.
A new
USDMDataService
andUSDMJSONReader
is implemented to support reading USDM json files. The service converts a single json file to a set of datasets. Given JSON, a new dataset record is created at the root, for each unique item within a list, and for each property value. If a list item contains an object, that object becomes a new record within the parent's entity dataset and it becomes a new record within its own entity dataset. In addition to the recursive object properties, each record contains the following meta variables:parent_id
parent_rel
parent_entity
rel_type
get_datasets
was added to the dataservice interface and moved from utilities to individual dataservices, since it is handled differently within each dataservice.Refer to the unit tests for examples.
USDM_EliLilly_NCT03421379_Diabetes.json
was taken from the DDF repo and is being used as test data. USDM.yaml converts property names found in the test file to entity names. This may need to be updated. Soon, this will be obsolete when we support USDM version 2-9-0 or later which requires aninstanceType
property for all objects.