35 support usdm data source in engine - additional changes to support cli

ASL-rmarshall commented 5 months ago

Updates include:

Changing USDMDataService.get_datasets to return a list of datasets in a format that corresponds with the get_datasets output format for other data services. To support this change, I also did a bit of reorganization in usdm_data_service.py, for example:
- Using different names for different things, e.g. "full_path" is now used only for the dataset path, and "content_path" is used for the JSONPath content pointers.
- Renamed the datasets-with-JSONPath-pointers dictionary as dataset_content_index to differentiate it from the list of dataset-level metadata that's expected to be created by get_datasets.
- Moved reading of the JSON file and creation of the dataset_content_index to the class level and cached it so that it's only done once.
Spoofing unique dataset names in a format that gets processed correctly downstream - i.e. <dataset_path>\<domain>.json (e.g., CDISC_Pilot_Study_dirty1.json\Study.json). This was necessary because the engine looks like it's been designed with the expectation that each dataset will be in a separate file (even for Dataset-JSON), and it uses file name as dataset name. When multiple datasets are in a single file (as for USDM) this causes problems because dataset name (i.e. file name) is used in the cache key. I ended up using this format because:
- Including the domain name makes a unique cache key (note that the cache always has to be used to access individual datasets because they can't just be re-read from a single file).
- Including it at the end using os.path.join means that the original file name is listed as the "Origin" in the Excel report.
- The ".json" at the end is needed so that the correct data reader is used.
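As a rough illustration of the naming scheme described above, the sketch below builds a spoofed name with os.path.join and shows that the original file, the domain and the ".json" extension can all be recovered from it. The helper names spoof_dataset_name and split_spoofed_name are hypothetical and are not taken from usdm_data_service.py.

```python
import os
from typing import Tuple


def spoof_dataset_name(dataset_path: str, domain: str) -> str:
    """Build a per-dataset name of the form <dataset_path><sep><domain>.json (hypothetical helper)."""
    return os.path.join(dataset_path, f"{domain}.json")


def split_spoofed_name(name: str) -> Tuple[str, str, str]:
    """Recover (origin file, domain, extension) from a spoofed name (hypothetical helper)."""
    origin, file_name = os.path.split(name)
    domain, extension = os.path.splitext(file_name)
    return origin, domain, extension


# Two domains from the same physical file get distinct names (and so distinct cache keys).
study_name = spoof_dataset_name("CDISC_Pilot_Study_dirty1.json", "Study")
version_name = spoof_dataset_name("CDISC_Pilot_Study_dirty1.json", "StudyVersion")

# The origin is still the real file, and the extension still selects a JSON reader.
origin, domain, extension = split_spoofed_name(study_name)
```

The separator in the resulting name depends on the platform, which is why the example in the description shows a backslash.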
Updated things like get_dataset to accept the spoofed unique file name instead of the domain/entity name as dataset_name.
Adding a call to BaseDataService._replace_nans_in_numeric_cols_with_none to get_dataset to convert NaN to None in numeric columns (so that they're correctly picked up by operators such as empty).
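To show why the NaN-to-None conversion matters, here is a minimal plain-Python sketch that works on a list of dict records rather than on the engine's data frames. Both replace_nans_with_none and is_empty are hypothetical stand-ins; they are not BaseDataService._replace_nans_in_numeric_cols_with_none or the engine's empty operator.

```python
import math
from typing import Any, Dict, List


def replace_nans_with_none(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with float NaN values replaced by None."""
    return [
        {
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()
        }
        for record in records
    ]


def is_empty(value: Any) -> bool:
    """Simplified emptiness check (assumption): only None and "" count as empty."""
    return value is None or value == ""


records = [{"AGE": float("nan"), "SEX": "M"}, {"AGE": 34.0, "SEX": ""}]
cleaned = replace_nans_with_none(records)

# NaN is neither None nor "", so a naive check misses it until it is converted.
missed_before = is_empty(records[0]["AGE"])
caught_after = is_empty(cleaned[0]["AGE"])
```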
Updating USDM.yml to include:
- More mappings to support the USDM 2.6 file we want to use for the demonstration.
- Two-tier mappings for attributes that can map to more than one entity. I also updated __get_entity_name to use the parent entity to decide which of the multiple potential mappings is correct, which required changing its arguments and invocations to accept/pass node objects so that these can be navigated to find the parent.
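The two-tier lookup can be sketched as follows. This assumes a mapping where an attribute resolves either directly to an entity name or to a second-tier dictionary keyed by the parent's entity name. The Node class, the ENTITY_MAP entries and the get_entity_name function are illustrative only; they are not copied from USDM.yml or from __get_entity_name.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

# Illustrative mapping (assumption): a value is either an entity name,
# or a {parent entity name: entity name} dictionary for ambiguous attributes.
ENTITY_MAP: Dict[str, Union[str, Dict[str, str]]] = {
    "studyIdentifiers": "StudyIdentifier",
    "dateValues": "GovernanceDate",
    "scope": {
        "StudyIdentifier": "Organization",
        "GovernanceDate": "GeographicScope",
    },
}


@dataclass
class Node:
    """Minimal stand-in for a JSON node that knows its attribute name and parent."""
    attribute: str
    parent: Optional["Node"] = None


def get_entity_name(node: Node) -> Optional[str]:
    """Resolve the entity for a node, using the parent's entity when the mapping is two-tier."""
    mapping = ENTITY_MAP.get(node.attribute)
    if isinstance(mapping, dict):
        parent_entity = get_entity_name(node.parent) if node.parent else None
        return mapping.get(parent_entity)
    return mapping


identifier_scope = Node("scope", parent=Node("studyIdentifiers"))
date_scope = Node("scope", parent=Node("dateValues"))
```

Passing node objects rather than bare attribute names is what makes the parent reachable at lookup time.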
Updating validate_single_rule in run_validation.py to pass dataset_paths to RulesEngine and RulesEngine.__init__ in rules_engine.py to accept it and pass it to DataServiceFactory.get_data_service (instead of DataServiceFactory.get_service), so that it can be used by USDMDataService.is_USDM_data to determine whether the JSON file contains USDM data (I also tweaked is_USDM_data so that it doesn't fall over if no dataset_paths is provided).
Updating get_dataset_variables in data_processor.py and _execute_operation in rule_processor.py both to use os.path.join instead of f"{}/{}" to create dataset file references (so that these match the cache keys used by USDMDataService).
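The mismatch that this change avoids can be reproduced with the standard library's ntpath and posixpath modules, which apply Windows and POSIX path rules regardless of the machine running the code. The cache here is a plain dictionary used as a stand-in for the engine's cache.

```python
import ntpath
import posixpath

file_name, dataset = "CDISC_Pilot_Study_dirty1.json", "Study.json"

# Key as it would be stored when the name is built with os.path.join on Windows.
windows_key = ntpath.join(file_name, dataset)
cache = {windows_key: "Study dataset"}

# A reference built with a hard-coded slash does not match that key...
slash_reference = f"{file_name}/{dataset}"
slash_hit = slash_reference in cache

# ...whereas building the reference with the same join does.
join_hit = ntpath.join(file_name, dataset) in cache

# On POSIX the two forms happen to coincide, which can hide the problem.
same_on_posix = posixpath.join(file_name, dataset) == slash_reference
```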
Changing get_library_metadata_from_cache in script_utils.py not to fall over if there's no standard metadata available in the library cache.
Changing test_usdm_data.py to align with these changes.
Changed __get_record_data to assign False when dict or list value is empty.

ASL-rmarshall commented 5 months ago

Changes to consider:

To speed up execution, self.json and self.dataset_content_index should only be created if needed (possibly by moving their creation back out of __init__ again). At the moment, this time-consuming process is performed each time a USDMDataService instance is initialized but, if there's already a cache containing dataset content and metadata, json and dataset_content_index may not be needed.
Creation of the datasets could also be split between multiple processes to speed things up.
DatasetMetadata should probably be changed to be a subclass of RepresentationInterface and the creation of the dataset metadata dictionary in get_datasets in the dummy, local and usdm data services should probably then be moved to a DatasetMetadata.to_representation method.
There are currently no dataset/metadata rules defined for DDF, so functionality of the methods supporting those types of rule has not been tested yet.
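The DatasetMetadata suggestion above might look roughly like the sketch below. The RepresentationInterface class defined here is a stand-in that assumes the engine's interface requires a to_representation method, and the field names are hypothetical rather than the actual keys produced by get_datasets.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class RepresentationInterface(ABC):
    """Stand-in for the engine's interface (assumption: it requires to_representation)."""

    @abstractmethod
    def to_representation(self) -> dict:
        ...


@dataclass
class DatasetMetadata(RepresentationInterface):
    """Dataset-level metadata; the field names are illustrative only."""
    domain: str
    filename: str
    full_path: str

    def to_representation(self) -> dict:
        # One place to build the dictionary that each data service currently builds itself.
        return {
            "domain": self.domain,
            "filename": self.filename,
            "full_path": self.full_path,
        }


metadata = DatasetMetadata(
    domain="Study",
    filename="Study.json",
    full_path="CDISC_Pilot_Study_dirty1.json/Study.json",
)
representation = metadata.to_representation()
```

With this shape, the dummy, local and usdm data services would each only need to construct DatasetMetadata objects and call to_representation.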

ASL-rmarshall commented 5 months ago

> Changes to consider:
>
> To speed up execution, self.json and self.dataset_content_index should only be created if needed (possibly by moving their creation back out of __init__ again). At the moment, this time-consuming process is performed each time a USDMDataService instance is initialized but, if there's already a cache containing dataset content and metadata, json and dataset_content_index may not be needed.

Instead of moving creation of self.dataset_content_index out of __init__, it is now cached with dataset_name = "USDM_content_index". This reduces execution time from ~6mins to ~3mins because the second invocation of USDMDataService retrieves it from the passed cache instead of recreating it.
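The caching approach described above amounts to a get-or-build pattern, sketched here with a plain dictionary standing in for the engine's cache. Only the key "USDM_content_index" comes from the comment; get_content_index and build_index are hypothetical names.

```python
from typing import Any, Callable, Dict

CONTENT_INDEX_KEY = "USDM_content_index"  # key name taken from the comment above


def get_content_index(cache: Dict[str, Any], build_index: Callable[[], dict]) -> dict:
    """Return the cached content index, building and storing it only on first use."""
    if CONTENT_INDEX_KEY not in cache:
        cache[CONTENT_INDEX_KEY] = build_index()
    return cache[CONTENT_INDEX_KEY]


build_calls = []


def build_index() -> dict:
    """Placeholder for the expensive step; records each call so reuse can be observed."""
    build_calls.append(1)
    return {"Study": ["$.study"]}


shared_cache: Dict[str, Any] = {}
first = get_content_index(shared_cache, build_index)   # builds the index
second = get_content_index(shared_cache, build_index)  # reuses the cached index
```

Because the cache is passed between instances, the second USDMDataService can skip the rebuild, which is consistent with the reported drop in run time.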

gerrycampion commented 5 months ago

It's also our policy to resolve merge conflicts before approving PR

ASL-rmarshall commented 4 months ago

> There is one item I don't quite agree with:
>
> > The ".json" at the end is needed so that the correct data reader is used.
>
> It seems wrong to pretend that a single json file is multiple json files at the Engine level. I think it would be better to fix the engine to be able to handle different types of dataframe collections (folder of files, single file, cosmos collection of items, etc). But this might be a much bigger change, in which case the hack is okay for now and you can create a new ticket for it instead.

I think this might be a bigger change. Prior to USDM, it seems that the engine assumed a 1-to-1 relationship between file name and dataset name: there are several places where dataset_name is expected to contain the file name, which doesn't work if a file contains multiple datasets. I think it will need a reasonably significant change to unpick this.

Issue #673 created.

ASL-rmarshall commented 4 months ago

> It's also our policy to resolve merge conflicts before approving PR

All merge conflicts have been resolved.

cdisc-org / cdisc-rules-engine

35 support usdm data source in engine - additional changes to support cli #631