Closed ASL-rmarshall closed 4 months ago
Changes to consider:
- To speed up execution, self.json and self.dataset_content_index should only be created if needed (possibly move them back out of __init__ again). At the moment, this time-consuming process is performed each time a USDMDataService instance is initialized but, if there's already a cache containing dataset content and metadata, json and dataset_content_index may not be needed.
- DatasetMetadata should probably be changed to be a subclass of RepresentationInterface, and the creation of the dataset metadata dictionary in get_datasets in the dummy, local and usdm data services should probably then be moved to a DatasetMetadata.to_representation method.
Instead of moving creation of self.dataset_content_index out of __init__, it is now cached with dataset_name = "USDM_content_index". This reduces execution time from ~6mins to ~3mins because the second invocation of USDMDataService retrieves it from the passed cache instead of recreating it.
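The cache-or-build pattern described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual USDMDataService code: the DictCache class, the get_content_index function and the build_content_index helper are hypothetical stand-ins. Only the cache key "USDM_content_index" comes from the comment above.

```python
from typing import Any, Callable, Dict, Optional

CONTENT_INDEX_KEY = "USDM_content_index"  # key named in the comment above


class DictCache:
    """Hypothetical in-memory stand-in for the cache passed to the data service."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    def add(self, key: str, value: Any) -> None:
        self._store[key] = value


def get_content_index(cache: DictCache, build_index: Callable[[], Any]) -> Any:
    """Return the cached index, building and caching it only on a cache miss."""
    index = cache.get(CONTENT_INDEX_KEY)
    if index is None:
        index = build_index()  # the time-consuming step
        cache.add(CONTENT_INDEX_KEY, index)
    return index


build_calls = []  # records how often the expensive build actually runs


def build_content_index() -> dict:
    """Hypothetical builder; the real index structure is not shown in this thread."""
    build_calls.append(1)
    return {"Study": [{"id": "Study_1"}]}


shared_cache = DictCache()
first = get_content_index(shared_cache, build_content_index)   # builds and caches
second = get_content_index(shared_cache, build_content_index)  # served from the cache
```

With a shared cache, the second call returns the same object without invoking the builder again, which is the effect the timing improvement above relies on.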
It's also our policy to resolve merge conflicts before approving a PR.
There is one item I don't quite agree with:
- The ".json" at the end is needed so that the correct data reader is used.
It seems wrong to pretend that a single json file is multiple json files at the Engine level. I think it would be better to fix the engine to be able to handle different types of dataframe collections (folder of files, single file, cosmos collection of items, etc). But this might be a much bigger change, in which case the hack is okay for now and you can create a new ticket for it instead.
I think this might be a bigger change. Prior to USDM, it seems that the engine assumed a 1-to-1 relationship between file name and dataset name: there are several places where dataset_name is expected to contain the file name, which doesn't work if a file contains multiple datasets. I think it will need a reasonably significant change to unpick this.
Issue #673 created.
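To illustrate the assumption discussed above, here is a minimal sketch of what goes wrong when a cache key is derived from the file name and a single file holds several datasets. The key scheme, the cache layout and the record contents are hypothetical and are not taken from the engine; the entity names are only illustrative.

```python
import os


def cache_key_from_file(dataset_path: str) -> str:
    """Hypothetical key scheme: one file is assumed to hold exactly one dataset."""
    return os.path.basename(dataset_path)


cache = {}
usdm_file = "CDISC_Pilot_Study_dirty1.json"  # file name used as an example in this thread

# Two entities read from the same file map to the same key,
# so the second write silently replaces the first.
for entity, records in [
    ("Study", [{"id": "Study_1"}]),
    ("StudyVersion", [{"id": "StudyVersion_1"}]),
]:
    cache[cache_key_from_file(usdm_file)] = {"entity": entity, "records": records}
```

After the loop the cache holds a single entry, and the data for the first entity is gone.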
All merge conflicts have been resolved.
Updates include:
- Updated USDMDataService.get_datasets to return a list of datasets in a format that corresponds with the get_datasets output format for other data services. To support this change, I also did a bit of reorganization in usdm_data_service.py, for example:
  - Renamed the content index to dataset_content_index to differentiate it from the list of dataset-level metadata that's expected to be created by get_datasets.
  - Moved creation of dataset_content_index to the class level and cached it so that it's only done once.
- Spoofed a unique file name for each dataset, in the format <dataset_path>\<domain>.json (e.g., CDISC_Pilot_Study_dirty1.json\Study.json). This was necessary because the engine looks like it's been designed with the expectation that each dataset will be in a separate file (even for Dataset-JSON), and it uses file name as dataset name. When multiple datasets are in a single file (as for USDM) this causes problems because dataset name (i.e. file name) is used in the cache key. I ended up using this format because:
  - Using os.path.join means that the original file name is listed as the "Origin" in the Excel report.
  - The ".json" at the end is needed so that the correct data reader is used.
- Updated get_dataset to accept the spoofed unique file name instead of the domain/entity name as dataset_name.
- Added BaseDataService._replace_nans_in_numeric_cols_with_none to get_dataset to convert NaN to None in numeric columns (so that they're correctly picked up by operators such as empty).
- Updated USDM.yml to include:
- Updated __get_entity_name to use the parent entity to decide which of the multiple potential mappings is correct, which required changing its arguments and invocations to accept/pass node objects so that these can be navigated to find the parent.
- Updated validate_single_rule in run_validation.py to pass dataset_paths to RulesEngine, and RulesEngine.__init__ in rules_engine.py to accept it and pass it to DataServiceFactory.get_data_service (instead of DataServiceFactory.get_service), so that it can be used by USDMDataService.is_USDM_data to determine whether the JSON file contains USDM data (I also tweaked is_USDM_data so that it doesn't fall over if no dataset_paths is provided).
- Updated get_dataset_variables in data_processor.py and _execute_operation in rule_processor.py both to use os.path.join instead of f"{}/{}" to create dataset file references (so that these match the cache keys used by USDMDataService).
- Updated get_library_metadata_from_cache in script_utils.py not to fall over if there's no standard metadata available in the library cache.
- Updated test_usdm_data.py to align with these changes.
- Updated __get_record_data to assign False when a dict or list value is empty.
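As a rough illustration of the spoofed file name approach in the list above, the sketch below builds a per-domain pseudo path with os.path.join and uses it as a cache key. The helper names spoofed_dataset_path and split_spoofed_path, the plain dict cache and the entity names are assumptions made for the example; the real code in usdm_data_service.py may look quite different.

```python
import os
from typing import Tuple


def spoofed_dataset_path(dataset_path: str, domain: str) -> str:
    """Build a unique pseudo file name: <dataset_path><sep><domain>.json."""
    return os.path.join(dataset_path, f"{domain}.json")


def split_spoofed_path(spoofed: str) -> Tuple[str, str]:
    """Recover the real file name and the domain from a spoofed path."""
    real_file, leaf = os.path.split(spoofed)
    domain, _ext = os.path.splitext(leaf)
    return real_file, domain


cache = {}
source = "CDISC_Pilot_Study_dirty1.json"  # file name used as an example above

# Each entity in the single USDM file now gets its own cache key.
for domain in ("Study", "StudyVersion"):
    cache[spoofed_dataset_path(source, domain)] = f"<records for {domain}>"

# A key built with os.path.join always matches what was stored.
lookup_key = os.path.join(source, "Study.json")

# A key built with an f-string and "/" only matches where os.sep is "/",
# which is why both sides need to build the reference the same way.
fstring_key = f"{source}/Study.json"
```

Because the real file name is the directory part of the spoofed path, it can still be recovered with os.path.split whenever the original file is needed.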