cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License
46 stars 12 forks source link

35 support usdm data source in engine - additional changes to support cli #631

Closed ASL-rmarshall closed 4 months ago

ASL-rmarshall commented 5 months ago

Updates include:

ASL-rmarshall commented 5 months ago

Changes to consider:

ASL-rmarshall commented 5 months ago

Changes to consider:

  • To speed up execution, creation of self.json and self.dataset_content_index should only be created if needed (possibly move them back out of __init__ again). At the moment, this time-consuming process is performed each time a USDMDataService instance is initialized but, if there's already a cache containing dataset content and metadata, json and dataset_content_index may not be needed.

Instead of moving creation of self.dataset_content_index out of __init__, it is now cached with dataset_name = "USDM_content_index". This reduces execution time from ~6mins to ~3mins because the second invocation of USDMDataService retrieves it from the passed cache instead of recreating it.

gerrycampion commented 5 months ago

It's also our policy to resolve merge conflicts before approving PR

ASL-rmarshall commented 4 months ago

There is one item I don't quite agree with:

  • The ".json" at the end is needed so that the correct data reader is used.

It seems wrong to pretend that a single json file is multiple json files at the Engine level. I think it would be better to fix the engine to be able to handle different types of dataframe collections (folder of files, single file, cosmos collection of items, etc). But this might be a much bigger change, in which case the hack is okay for now and you can create a new ticket for it instead.

I think this is might be a bigger change. Prior to USDM, it seems that the engine assumes that there is a 1-to-1 relationship between file name and dataset name - there are several places where dataset_name is expected to contain the file name, which doesn't work if a file contains multiple datasets. I think it will need a reasonably significant change to unpick this.

Issue #673 created.

ASL-rmarshall commented 4 months ago

It's also our policy to resolve merge conflicts before approving PR

All merge conflicts have been resolved.