cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License

Support multiple datasets in a single file #673

Open ASL-rmarshall opened 3 months ago

ASL-rmarshall commented 3 months ago

(Regarding additional changes to support CLI use for USDM validation, #631) There is one item I don't quite agree with:

  • The ".json" at the end is needed so that the correct data reader is used.

It seems wrong to pretend that a single JSON file is multiple JSON files at the engine level. I think it would be better to fix the engine so that it can handle different types of dataframe collections (a folder of files, a single file, a Cosmos collection of items, etc.). But this might be a much bigger change, in which case the hack is okay for now and you can create a new ticket for it instead.

Originally posted by @gerrycampion in https://github.com/cdisc-org/cdisc-rules-engine/pull/631#pullrequestreview-1977457315

ASL-rmarshall commented 3 months ago

To validate a USDM study definition contained in a single JSON file, the JSON file is converted so that each USDM class is represented as a separate dataset and each class instance is a row in that dataset. For validation purposes, the JSON file therefore contains multiple "datasets". The rules engine currently assumes, implicitly and in several places, that each dataset to be validated is contained in a separate file (e.g., dataset_name is frequently expected to contain the file name, in particular when dataset information is cached).
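To illustrate the conversion described above, here is a minimal sketch of splitting one nested JSON document into one dataset per class. It is not the engine's actual conversion code: the function name split_into_datasets is hypothetical, and it assumes each object names its class in an "instanceType" key and that only scalar attributes are kept in a row.

```python
from collections import defaultdict
from typing import Any, Dict, List


def split_into_datasets(node: Any, class_key: str = "instanceType") -> Dict[str, List[dict]]:
    """Group every object in a nested JSON document by its class name.

    Assumption: each object carries its class name under `class_key`.
    Each class becomes one "dataset"; each instance becomes one row.
    """
    datasets: Dict[str, List[dict]] = defaultdict(list)

    def visit(value: Any) -> None:
        if isinstance(value, dict):
            class_name = value.get(class_key)
            if isinstance(class_name, str):
                # Keep only scalar attributes in the row; nested objects
                # are visited below and become rows of their own datasets.
                row = {k: v for k, v in value.items() if not isinstance(v, (dict, list))}
                datasets[class_name].append(row)
            for child in value.values():
                visit(child)
        elif isinstance(value, list):
            for item in value:
                visit(item)

    visit(node)
    return dict(datasets)
```

A real conversion would presumably also need to record parent/child relationships between rows, which this sketch leaves out.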

To get the CLI validation working as an interim solution, a unique "proxy file name" was generated and assigned for each dataset contained in the single JSON file (see #631), so that this "proxy file name" could be used as the dataset_name by the rules engine. The "proxy file name" was generated by appending the USDM class name to the original single JSON file name and adding a ".json" extension. For example, the "proxy file name" for the "Study" class dataset in a single JSON file called "/test/data/study_def.json" would be "/test/data/study_def.json/Study.json". This format was chosen to give a unique value for each USDM class dataset, and because the ".json" extension may be used to select the data or metadata readers.
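The naming scheme above can be sketched as follows. The helper names proxy_file_name and split_proxy_file_name are hypothetical and are not taken from the engine or from #631; only the resulting format follows the example given above.

```python
from typing import Tuple


def proxy_file_name(json_path: str, class_name: str) -> str:
    """Build the "proxy file name" for one USDM class dataset.

    Format: <original JSON file path>/<USDM class name>.json
    """
    return f"{json_path}/{class_name}.json"


def split_proxy_file_name(proxy: str) -> Tuple[str, str]:
    """Recover (original JSON file path, USDM class name) from a proxy name."""
    source, _, leaf = proxy.rpartition("/")
    suffix = ".json"
    class_name = leaf[: -len(suffix)] if leaf.endswith(suffix) else leaf
    return source, class_name


name = proxy_file_name("/test/data/study_def.json", "Study")
print(name)  # /test/data/study_def.json/Study.json
print(split_proxy_file_name(name))  # ('/test/data/study_def.json', 'Study')
```

Note that the proxy name looks like a path but does not point at a real file, which is why anything that treats dataset_name as a file name has to be handled carefully.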

It would be better to update the engine so that it supports single files containing multiple datasets without having to create "proxy file names".
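One possible shape for this, purely as a sketch for discussion: the engine could ask a "dataset source" for dataset names and rows instead of assuming one file per dataset. Everything below is hypothetical (DatasetSource, FolderSource, SingleFileSource and row_counts do not exist in the engine), and the single-file layout used here (a JSON object keyed by dataset name) is a simplification, not the USDM format.

```python
import json
from pathlib import Path
from typing import Dict, List, Protocol


class DatasetSource(Protocol):
    """Hypothetical interface: a collection of named datasets."""

    def dataset_names(self) -> List[str]: ...

    def read(self, name: str) -> List[dict]: ...


class FolderSource:
    """One dataset per JSON file in a folder; the dataset name is the file stem."""

    def __init__(self, folder: str) -> None:
        self._files = {p.stem: p for p in sorted(Path(folder).glob("*.json"))}

    def dataset_names(self) -> List[str]:
        return list(self._files)

    def read(self, name: str) -> List[dict]:
        with open(self._files[name], encoding="utf-8") as handle:
            return json.load(handle)


class SingleFileSource:
    """Many datasets in one JSON file, stored as {dataset name: [rows]}."""

    def __init__(self, path: str) -> None:
        with open(path, encoding="utf-8") as handle:
            self._datasets: Dict[str, List[dict]] = json.load(handle)

    def dataset_names(self) -> List[str]:
        return list(self._datasets)

    def read(self, name: str) -> List[dict]:
        return self._datasets[name]


def row_counts(source: DatasetSource) -> Dict[str, int]:
    """Example consumer: works with any source, never touches file names."""
    return {name: len(source.read(name)) for name in source.dataset_names()}
```

With something along these lines, dataset_name would be a logical name supplied by the source rather than a file path, so caching and reader selection would no longer depend on a ".json" suffix.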