Test data template updates

ASL-rmarshall commented 3 months ago

Changes applied to create_template.py:

Changed the default output file name to include the USDM version extracted from the version number assigned for the USDM diagram in Enterprise Architect.
Replaced Multiple_Values sheet with separate sheets for primitive data type values (or nulls):
- string
- boolean
- float
- null
Added extra columns to align with USDM data service conversion:
- parent_entity, parent_id, parent_rel for the Study sheet.
- rel_type for all sheets
- Columns corresponding with attributes that are only defined in the API (and not the UML/CT) - e.g., instanceType
- Columns indicating the existence of values dictionary or list attributes. These columns have data type (row 3) shown as "Boolean" and have either "[Exists]" (for dictionary attributes or "[Any Exist]" appended to the column description (row 2)
Updated logic for matching singular/plural relationship names to Id/Ids attributes in the API spec to handle context and appliesTo in USDM v2.11.
Added column name-to-ordinal mapping for CT files to deal with different column lists for different USDM versions.
Added conditionality to removal of abstract classes to deal with inclusion/exclusion of abstract classes in CT file for different USDM versions

Removed generically named template file and replaced with version-specific versions for USDM v2.6 (previous design) and v2.11 (including all updates listed above)..

ASL-rmarshall commented 3 months ago

Hi @BSnoeijerCD - I've asked for your review for this pull request. I'm not expecting you to look at the code changes (unless you want to), but please could you download and review the USDM_2.11_Test_Data_Template.xlsx file. Please then start your review of this change request and either approve it (if everything looks OK) or post comment(s) with your findings. Thanks!

BSnoeijerCD commented 3 months ago

@ASL-rmarshall : some comments

sheets added for boolean, float, null, string
no label for BiomedicalConceptSurrogate class
study class will not have a parent. So corresponding items in test data sheet are not applicable.

and questions:

How should the boolean be filled in the example datasets? Y/N, True/False, 0/1 ?
Full name of class StudyProtocolDocumentVersion too long. Missing the last n. Should we refer in Core to StudyProtocolDocumentVersio (without the n)?

I will check the content of sheets as well today.

ASL-rmarshall commented 3 months ago

@BSnoeijerCD : some replies

@ASL-rmarshall : some comments

sheets added for boolean, float, null, string

Yes, this is deliberate. It is to mimic the behaviour of the USDM data service's conversion of JSON files, which creates separate datasets each showing the values for a primitive data type (boolean, float, string) or attributes with null as a value. See the Excel files of the USDM 2.11 example files that I posted in SharePoint.

no label for BiomedicalConceptSurrogate class

This is because there is no "CT Item Preferred Name" specified for the BiomedicalConceptSurrogate entity in the "DDF Entities&Attributes" sheet of the USDM 2.11 USDM_CT.xlsx file (cell G194).

study class will not have a parent. So corresponding items in test data sheet are not applicable.

For API data, the Study class has the (API-only) Wrapper class as its parent. Again, see the Excel representations of the USDM 2.11 example files that I posted in SharePoint.

and questions:

How should the boolean be filled in the example datasets? Y/N, True/False, 0/1 ?

Either as TRUE/FALSE (the Excel boolean values) or true/false as text values: Note that, by default, the first test data row in every sheet (i.e., the first row under the yellow header rows) is formatted to accept text values in every cell (the cells are formatted as "Text" instead of "General"). If you have multiple rows of test data to enter, you can use the Format Painter functionality to copy this text formatting onto additional rows. This will also prevent ISO8601 date values being converted automatically to Excel date values.

Full name of class StudyProtocolDocumentVersion too long. Missing the last n. Should we refer in Core to StudyProtocolDocumentVersio (without the n)?

This is a known issue. Our eventual intention is to be able to remove the ".xpt" suffix from the sheet names (see cdisc-org/cdisc-rules-engine#684), at which point there will be enough space for the full class name. Until then, if we need to reference the class name as a domain within the check specification - e.g., in a distinct operation - we would have to specify it without the "n" (and in upper case). It currently doesn't matter what you put in Scope because it's ignored anyway, so I would recommend using the full class name in Scope.

I will check the content of sheets as well today.

BSnoeijerCD commented 3 months ago

@ASL-rmarshall based on the content, I have the following comments:

BiomedicalConceptProperty, ScheduledDecisionInstance and AgentAdministration: something is going wrong in column J-M. Headers missing / content split.
Condition: Context and AppliesTo multiplied for each potential target (2x Context, 5x AppliesTo) but not distinguishable for Core I think. Also naming should be appliesToIds and contextIds
StudyCell.elements should be studyCell.elementIds (not sure if I missed others that cross-reference and are plural)
StudyArm.populations should be studyArm.populationIds
studyElement.studyInterventions should be studyElement.studyInterventionIds
activity.biomedicalConcepts should be activity.biomedicalConceptIds
activity.bcCategories should be activity.bcCategoryIds
activity.bcSurrogates should be activity.bcSurrogateIds
BiomedicalConceptCategory.children and BiomedicalConceptCategory.members should be childIds and MemberIds
ScheduledActivityInstance.activities should be activityIds
NarrativeContent.previous should be previousId
NarrativeContent.next should be nextId

ASL-rmarshall commented 3 months ago

@BSnoeijerCD Thanks. I have made some updates. Please download and check the updated Excel file.

@ASL-rmarshall based on the content, I have the following comments:

BiomedicalConceptProperty, ScheduledDecisionInstance and AgentAdministration: something is going wrong in column J-M. Headers missing / content split.

I'm not sure what this means - is it just that there appears to be a gap between the header rows for some of the columns when the sheet contains another column with a description (or other characteristic) that wraps onto multiple lines within its cell?

Condition: Context and AppliesTo multiplied for each potential target (2x Context, 5x AppliesTo) but not distinguishable for Core I think. Also naming should be appliesToIds and contextIds

Corrected - multiple occurrences of the same relationship name are now collapsed into a single column with the cardinality (row 4) shown a " / " delimited list of all the references. Column names corrected to be appliesToIds and contextIds.

Column names corrected to be <singular noun>Ids for all the following:

StudyCell.elements should be studyCell.elementIds (not sure if I missed others that cross-reference and are plural)

StudyArm.populations should be studyArm.populationIds

studyElement.studyInterventions should be studyElement.studyInterventionIds

activity.biomedicalConcepts should be activity.biomedicalConceptIds

activity.bcCategories should be activity.bcCategoryIds

activity.bcSurrogates should be activity.bcSurrogateIds

BiomedicalConceptCategory.children and BiomedicalConceptCategory.members should be childIds and MemberIds

ScheduledActivityInstance.activities should be activityIds

These two have also been corrected to show previousId and nextId to match the API spec, but according to the UML, they should be previousIds and nextIds - I've sent you an email about the inconsistency.

NarrativeContent.previous should be previousId

NarrativeContent.next should be nextId

BSnoeijerCD commented 3 months ago

@ASL-rmarshall : checked. All ok now.

cdisc-org / ddf-core-poc

Test data template updates #169