AD.model.* (csv | jsonld): this is the current, "live" version of the AD Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.
:warning: Do not edit AD.model.csv or AD.model.jsonld by hand! :warning:
The main branch of this repo is protected, so you cannot push changes to main. To make changes to the data model:
The full AD.model.csv file has over 1400 attributes, which makes it unwieldy to edit and makes changes hard to review. For ease of editing, the full data model is divided into "module" subfolders, like so:
data-models/
├── AD.model.csv (do not edit!)
├── AD.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv
Within each module, every attribute in the data model where Parent = ManifestColumn has its own csv, named after that attribute (example: organ.csv). Any valid values of the attribute "organ" have Parent = organ and are listed as rows in the file organ.csv. Attributes with Parent = ManifestColumn are used as columns in metadata and annotation manifest templates. Attributes with Parent = ManifestTemplate describe the templates themselves. At this time, any other value for Parent means the attribute is a valid value of some other enumerated attribute.
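As a rough illustration of the Parent convention described above, the following sketch classifies a few rows by their Parent value. It is not part of this repo's tooling. The sample rows ("brain", "liver") are hypothetical, and only the Attribute and Parent columns are shown.

```python
import csv
import io

# Hypothetical sample rows; a real module csv has more columns than these two.
SAMPLE = """Attribute,Parent
organ,ManifestColumn
brain,organ
liver,organ
model-ad_individual_animal_metadata,ManifestTemplate
"""


def classify(parent: str) -> str:
    """Map a Parent value to the role described in the text above."""
    if parent == "ManifestColumn":
        return "manifest column"
    if parent == "ManifestTemplate":
        return "manifest template"
    # Any other Parent names the enumerated attribute this row is a value of.
    return f"valid value of '{parent}'"


def classify_rows(text: str) -> dict[str, str]:
    """Return {Attribute: role} for every row in the csv text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Attribute"]: classify(row["Parent"]) for row in reader}


roles = classify_rows(SAMPLE)
for attribute, role in roles.items():
    print(f"{attribute}: {role}")
```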
Some common data model editing scenarios are:

- Adding a new valid value to an existing attribute, such as "organ":
  - Open the csv for that attribute, for example modules/biospecimen/organ.csv.
  - Add the new value as a new row. In the Parent column, make sure the value is "organ".
- Adding a new column, such as "furColor", to an existing manifest template:
  - In the MODEL-AD subfolder, create a new csv called furColor.csv with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure Parent = ManifestColumn. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
  - Open modules/template/templates.csv. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the DependsOn column.

For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team Slack channel.
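For anyone who prefers scripting these edits over hand-editing, here is a minimal sketch of the two common scenarios: appending a valid value to an attribute csv, and adding an attribute to a template's DependsOn list. It is an illustration, not a tool from this repo. The helper names, the reduced set of column headers, the ", " separator, and the sample values ("kidney", "individualID", "species") are all assumptions, and the demo runs against temporary files rather than the real modules.

```python
import csv
import tempfile
from pathlib import Path


def read_csv(path: Path) -> tuple[list[str], list[dict[str, str]]]:
    """Return (header, rows) for a csv file."""
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def write_csv(path: Path, header: list[str], rows: list[dict[str, str]]) -> None:
    """Write rows back out using the original header order."""
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)


def add_valid_value(module_csv: Path, value: str, description: str = "") -> None:
    """Append `value` as a row whose Parent is the attribute the file is named after.

    Assumes the file has Attribute, Description, and Parent columns.
    """
    header, rows = read_csv(module_csv)
    attribute = module_csv.stem  # organ.csv -> "organ"
    if any(row["Attribute"] == value for row in rows):
        raise ValueError(f"{value!r} is already listed in {module_csv.name}")
    new_row = dict.fromkeys(header, "")
    new_row.update({"Attribute": value, "Description": description, "Parent": attribute})
    rows.append(new_row)
    write_csv(module_csv, header, rows)


def add_to_depends_on(templates_csv: Path, template: str, attribute: str) -> None:
    """Add `attribute` to the comma-separated DependsOn list of one template row."""
    header, rows = read_csv(templates_csv)
    for row in rows:
        if row["Attribute"] == template:
            listed = [a.strip() for a in row["DependsOn"].split(",") if a.strip()]
            if attribute not in listed:
                listed.append(attribute)
            row["DependsOn"] = ", ".join(listed)  # separator is an assumption
            break
    else:
        raise KeyError(f"no template row named {template!r}")
    write_csv(templates_csv, header, rows)


# Demo on throwaway files with a reduced, hypothetical set of columns.
with tempfile.TemporaryDirectory() as tmp:
    organ_csv = Path(tmp) / "organ.csv"
    organ_csv.write_text(
        "Attribute,Description,Parent\n"
        "organ,Organ sampled,ManifestColumn\n"
        "brain,,organ\n"
    )
    add_valid_value(organ_csv, "kidney")
    organ_rows = read_csv(organ_csv)[1]

    templates_csv = Path(tmp) / "templates.csv"
    templates_csv.write_text(
        "Attribute,DependsOn,Parent\n"
        'model-ad_individual_animal_metadata,"individualID, species",ManifestTemplate\n'
    )
    add_to_depends_on(templates_csv, "model-ad_individual_animal_metadata", "furColor")
    template_rows = read_csv(templates_csv)[1]

print(organ_rows[-1])
print(template_rows[0]["DependsOn"])
```

Deriving the Parent value from the file name mirrors the convention that each attribute csv is named after its attribute. Check the result against the real schematic column headers before committing anything.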
A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:
We are exploring better solutions to this problem -- if you have ideas, tell us!
When you open a PR that includes any changes to files in the modules/ directory, a GitHub Action will automatically run before merging is allowed. This action:

- Runs the assemble_csv_data_model.py script to concatenate the modular attribute csvs into one data frame, sort alphabetically by Parent and then Attribute, and write the combined data frame to AD.model.csv. The action then commits the changes to the master data model csv.
- Installs schematic from PyPI and runs schema convert on the newly concatenated data model csv to generate a new version of the jsonld file AD.model.jsonld. The action also commits the changes to the jsonld.

If this automated workflow fails, then the data model may be invalid and further investigation is needed.
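To make the assembly step concrete, here is a minimal sketch of the concatenate-and-sort logic described above. It is not the actual assemble_csv_data_model.py script. It uses only the standard library instead of a data frame, the function name and sample rows are hypothetical, and the case-sensitive sort is an assumption about how "alphabetically" is applied.

```python
import csv
import tempfile
from pathlib import Path


def assemble(modules_dir: Path, output_csv: Path) -> list[dict[str, str]]:
    """Concatenate every csv under modules_dir, sort the rows, and write one csv."""
    rows: list[dict[str, str]] = []
    header: list[str] = []
    for module_csv in sorted(modules_dir.rglob("*.csv")):
        with module_csv.open(newline="") as f:
            reader = csv.DictReader(f)
            # Keep the union of column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in header:
                    header.append(name)
            rows.extend(reader)
    # Sort by Parent, then Attribute (plain string comparison, case-sensitive).
    rows.sort(key=lambda r: (r.get("Parent", ""), r.get("Attribute", "")))
    with output_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo on a throwaway directory tree with hypothetical rows.
with tempfile.TemporaryDirectory() as tmp:
    modules = Path(tmp) / "modules"
    (modules / "biospecimen").mkdir(parents=True)
    (modules / "sequencing").mkdir(parents=True)
    (modules / "biospecimen" / "organ.csv").write_text(
        "Attribute,Parent\norgan,ManifestColumn\nliver,organ\nbrain,organ\n"
    )
    (modules / "sequencing" / "platform.csv").write_text(
        "Attribute,Parent\nplatform,ManifestColumn\nIllumina,platform\n"
    )
    combined = Path(tmp) / "AD.model.csv"
    assembled = assemble(modules, combined)
    order = [row["Attribute"] for row in assembled]
    header_line = combined.read_text().splitlines()[0]

print(order)
```

Because the combined file is regenerated from the modules on every run, any hand edit to AD.model.csv would be overwritten, which is why edits belong in the module csvs.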
:warning: If you are working in a GitHub Codespace, do NOT commit any Synapse credentials to the repository and do NOT use any real human data when testing data model function. This is not a secure environment!
If you want to make changes to the data model and test them out by generating manifests with schematic, you can use the devcontainer in this repo with a GitHub Codespace. This will open a container in a remote instance of VS Code and install the latest version of schematic. The devcontainer also installs the Rainbow CSV extension. You can make changes, commit them, and open a PR from the codespace.
Codespace secrets:
Previous versions of the data model live in the legacy-data-models/ folder. These include the Diverse Cohorts pilot model and the initial "legacy" model representing the AD Portal Synapse project metadata dictionary and metadata templates from August 2023. These are not being used by DCA.