eliteportal / data-models

data models for the elite project
https://eliteportal.github.io/data-models/
MIT License

ELITE Data Model and Metadata Dictionary

As of 2024-07-17, this repo contains both the production data model, which the ELITE portal uses to submit and validate metadata through the Data Curator App, and the data dictionary website, which is based on the data model and provides definitions for all metadata templates and terms used in it.

There is a separate data-dictionary repo that contains the same source code. It can be used to deploy the website later, once we are able to set up automation in that repository that monitors this repository for changes. To simplify the process, for now we use this data-models repo to manage both the data model and the dictionary.

EL Data Model

EL.data.model.* (csv | jsonld): this is the current, "live" version of the EL Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.

Editing the data model

The main branch of this repo is protected, so you cannot push changes to main. To edit the data model, create a new branch of this repository and make changes to the attribute csv files in the modules/ subdirectory. Once you have made your changes, open a pull request. This will trigger a Github Action that automatically joins the attributes from the module csvs, converts the csv data model to the json-ld format, and commits the changes to your PR. Please do not make changes to EL.data.model.csv or EL.data.model.jsonld by hand!

Editing attributes by module

The full EL.data.model.csv file has over 200 attributes and is unwieldy to edit and hard to review changes for. For ease of editing, the full data model is divided into "module" subfolders, like so:

data-models/
├── EL.data.model.csv (do not edit!)
├── EL.data.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv

Within each module, every attribute in the data model has its own csv, named after that attribute (example: organ.csv).
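To make the layout concrete, here is a minimal sketch of what joining the module csvs could look like. It is not the repo's actual join script, which may work differently. The helper names (join_module_csvs, build_demo_modules), the extra "module" key, and the demo rows are all hypothetical; the column headers Attribute and Parent are assumed from the schematic csv format.

```python
import csv
import tempfile
from pathlib import Path


def join_module_csvs(modules_dir):
    """Concatenate every per-attribute csv under modules_dir into one list of rows."""
    rows = []
    # Sort the paths so the joined output is stable from run to run.
    for csv_path in sorted(Path(modules_dir).glob("*/*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Record the module folder each row came from (illustrative only).
                row["module"] = csv_path.parent.name
                rows.append(row)
    return rows


def build_demo_modules(root):
    """Create a tiny modules/ tree that mirrors the layout shown above."""
    files = {
        "biospecimen/organ.csv": "Attribute,Parent\norgan,ManifestColumn\nbrain,\n",
        "sequencing/platform.csv": "Attribute,Parent\nplatform,ManifestColumn\n",
    }
    for relative_path, text in files.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")


# Demo against a throwaway directory, so nothing in the real repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    build_demo_modules(tmp)
    joined = join_module_csvs(tmp)

print([row["Attribute"] for row in joined])
```

Because every attribute lives in its own small file, a reviewer only sees a diff for the attribute that changed, while the joined file is always regenerated rather than edited.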

Some common data model editing scenarios are:

Adding a new valid value to an existing manifest column

  1. If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to modules/biospecimen/organ.csv.
  2. Next, find the row for the attribute "organ" (should be the first row), and within the valid values column, add "eyeball" to the comma-separated list of valid values.
  3. Save your changes and write an informative commit. Please try to add valid values alphabetically!
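If you prefer to script this edit rather than do it by hand, the steps above can be sketched roughly as follows. This is only an illustration: the function name add_valid_value is hypothetical, the example csv text is made up, and the column headers Attribute and Valid Values are assumed to match the headers in the module csvs.

```python
import csv
import io


def add_valid_value(csv_text, attribute, new_value):
    """Return csv_text with new_value merged into the attribute's valid values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    rows = list(reader)
    for row in rows:
        if row["Attribute"] == attribute:
            values = [v.strip() for v in row["Valid Values"].split(",") if v.strip()]
            if new_value not in values:
                values.append(new_value)
            # Keep the list alphabetical, ignoring case.
            row["Valid Values"] = ", ".join(sorted(values, key=str.lower))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


# Made-up example content; the real organ.csv has more columns and values.
example = 'Attribute,Valid Values,Parent\norgan,"brain, liver",ManifestColumn\n'
updated = add_valid_value(example, "organ", "eyeball")
print(updated)
```

Note that this sketch splits on every comma, so it would mangle a valid value that itself contains a comma. Check the diff before committing.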

Adding a new column to a manifest template

  1. If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
  2. Within the MODEL-AD subfolder, create a new csv called furColor.csv with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure Parent = ManifestColumn. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
  3. Find the manifest template attributes in modules/template/templates.csv. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the DependsOn column.
  4. Save your changes and write an informative commit.
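The DependsOn update in step 3 is the easiest part to get wrong, so here is a small sketch of it. Everything in it is illustrative: the function name add_column_to_template is hypothetical, the existing DependsOn entries (individualID, species) and the description text are invented placeholders, and the headers Attribute, Description, Parent, and DependsOn are assumed from the schematic csv format.

```python
def add_column_to_template(rows, template, column):
    """Append column to the DependsOn list of the named template row."""
    for row in rows:
        if row["Attribute"] == template:
            depends_on = [c.strip() for c in row["DependsOn"].split(",") if c.strip()]
            # Skip the append if the column is already listed.
            if column not in depends_on:
                depends_on.append(column)
            row["DependsOn"] = ", ".join(depends_on)
            return row
    raise KeyError(f"template {template!r} not found")


# Hypothetical rows standing in for modules/template/templates.csv.
templates = [
    {
        "Attribute": "model-ad_individual_animal_metadata",
        "DependsOn": "individualID, species",
    },
]

# Placeholder row for the new furColor.csv; the description is invented.
fur_color = {
    "Attribute": "furColor",
    "Description": "placeholder description",
    "Parent": "ManifestColumn",
}

updated_template = add_column_to_template(
    templates, "model-ad_individual_animal_metadata", fur_color["Attribute"]
)
print(updated_template["DependsOn"])
```

Appending to the end keeps the new column last in the template; if column order matters for your manifest, insert it at the position you want instead.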

For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.

Notes on collaboratively editing csvs

A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:

Adding a new template

If you add a new template manifest (e.g. for a new assay type), remove an existing manifest, or rename a manifest, you need to update the dca-template-config.yml file that DCA uses to populate the menu contributors will use to select their template. To do this, you must manually trigger the Github Action create-template-config.yml. This will re-create the DCA template config file and open a new PR with the changes. Review and merge the PR to complete the template config update. You can use the default input values provided when you manually trigger this workflow.

EL Metadata Dictionary Site

The Metadata Dictionary site is at: https://eliteportal.github.io/data-models/.

EL Metadata Dictionary is a Jekyll site utilizing Just the Docs theme and is published on GitHub Pages.

Updating Metadata Dictionary Site via Github Action

  1. The dictionary site materials should be updated after you make changes to the data model (see "Editing the data model" above). Once a PR with changes is reviewed and merged into main, the Github Action in update_metadata_dictionary.yml should automatically start. This action will update the files in _data/ and docs/ that are used to populate the dictionary website.

  2. Once any changes are detected in the _data/ or docs/ folders on the main branch, another Github action called pages.yml will run to update the deployment to the Github pages website. Verify that the dictionary site looks as expected at https://eliteportal.github.io/data-models/.

Other things you can do in this repository

Making changes WITHOUT Github Actions (locally or in a codespace):

editing data model in a github codespace

  1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager poetry. You don't need to install poetry. It should also run the command poetry install after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).
  2. Make a new branch. On that branch, make and commit any changes. Please write informative commit messages in case we need to track down data model inconsistencies or introduced bugs.
  3. Still in the top-level directory, run poetry run python data_model_creation/join_data_model.py from the terminal. This will run a python script that joins all the module csvs, does a few data frame quality checks, and uses schematic schema convert to create the updated json-ld data model.
  4. If the script succeeds, double check the version control history of your json-ld data model and make sure the changes you expected have been made! Save and commit all changes, then push your local branch to the remote.
  5. Open a pull request and request review from someone else on the EL DCC team. The Github Action that runs when you open a PR will currently fail -- you can ignore this. EL DCC team will perform manual checks before merging changes.
  6. After the PR is merged, delete your branch.
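For the "double check" in step 4, a quick script can help alongside reading the diff. The sketch below shows the kind of sanity checks one might run on the joined rows; it is not a copy of the checks in the join script, and both function names and the example rows are hypothetical. The Attribute header is assumed from the schematic csv format.

```python
from collections import Counter


def find_duplicate_attributes(rows):
    """Return attribute names that appear on more than one row."""
    counts = Counter(row["Attribute"].strip() for row in rows)
    return sorted(name for name, count in counts.items() if count > 1)


def diff_attributes(old_rows, new_rows):
    """Summarize which attribute names were added or removed."""
    old = {row["Attribute"] for row in old_rows}
    new = {row["Attribute"] for row in new_rows}
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Made-up rows: the new model adds furColor and accidentally repeats organ.
old_rows = [{"Attribute": "organ"}, {"Attribute": "tissue"}]
new_rows = [
    {"Attribute": "organ"},
    {"Attribute": "tissue"},
    {"Attribute": "furColor"},
    {"Attribute": "organ"},
]

print(find_duplicate_attributes(new_rows))
print(diff_attributes(old_rows, new_rows))
```

If the "added" and "removed" lists do not match what you meant to change, fix the module csvs and rerun the join before pushing.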

editing data model locally

  1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager poetry. You don't need to install poetry. It should also run the command poetry install after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).

Follow steps 2-4 above

  5. [Optional]: to generate a test manifest, run poetry run schematic manifest -c path/to/config.yml get -dt RelevantDataType -s from the terminal. This will generate a json schema, a manifest csv, and a link to a google sheet version of the manifest. DO NOT put any real data in the google sheet manifest! This is just an integration test to see if the manifest columns and drop downs look as expected. Don't commit the json schema and the manifest csv generated during this step to your branch -- these are ephemeral and should be deleted.
  6. Open a pull request and request review from someone else on the EL DCC team. The Github Action that runs when you open a PR will currently fail -- you can ignore this. EL DCC team will perform manual checks before merging changes.
  7. After the PR is merged, delete your branch.

updating dictionary site in a github codespace

  1. Start your codespace or build a new one. The codespace should build with a container image that includes the package manager poetry. You don't need to install poetry. It should also run the command poetry install after you launch it, which will tell poetry to install all the python libraries that are specified by this project (this will include schematic).

  2. Make a new branch.

  3. From the top-level data-models directory, run poetry run python processes/data_manager.py. This should update some files within _data/

  4. Then run poetry run python processes/page_manager.py. This should update files within docs/.

  5. Optional: you can run poetry run python processes/create_network_graph.py to create the schema visualization network graph. This is out of date and relatively unused, but it will be good to update and make more robust later.

  6. Commit changes to your branch and open a PR. After review is passed and the changes are merged to main, a Github action will run via the pages.yml workflow to build and deploy the site to https://eliteportal.github.io/data-models/
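Steps 3 and 4 have to run in that order, so it can be convenient to wrap them in one small script. The sketch below is an assumption about how you might do that, not something shipped in this repo: the names DICTIONARY_STEPS and run_steps are hypothetical, and only the two commands themselves come from the steps above. The demo uses a recording stand-in for the runner, so nothing is actually executed.

```python
import subprocess

# The two commands from steps 3 and 4, in the order they must run.
DICTIONARY_STEPS = [
    ["poetry", "run", "python", "processes/data_manager.py"],
    ["poetry", "run", "python", "processes/page_manager.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in steps:
        # With subprocess.run, check=True raises CalledProcessError
        # when a command exits with a non-zero status.
        runner(command, check=True)


# Dry run: record the commands instead of executing them.
recorded = []
run_steps(
    DICTIONARY_STEPS,
    runner=lambda command, check: recorded.append(" ".join(command)),
)
print(recorded)
```

To run the steps for real, call run_steps(DICTIONARY_STEPS) from the top-level data-models directory with the default runner.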

updating dictionary site locally

  1. Make sure you have the poetry dependency manager installed in your workspace.

Follow steps 2-5 from the section above

  6. Optional: Preview the website locally by running bundle exec jekyll serve.

  7. Commit changes to your branch and open a PR. After review is passed and the changes are merged to main, a Github action will run via the pages.yml workflow to build and deploy the site to https://eliteportal.github.io/data-models/

Building and previewing the jekyll site locally

  1. Install Jekyll and Bundler: gem install bundler jekyll
  2. Install the site's gems with Bundler: bundle install
  3. Run bundle exec jekyll serve to build your site and preview it at http://localhost:4000. The built site is stored in the directory _site.

Scraping Valid Values from Ontology

❓status unknown

Use scraping_valid_values.py to pull in values from EBI OLS sources.
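Since the status of this script is unknown, the sketch below does not describe what scraping_valid_values.py does. It only illustrates one way to turn a page of OLS results into a sorted list of valid values, assuming the response lists terms under _embedded.terms and that each term carries a label field. The function name labels_from_ols_page is hypothetical, and the sample payload is hand-written, not real OLS output.

```python
def labels_from_ols_page(payload):
    """Collect unique term labels from one page of OLS-style term results."""
    terms = payload.get("_embedded", {}).get("terms", [])
    # Skip terms without a label and drop duplicates.
    labels = {term["label"].strip() for term in terms if term.get("label")}
    return sorted(labels, key=str.lower)


# Hand-written sample shaped like the assumed response, for illustration only.
sample_page = {
    "_embedded": {
        "terms": [
            {"label": "liver"},
            {"label": "Brain"},
            {"label": "liver"},
            {"iri": "http://example.org/term-without-label"},
        ]
    }
}

valid_values = labels_from_ols_page(sample_page)
print(", ".join(valid_values))
```

The joined string has the same comma-separated shape as the valid values column in the module csvs, but review any scraped list by hand before adding it to the data model.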

DCA config repo dispatch

❓status unknown

dcc_config_repo_dispatch.yml -- Not sure what this is for, still investigating its use. Authorization is failing.

Create Data Model Visualization Tree

Schematic API Visualization Repository

Developers

Software packages installed

Files

  1. EL.data.model.csv: The CSV representation of the example data model. This file is created by the collective effort of data curators and annotators from a community (e.g. ELITE), and will be used to create a JSON-LD representation of the data model.

  2. EL.data.model.jsonld: The JSON-LD representation of the example data model, which is automatically created from the CSV data model using the schematic CLI. More details on how to convert the CSV data model to the JSON-LD data model can be found here. This is the central schema (data model) which will be used to power the generation of metadata manifest templates for various data types (e.g., scRNA-seq Level 1) from the schema.

  3. config.yml: The schematic-compatible configuration file, which allows users to specify values for application-specific keys (e.g., path to Synapse configuration file) and project-specific keys (e.g., Synapse fileview for community project). A description of what the various keys in this file represent can be found in the Fill in Configuration File(s) section of the schematic docs.

To set up the environment

After cloning the repository, run the following command: poetry install

Changes

./change-log.md