SysBio DCC JSON schemas
sysbioDCCjsonschemas

This repository holds the JSON schemas for the Systems Biology (SysBio) DCC.

Annotation Schema

Annotation schemas can be found in the schema_annotations folder and are organized by consortium.

Annotations Table

The code/python folder contains Python scripts for generating metadata templates and annotation tables based on the metadata template schemas registered in Synapse.

NOTE: The scripts in this repository assume the latest versions of all JSON schemas are registered. If you have added or changed a schema, ensure the schema has been registered before running the scripts.

create_Syn_table_from_Syn_schemas.py will generate a Synapse table of all terms found in a set of metadata templates. The set is determined by consortium using the config file (config/schemas.yml). There are options to create a new table or overwrite an existing table.

Synapse credentials: This script can be used with a SCHEDULED_JOB product in the AWS service catalog by providing a Synapse PAT as a scheduled job secret. The script looks for a secret passed in from the scheduled job, and if no secret is found, uses any provided local Synapse credentials to log in.
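The fallback order described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the environment variable name `SCHEDULED_JOB_SECRETS`, the JSON key `SYNAPSE_AUTH_TOKEN`, and the helper `resolve_auth_token` are assumptions made for the example.

```python
import json
import os


def resolve_auth_token(env=None):
    """Return a Synapse PAT from a scheduled-job secret, or None if absent.

    Assumption: the scheduled job exposes its secrets as a JSON object in the
    SCHEDULED_JOB_SECRETS environment variable, with the PAT stored under the
    SYNAPSE_AUTH_TOKEN key. Both names are hypothetical here.
    """
    env = os.environ if env is None else env
    raw = env.get("SCHEDULED_JOB_SECRETS")
    if not raw:
        # No secret was passed in; the caller falls back to local credentials.
        return None
    try:
        secrets = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(secrets, dict):
        return None
    return secrets.get("SYNAPSE_AUTH_TOKEN") or None


if __name__ == "__main__":
    token = resolve_auth_token()
    source = "scheduled job secret" if token else "local Synapse credentials"
    print(f"Would log in using: {source}")
```

With a token in hand, a script could call `syn.login(authToken=token)`; with `None`, it could call `syn.login()` so that the Synapse client uses locally stored credentials.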

Parameters:

The annotation table can be created with a single command. Example:

python3 code/python/create_Syn_table_from_Syn_schemas.py \
  --config_file config/schemas.yml \
  --consortium PsychENCODE \
  new_table \
  --parent_synapse_id syn21786765 \
  --synapse_table_name pec_annots

The annotation table can be updated with a single command. Example:

python3 code/python/create_Syn_table_from_Syn_schemas.py \
  --config_file config/schemas.yml \
  --consortium PsychENCODE \
  overwrite_table \
  --table_synapse_id syn20981788
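
Both example commands share the top-level options and differ only in the subcommand and its own options. The sketch below shows one way such a command line could be parsed with `argparse`. It mirrors only the flags visible in the examples; whether the real script marks them as required, and the help strings used here, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the command shape shown in the examples."""
    parser = argparse.ArgumentParser(
        description="Create or overwrite a Synapse annotations table."
    )
    # Options shared by both subcommands.
    parser.add_argument("--config_file", required=True, help="Path to schemas.yml")
    parser.add_argument("--consortium", required=True, help="Consortium name")

    subparsers = parser.add_subparsers(dest="command", required=True)

    # new_table: create a table under a parent Synapse entity.
    new_table = subparsers.add_parser("new_table", help="Create a new table")
    new_table.add_argument("--parent_synapse_id", required=True)
    new_table.add_argument("--synapse_table_name", required=True)

    # overwrite_table: replace the contents of an existing table.
    overwrite = subparsers.add_parser("overwrite_table", help="Overwrite a table")
    overwrite.add_argument("--table_synapse_id", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        [
            "--config_file", "config/schemas.yml",
            "--consortium", "PsychENCODE",
            "overwrite_table",
            "--table_synapse_id", "syn20981788",
        ]
    )
    print(args.command, args.table_synapse_id)
```

Note that the shared options come before the subcommand name, which matches the order used in both examples.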

Metadata Templates

The metadata templates are located in the schema_metadata_templates folder and are organized by consortium.

Generate Metadata Template

Currently, there are two approaches to generate metadata templates.

  1. Generate the metadata template(s) using registered schemas.

create_template_from_Syn_schema.py will generate either a .csv or .xlsx metadata template based on a registered metadata schema.

Parameters:

Code

   python3 create_template_from_Syn_schema.py \
     sysbio.metadataTemplates-pec.manifest \
     /home/ec2-user/sysbioDCCjsonschemas/config/schemas.yml \
     /home/ec2-user/sysbioDCCjsonschemas/schema_metadata_templates/PsychENCODE/manifest_metadata_template.json
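
As a rough illustration of how a template header can be derived from a JSON schema, the sketch below writes a header-only .csv template from a schema's `properties`. It is not the logic of create_template_from_Syn_schema.py: the example schema, its property names, and the helper names are all hypothetical, and the real script works from schemas registered in Synapse.

```python
import csv
import io
import json

# Hypothetical schema: the property names below are illustrative only.
EXAMPLE_SCHEMA = json.loads("""
{
  "properties": {
    "individualID": {"type": "string"},
    "specimenID": {"type": "string"},
    "assay": {"enum": ["rnaSeq", "wholeGenomeSeq"]}
  }
}
""")


def template_columns(schema):
    """Return the schema's property names in the order they are declared."""
    return list(schema.get("properties", {}))


def write_csv_template(schema, handle):
    """Write a header-only CSV template to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(template_columns(schema))


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv_template(EXAMPLE_SCHEMA, buffer)
    print(buffer.getvalue(), end="")
```

Writing to a text handle keeps the sketch easy to test; passing an opened file instead of the `io.StringIO` buffer would write the template to disk.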

  2. Generate the metadata template(s) using the schematic workflow.

The schematic develop branch has been pulled into this repository as a submodule named schematic_dev. Follow the instructions below to set up a development environment on your local machine.

  1. Install poetry

      curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
  2. Add the poetry directory to your PATH in .bashrc (export PATH="$HOME/.poetry/bin:$PATH") so you don't have to set it again each time you open a new terminal.

  3. Follow the instructions here to start virtual environment, install dependencies, and fill in credential files.

Metadata templates created by schematic are stored in schematic_schemas. This directory contains the data model csv and its derived jsonld schema. The json and xlsx directories contain individual schemas and template sheets, respectively. The code directory contains the scripts for creating these files (deprecated, since it does not include the most recent changes in the schematic develop branch).

Here are step-by-step instructions for generating interactive Excel metadata templates using schematic.

  1. Update the data.model.csv data model by hand. Example: 1kD.data.model.csv. Note that the only checks performed within Google Sheets are against the specified valid values and the regex match validation rule.

  2. Prerequisites: Make sure you have a minimal.model.jsonld and a service_account_creds.json file in your repository.

  3. Create and activate a virtual environment within which you can install the package:

poetry shell

  4. Convert the data model to a JSON schema (jsonld). Example:

schematic schema convert --base_schema ./minimal.model.jsonld ./1kD.data.model.csv

  5. Create a Google Sheet template and JSON for each data type. Example:

schematic manifest --config config.yml get -s -oa -p ./1kD.data.model.jsonld -t IndividualHumanMetadataTemplate1kD -dt IndividualHumanMetadataTemplate1kD

Check the definitions of the arguments here.

  6. Manually download all the Google Sheets as Excel files (example google sheet). Using the Google Drive API would be clutch.

  7. Upload all the Excel templates to Synapse (AD, PEC, and 1kD).

  8. Register the JSON schemas to Synapse by tinkering with register-schemas.py for each schema. (not tested yet)

NOTE: Don't forget to commit and push the newly generated data model, model jsonld, json schema(s), excel template(s) to the schematic_schemas repo.
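Because the manifest command has to be repeated for each data type, it can help to generate the command lines rather than type them by hand. The sketch below builds the two schematic commands from the steps above as argument lists and prints them. The flags are copied from the examples; the helper names are hypothetical, and reusing the data type as the template title (`-t`) simply follows the example rather than any requirement known to apply generally.

```python
import shlex


def convert_command(base_schema, data_model_csv):
    """Return the argv list for converting a data model CSV to JSON-LD."""
    return [
        "schematic", "schema", "convert",
        "--base_schema", base_schema,
        data_model_csv,
    ]


def manifest_command(config, jsonld, data_type):
    """Return the argv list for generating one template for a data type."""
    return [
        "schematic", "manifest", "--config", config, "get",
        "-s", "-oa", "-p", jsonld, "-t", data_type, "-dt", data_type,
    ]


if __name__ == "__main__":
    # Only this data type appears in the example above; add your own as needed.
    data_types = ["IndividualHumanMetadataTemplate1kD"]

    commands = [convert_command("./minimal.model.jsonld", "./1kD.data.model.csv")]
    commands += [
        manifest_command("config.yml", "./1kD.data.model.jsonld", data_type)
        for data_type in data_types
    ]
    for argv in commands:
        # Printed rather than executed; subprocess.run(argv, check=True) would run it.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running anything against Google Sheets or Synapse.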

Docker

This repo contains a Dockerfile that can be used to build a docker image locally. Alternatively, the docker image is available on Docker Hub under sagebionetworks/sysbiodccjsonschemas.

Build Image Locally

If you'd like to build the docker image locally, clone this repo and open a terminal in the sysbioDCCjsonschemas folder. Then build the image and run it interactively.

git clone https://github.com/Sage-Bionetworks/sysbioDCCjsonschemas.git
cd sysbioDCCjsonschemas
docker build --no-cache -t sysbiodccjsonschemas .
docker run --rm -it sysbiodccjsonschemas

Pull Existing Image

If you'd like to use an existing image, then pull the docker image from Docker Hub. Below assumes pulling the latest version of the image. To use a different version, replace latest with the desired tag. The container can be run interactively once the image is pulled.

docker pull sagebionetworks/sysbiodccjsonschemas:latest
docker run --rm -it sagebionetworks/sysbiodccjsonschemas:latest

Because the docker image is not currently auto-deployed, it may be out of date with the repo. It is recommended to build the image locally or use git pull within the container to get the latest version if you are:

Usage

The docker container opens in bash at the top level of the sysbioDCCjsonschemas directory. The container does not have Synapse credentials, so follow these steps to log into Synapse. Note that this must be done every time you start a new container.

  1. Generate a Synapse Personal Access Token (PAT) by logging into Synapse and going to your profile settings. The token should be created with all permissions checked.

  2. Start a docker container using the docker image as specified above.

  3. Start python3, log into Synapse, and exit python.

    python3
    import synapseclient as synapse
    syn = synapse.Synapse()
    syn.login(authToken="your PAT", rememberMe=True)
    exit()

  4. Run the scripts needed (see below), with the desired parameters, using python3.