Sage-Bionetworks / sysbioDCCjsonschemas

SysBio DCC JSON schemas

GH action to automate PEC/AD dictionary table updates + template generation #116

Open avanlinden opened 2 years ago

avanlinden commented 2 years ago

We need a GitHub Action to periodically update the AD and PEC metadata dictionaries used by dccvalidator, and to update the metadata template files available to data contributors. This will probably also require automating schema registration when new template schemas are created, as I don't think that has happened yet.

Updating dictionaries:

Updating templates:

Automatically registering schemas in Synapse:

kelshmo commented 2 years ago

Things we need to figure out:

kelshmo commented 2 years ago

@avanlinden We have a GH action set on synapseAnnotations to register schemas (all synapseAnnotations keys + values) to Synapse.

For:

Automatically registering schemas in Synapse:

  • use: can we repurpose the register-schemas.R script from synapseAnnotations?
  • when: changes to schemas
  • credentials: service account

Does this step encompass registering the assay-specific schemas to Synapse? I'm still not totally clear on how these registered schemas will be used in practice :)

kelshmo commented 2 years ago

@danlu1 is going to lead this effort.

kelshmo commented 2 years ago

There are two cases where the dictionaries or templates will need to be updated:

  1. A key or value changes or is added to synapseAnnotations (a PR merged to synapseAnnotations)
  2. A template is added to sysbioDCCjsonschemas (a PR merged to this repository)

If anything changes in synapseAnnotations or sysbioDCCjsonschemas, the dictionaries will always need to be updated, and the templates may need to be updated.
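The rule above can be written down as a small lookup. This is only a sketch of the decision described in this comment; the function name and the "update" / "check" / "skip" labels are hypothetical and do not come from any existing script.

```python
# Repositories whose merged PRs can affect dictionaries or templates
# (taken from the two cases listed above).
WATCHED_REPOS = {"synapseAnnotations", "sysbioDCCjsonschemas"}


def required_updates(changed_repo: str) -> dict:
    """Return the follow-up work for a merged PR in `changed_repo`.

    Dictionaries are always rebuilt; templates only *may* have changed,
    so they are flagged for a check rather than an unconditional update.
    """
    if changed_repo not in WATCHED_REPOS:
        return {"dictionaries": "skip", "templates": "skip"}
    return {"dictionaries": "update", "templates": "check"}
```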

In order to not over-engineer this script we could:

avanlinden commented 2 years ago

@kelshmo

Does this step encompass registering the assay-specific schemas to Synapse? I'm still not totally clear on how these registered schemas will be used in practice :)

My understanding is that the create_template_from_Syn_schema.py script requires the template schemas to be registered in Synapse in order to run. I believe they also need to be registered in order to be used by dccvalidator, if dccvalidator is configured to use JSON schemas rather than Excel templates for validation.

It seems to me that in order to use those two functions, we would need to register schemas in this repo 1) if a new schema is created (i.e. new assay template) and 2) if a schema is changed (i.e. new key added to a template), but NOT in the case where new values are added to existing keys (because those are registered as part of the referenced mini-schemas via synAnn).
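A minimal sketch of that rule, assuming each template schema lists its keys under a top-level "properties" object and pulls allowed values in by reference. The function name and the example schema contents are hypothetical; the real schemas in this repo may be laid out differently.

```python
from typing import Optional


def needs_registration(old_schema: Optional[dict], new_schema: dict) -> bool:
    """Return True when a template schema is new or its set of keys changed.

    Value changes are ignored on purpose: they live in the referenced
    mini-schemas, which are registered separately via synapseAnnotations.
    """
    if old_schema is None:
        # Case 1: a brand-new template schema.
        return True
    old_keys = set(old_schema.get("properties", {}))
    new_keys = set(new_schema.get("properties", {}))
    # Case 2: a key was added to or removed from the template.
    return old_keys != new_keys


# Hypothetical example schemas; the "$ref" targets are placeholders.
before = {"properties": {"assay": {"$ref": "example-assay-ref"}}}
after = {
    "properties": {
        "assay": {"$ref": "example-assay-ref"},
        "platform": {"$ref": "example-platform-ref"},
    }
}
print(needs_registration(before, after))
```

Comparing only the key sets keeps the check cheap, but it would miss a change to which mini-schema a key references, so that case may need its own comparison.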

In order to not over-engineer this script we could:

  • Run create_template_from_Syn_schema.py, let's say, every day and leverage forceVersion=FALSE in the store functionality to let Synapse determine if the template file has changed. Need to test this.

This seems like a reasonable approach. We don't change templates often, but it's unpredictable when we will, and we'd want the changes to be available quickly, so this might be the best bet.
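Until the forceVersion behavior is tested, a local fallback would be to compare checksums before uploading. The sketch below is an assumption-laden alternative, not a description of what Synapse does internally: the helper names are hypothetical, and the stored checksum would have to come from wherever the previous run recorded it.

```python
import hashlib


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def template_changed(path: str, stored_md5: str) -> bool:
    """True when the freshly generated template differs from the stored one."""
    return file_md5(path) != stored_md5
```

A daily job could then regenerate each template and skip the upload whenever template_changed returns False.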

kelshmo commented 2 years ago

You are correct - the template schemas are pulled down from Synapse!

So, Nicole had been the one to register new templates to Synapse. I believe her package schemann does this. Have you run any of that code @avanlinden?

kelshmo commented 2 years ago

Also, Tom has a fully functional system for setting up cronjobs! 🥇

thomasyu888 commented 2 years ago

As long as the service account doesn't access any PHI, you can use the Kubernetes system.

avanlinden commented 2 years ago

@kelshmo I totally forgot about schemann, I haven't looked into it at all. I think her register-schema.R function is the same as the one that runs in synapseAnnotations, so we should definitely use that.

Also this is totally a job for Tom's Kubernetes cluster, no PHI in sight!

danlu1 commented 2 years ago

Things we need to figure out:

  • [x] Can a GH action be triggered by something that happens in another repository? If we can't do this, then we might need to rely on a daily cronjob as @avanlinden has suggested. (@danlu1 is working on answering this Q.)
  • [x] What are the Sage standards for setting up cronjobs (e.g. can we use Tom's kubernetes clusters)? (@kelshmo will work on answering this Q.)

According to this post, the repository_dispatch event can be used to trigger workflow executions in one repository from another.
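As a rough sketch of the sending side, the request below targets GitHub's "create a repository dispatch event" REST endpoint. The function name, the event type, and the payload are hypothetical placeholders, and the token would need to belong to an account with write access to the target repository.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, event_type, token, client_payload=None):
    """Build (but do not send) a repository_dispatch request for owner/repo."""
    url = f"https://api.github.com/repos/{owner}/{repo}/dispatches"
    body = {"event_type": event_type, "client_payload": client_payload or {}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical usage: notify this repo after a merge in synapseAnnotations.
request = build_dispatch_request(
    "Sage-Bionetworks",
    "sysbioDCCjsonschemas",
    "annotations-updated",  # placeholder event name
    "TOKEN-GOES-HERE",      # placeholder; read from a secret in practice
    {"source": "synapseAnnotations"},
)
# urllib.request.urlopen(request) would actually send it.
```

The receiving workflow would need to list repository_dispatch under its `on:` triggers, optionally filtered by the same event type.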

kelshmo commented 2 years ago

Updating dictionaries:

Updating templates:

danlu1 commented 2 years ago

Example cron job: https://github.com/Sage-Bionetworks/porTools/blob/master/.github/workflows/update-ad-publications.yaml

avanlinden commented 2 years ago

Various helpful things:

danlu1 commented 2 years ago

Putting this issue on hold, since the IBC team is trying to solve the same problem. I'm trying to coordinate with them and will reopen the issue once a workaround is available.

danlu1 commented 2 years ago

Since we already have a GH action to register schemas, I will test it and see if it works. Then I will add another GH action in the sysbioDCCjsonschemas repo to update the dictionaries and templates once changes have been merged to the master branch.