cancerDHC / ccdhmodel

CRDC-H model in LinkML, developed by the Center for Cancer Data Harmonization (CCDH)
https://cancerdhc.github.io/ccdhmodel/
BSD 3-Clause "New" or "Revised" License

Develop a workflow for connecting the terminology and model generation systems #46

Open gaurav opened 3 years ago

gaurav commented 3 years ago

In particular:

  1. The Terminology Service needs to be updated when the model is updated, particularly with new mappings.
  2. When the Terminology Service is updated with new permissible values (PVs), the model needs to be updated.

joeflack4 commented 3 years ago

@gaurav Oh, you made this already! I referenced it in an issue on the terminology service repo: https://github.com/cancerDHC/ccdh-terminology-service/issues/26

@gaurav @jiaola My thought on this is to use continuous integration (CI). I've used CircleCI for a similar integration issue in the past. I used CircleCI because it was free at the time and it worked pretty well for that use case.

My thought on how this would work is to first add CircleCI to this repo on GitHub. I don't remember exactly how to do it, but it seems to set up all the hooks for you. After that, there's a config file that you commit with the repository. We could then automatically run integration tests, which would be nice. But I imagine we could also run some other scripts, such as detecting qualifying model changes like you mentioned. If any changes are found, I imagine we could have the script make the corresponding changes to the terminology repo and submit an automated pull request.

Now that I think of it, maybe CI is only needed if we want hooks on deployment. We could probably run checks and automated updates on merged pull requests to this repository. I suppose that could be done some other way, such as with GitHub Actions, but I'm not sure.
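
As a rough sketch of the "detecting qualifying model changes" step described above, the check could be as simple as diffing the permissible values of each enum between two versions of the model. The dict layout and the enum name below are assumptions made for illustration; they are not the actual CRDC-H model structure.

```python
def diff_permissible_values(old, new):
    """Return {enum_name: {"added": [...], "removed": [...]}} for enums that changed."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        before = set(old.get(name, []))
        after = set(new.get(name, []))
        if before != after:
            changes[name] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Hypothetical enum and values, for illustration only.
old_model = {"enum_example_status": ["alive", "dead"]}
new_model = {"enum_example_status": ["alive", "dead", "unknown"]}

changes = diff_permissible_values(old_model, new_model)
if changes:
    print("Model changes that may need a terminology update:", changes)
```

A non-empty result would be the signal to prepare an automated pull request against the terminology repo; how that pull request is created is left out of this sketch.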

gaurav commented 3 years ago

That's a great idea. I'm planning to use GitHub Actions for all the CI tasks on Tools repositories rather than CircleCI or Travis CI. It is free, is closely integrated with other GitHub features like pull requests, and can be activated in a number of ways, including manual activation.

I envision a GitHub Action that:

  1. Is manually activated, so that the data modelers can run it whenever they want to see what the model looks like.
  2. Will use pipenv to install all necessary Python packages.
  3. Will download the latest version of the Google Sheet using a Google Sheets API authentication token that is stored as an encrypted secret.
  4. Will regenerate all the model artifacts, including the documentation, and report any errors and warnings it ran into to whoever started the process.
  5. If model generation is successful, will publish the generated documentation as the "dev" version of the model documentation, replacing the existing dev version. This will allow the data modelers to immediately see the updated model documentation, and report situations where the generated documentation is incorrect.
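
The control flow in the steps above (run each step, collect warnings, stop on the first error, publish only on success) can be sketched in Python as follows. This is a minimal illustration, not the actual workflow: the step functions are empty placeholders, and the real download, generation, and publishing logic is assumed to live elsewhere.

```python
def run_pipeline(steps, publish):
    """Run each (name, step) in order; call publish() only if every step succeeds.

    Each step is a callable that returns a list of warning strings and raises
    an exception on failure. Returns a report for whoever started the run.
    """
    report = {"warnings": [], "errors": [], "published": False}
    for name, step in steps:
        try:
            warnings = step()
        except Exception as exc:
            report["errors"].append(f"{name}: {exc}")
            break  # stop at the first failing step
        report["warnings"].extend(f"{name}: {w}" for w in warnings)
    if not report["errors"]:
        publish()
        report["published"] = True
    return report


# Placeholder steps standing in for the real work (hypothetical names).
def download_sheet():
    return []


def regenerate_artifacts():
    return ["example warning"]


published = []
report = run_pipeline(
    [("download", download_sheet), ("regenerate", regenerate_artifacts)],
    publish=lambda: published.append("dev"),
)
print(report)
```

Returning a report rather than printing inside each step keeps the warnings and errors in one place, which makes it easier to show them to the person who triggered the run.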

The only thing missing is regenerating the downstream artifacts, such as the Python Data Classes. I don't think we actually want to generate those in CI -- that sounds like something that should be regenerated only on published versions of the CRDC-H model (e.g. v1.0.1 rather than just dev releases).

We can't implement this right away, because I'm still working on #45, but I should be done with that soon and then we can try this out. If you want to take a stab at this, however, feel free to create a branch and try it out! I would ask that you create a pull request once you're done so I can review your changes before merging them in.

Note that there's a closely related issue -- #12 -- which covers testing the software in this repository. My plan there is to set up a Google Sheet that contains dummy data, and then write a series of tests that will run the Google Sheets-to-YAML code, run the YAML-to-artifacts code from LinkML, and ensure that we get valid output that matches our expectations. It might be easier to build that before tackling this issue.
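
To give a feel for the shape such a test could take, here is a minimal sketch. The `rows_to_schema` function is a hypothetical stand-in for the sheet-to-YAML step, not the code in this repository, and the dummy rows are invented; only the general pattern (known dummy input, assertions on the generated structure) is the point.

```python
def rows_to_schema(rows):
    """Hypothetical stand-in for the sheet-to-YAML step: group attribute rows by class."""
    classes = {}
    for row in rows:
        attributes = classes.setdefault(row["class"], {"attributes": {}})["attributes"]
        attributes[row["attribute"]] = {"range": row["range"]}
    return {"classes": classes}


# Dummy rows playing the role of the test Google Sheet (invented data).
DUMMY_ROWS = [
    {"class": "Specimen", "attribute": "id", "range": "string"},
    {"class": "Specimen", "attribute": "source", "range": "string"},
]


def check_dummy_sheet():
    """Assert that the dummy rows produce the structure we expect."""
    schema = rows_to_schema(DUMMY_ROWS)
    assert list(schema["classes"]) == ["Specimen"]
    assert schema["classes"]["Specimen"]["attributes"]["source"] == {"range": "string"}
    return schema


schema = check_dummy_sheet()
```

A second stage of the test would feed the generated schema to the LinkML generators and check their output; that part is omitted here.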

joeflack4 commented 3 years ago

@gaurav That's a lot more detailed than what I was thinking! I think you're right about just using GitHub Actions as well. I like your suggestion on where I could participate. Dazhi is going to be gone for two weeks and suggested that I write some tests, albeit in a different repository, I believe. Maybe I can help you with some of these things before he gets back; no promises though. Also the idea for #12 makes sense to me.

@jiaola When you get back, if you want to discuss this in detail, let me know. I still need to learn the ins and outs of GitHub Actions, but I'm sure it will be well worth it.

gaurav commented 3 years ago

Note that the CCDH Terminology Service can start the model generation workflow by using the repository_dispatch event.
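
For reference, a `repository_dispatch` event is created by sending a POST request to the repository's `/dispatches` endpoint in the GitHub REST API. The sketch below only builds the request and does not send it. The event type name is made up for illustration, the token is a placeholder, and the receiving workflow is assumed to declare `repository_dispatch` as one of its triggers.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, token, event_type, payload=None):
    """Build (but do not send) a POST to GitHub's repository dispatch endpoint."""
    body = {"event_type": event_type, "client_payload": payload or {}}
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# "terminology-updated" is a made-up event type; "<token>" is a placeholder.
request = build_dispatch_request(
    "cancerDHC", "ccdhmodel", "<token>", "terminology-updated"
)
print(request.get_method(), request.full_url)
# Sending is left out of this sketch; urllib.request.urlopen(request) would do it.
```

In practice the token would come from an encrypted secret in the Terminology Service repository and would need write access to the model repository.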

joeflack4 commented 3 years ago

Thanks, Gaurav! I think I have two different implementations that should work. For one of those implementations, I think this dispatch event will prove most useful.

joeflack4 commented 3 years ago

Just adding here that the GitHub Actions workflow seems to be working just fine!

After the last CCDH Model pull request was merged, I set up the necessary repository secrets and ran the actions a few times, correcting any kinks I noticed. It seems to be working well now! I believe the next steps are fleshing out the refreshFromGoogleSheets and refreshModel actions in the CCDH Model repo. @gaurav I'll leave that to you, but if I'm needed for anything else, just let me know!