ImagingDataCommons / CloudSegmentator

Medical imaging segmentation workflows for FireCloud (Terra) and Seven Bridges Cancer Genomics Cloud
Apache License 2.0

Improve reproducibility of the workflows #44

Open fedorov opened 8 months ago

fedorov commented 8 months ago

We can also consider encoding the specific version of the workflow in the SEG/SR objects created, but I want to discuss this with David C first.

fedorov commented 8 months ago

Workflows currently refer to the notebooks from the "main" branch. Instead, we should use git tags and refer to tagged notebooks, parameterize papermill with a variable that passes the tag, and refer to the tagged versions of the artifacts from the notebook.
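
A minimal sketch of what passing the tag through papermill could look like. The parameter name `git_ref`, the helper names, and the example paths are hypothetical and not taken from the repository; the only thing assumed about papermill is its `-p name value` command-line option, and the URL layout assumed is the usual raw.githubusercontent.com one.

```python
# Sketch only: git_ref, the helper names and the example paths are hypothetical.
REPO = "ImagingDataCommons/CloudSegmentator"


def raw_url(path: str, git_ref: str = "main") -> str:
    """Build a raw.githubusercontent.com URL for a file at a branch or tag."""
    return f"https://raw.githubusercontent.com/{REPO}/{git_ref}/{path}"


def papermill_command(notebook: str, output: str, git_ref: str = "main") -> list[str]:
    """Assemble a papermill CLI call that forwards git_ref into the notebook."""
    return ["papermill", notebook, output, "-p", "git_ref", git_ref]


if __name__ == "__main__":
    # Untagged runs fall back to "main"; tagged runs pass the tag name.
    print(raw_url("notebooks/example.ipynb", "v1.0.0"))
    print(" ".join(papermill_command("example.ipynb", "out.ipynb", "v1.0.0")))
```

With something like this, the WDL/CWL side would only need to supply one value, and the notebook would derive every artifact URL from it.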

@vkt1414 I thought about this, and here's how I think we could do it:

  1. for non-tagged releases, just use "main"
  2. whenever we want to have a tagged release, rewrite all URLs to substitute "main" with the planned tag name, make a commit, and then tag that commit with the name that was used during rewriting URLs.
  3. on the following commit after the tagged one, rewrite the URLs again to use "main"
  4. instead of checking out individual artifacts from URLs in the notebooks, add a variable at the top of the notebook that is initialized to "main" or the tag name, check out the entire repository using "main" or the tag, and refer to the files from the checked-out repository tree in the following cells. This way we can get a bit closer to versioning notebooks when they are executed outside of papermill.
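
Step 4 could look roughly like the sketch below. The variable name `GIT_REF`, the helper names, and the destination directory are hypothetical; the sketch relies only on `git clone --branch`, which accepts tag names as well as branch names.

```python
import subprocess
from pathlib import Path

# Hypothetical notebook parameter: "main" for untagged runs, a tag name otherwise.
GIT_REF = "main"
REPO_URL = "https://github.com/ImagingDataCommons/CloudSegmentator.git"


def clone_command(git_ref: str, dest: str) -> list[str]:
    """Shallow-clone a single branch or tag; --branch accepts tag names too."""
    return ["git", "clone", "--depth", "1", "--branch", git_ref, REPO_URL, dest]


def checkout(git_ref: str = GIT_REF, dest: str = "CloudSegmentator") -> Path:
    """Clone the repository at git_ref and return the root of the working tree."""
    subprocess.run(clone_command(git_ref, dest), check=True)
    return Path(dest)
```

The following cells would then read files relative to the path returned by `checkout()` instead of fetching them by URL.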

If you have better ideas, let's discuss here before implementing!

vkt1414 commented 8 months ago

I do not have a complete plan yet, but two things worry me about the workflow you suggested. First, we use URLs in a lot of places: links to image sources and config files in the notebooks, and links to the notebooks in the WDL and CWL files. Changing them manually, I think, leaves a lot of room for error, especially when I revisit the repo after a few months of no activity. Second, I'm wary of cloning the entire git repo, as we will be running thousands of workflows in parallel, and that would mean we rely even more on GitHub. We have never faced an issue so far with individual artifacts, but I do not think cloning the whole repo is a good idea.

What I have in mind currently is to use a github action that can take care of updating the branch/tag references. We already have bits and pieces of this in several repos now (idc-index, browser, and tutorials) and it should not take a while to come up with one.

The GHA should recognize a tag and run a Python script that uses a regular expression to find references to the main branch and replace them with the tag. Otherwise, we always refer to the main branch, as you also suggested.
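
A rough sketch of the replacement step, assuming the references are raw.githubusercontent.com URLs of this repository with "main" as the ref segment. The pattern and the function name `pin_refs` are illustrative only; the real URLs in the notebooks and workflow files may need a different pattern.

```python
import re

# Assumption: references look like
# raw.githubusercontent.com/ImagingDataCommons/CloudSegmentator/main/<path>
REF_PATTERN = re.compile(
    r"(raw\.githubusercontent\.com/ImagingDataCommons/CloudSegmentator/)main(/)"
)


def pin_refs(text: str, tag: str) -> str:
    """Replace the 'main' ref segment of matching URLs with the given tag."""
    # A function replacement avoids any backreference ambiguity in the tag name.
    return REF_PATTERN.sub(lambda m: m.group(1) + tag + m.group(2), text)
```

Anchoring the pattern on the repository prefix keeps it from touching other occurrences of "main", such as file names or links to other repositories.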

What are your thoughts?

fedorov commented 8 months ago

Changing them manually I think poses a lot of room for errors

I agree - I didn't mean to do it manually, it can be done with a script.

What I have in mind currently is to use a github action that can take care of updating the branch/tag references. We already have bits and pieces of this in several repos now (idc-index, browser, and tutorials) and it should not take a while to come up with one.

For this kind of task, I think it is easier to have a script that can be run locally, so that whoever ran it can easily review all the changes and confirm that the modifications make sense.
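
One way such a local script could support review is to show a diff of each proposed rewrite before anything is written to disk. This is only a sketch using the standard-library `difflib`; the function name `preview_change` and the sample text are hypothetical.

```python
import difflib


def preview_change(name: str, old: str, new: str) -> str:
    """Return a unified diff of a proposed rewrite so it can be reviewed first."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name} (current)",
        tofile=f"{name} (rewritten)",
    )
    # An empty string means the rewrite would not change the file.
    return "".join(diff)


if __name__ == "__main__":
    old = "config: https://example.org/repo/main/config.yaml\n"
    new = old.replace("/main/", "/v1.0.0/")
    print(preview_change("workflow.wdl", old, new))
```

Running `git diff` after the script has modified the working tree would serve the same purpose.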