Clinical-Genomics / hermes

Communication layer between CG and the pipelines.
https://clinical-genomics.github.io/hermes/

Homogenise tags and better define tag usage #108

Open ChrOertlin opened 5 months ago

ChrOertlin commented 5 months ago

Description

A question on slack brought up a discussion on usage of tags in housekeeper.

The question was whether to add samplesheet as a tag for samplesheets used in workflows. However, samplesheet is currently "reserved" for, or "limited to", flow-cell samplesheets. The current solution is to add a new tag: nextflow-samplesheet.

Basically, we are creating a new tag consisting of two tags, which to me seems counterintuitive. Ideally these should be two separate tags: nextflow and samplesheet. Furthermore, this tag-tag pattern already exists for files like nextflow-config.
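
To make the difference concrete, here is a minimal sketch of how two separate tags could be combined in a lookup. The in-memory `files` mapping, the file paths, the `illumina` tag and the `find_files` helper are all hypothetical and only stand in for tagged files; this is not Housekeeper's actual API.

```python
# Hypothetical stand-in for tagged files: path -> set of tags.
# Not Housekeeper's real data model or API.
files = {
    "flowcell/SampleSheet.csv": {"samplesheet", "illumina"},
    "case/samplesheet.csv": {"samplesheet", "nextflow"},
    "case/nextflow.config": {"config", "nextflow"},
}


def find_files(required_tags: set[str]) -> list[str]:
    """Return the paths whose tag set contains every required tag."""
    return sorted(path for path, tags in files.items() if required_tags <= tags)


# Two general tags together identify the workflow samplesheet ...
workflow_sheets = find_files({"nextflow", "samplesheet"})
# ... while the single general tag still finds every samplesheet.
all_sheets = find_files({"samplesheet"})
```

With a composite tag such as nextflow-samplesheet, the second query would not return the workflow file unless it also carried the plain samplesheet tag.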

To do

Discuss the design patterns of tags, decide what to do, document and implement decision.

Some other points brought up

Especially with upcoming new technologies in production that possibly also require samplesheets (PacBio, ONT, Sephyr) and other files, we will likely need to introduce new tags and retrospectively alter the Illumina tags.

Example of inefficient tag usage / construction

The VariantTags in hermes, although there might be some additional step involved that I do not fully understand yet.

diitaz93 commented 4 months ago

I think for the upcoming technologies it makes sense to have an ONT samplesheet and an Illumina samplesheet, but I think the workflow sample sheet is completely different from the sequencing sample sheet. It is just unfortunate that they have the same name. I think that putting the sequencing sample sheet and the workflow sample sheet under the same tag will irreversibly mix up these two file types.

ChrOertlin commented 4 months ago

I would argue that, while their purposes are different, both the flow-cell samplesheets and the workflow samplesheets are samplesheets. Each is just a file that contains samples and sample metadata that is consumed in one way or another.

ChrOertlin commented 4 months ago

Points

  1. Workflows are using different approaches to identify files. Some use 'vcf' and 'index', others 'vcf-index', to fetch similar files.
  2. Splitting tags has an upfront cost of knowing what tags to add.
  3. Making unique tags is easier, but can lead to the problem in point 1.
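
One way to picture points 1 and 2 together is a small normalisation step that splits a composite tag only when every part is a known general tag. This is a sketch under assumptions, not something hermes or Housekeeper does today: the `ATOMIC_TAGS` vocabulary, the `normalise` helper and the example tag `vcf-snv-clinical` are used purely for illustration.

```python
# Hypothetical vocabulary of general tags. Knowing this list up front is
# exactly the cost described in point 2.
ATOMIC_TAGS = {"vcf", "index", "nextflow", "samplesheet", "config"}


def normalise(tags: set[str]) -> set[str]:
    """Split a composite tag when all of its parts are known general tags."""
    result: set[str] = set()
    for tag in tags:
        parts = tag.split("-")
        if len(parts) > 1 and all(part in ATOMIC_TAGS for part in parts):
            result.update(parts)  # e.g. "vcf-index" becomes "vcf" and "index"
        else:
            result.add(tag)  # unknown parts: leave the tag untouched
    return result


composite = normalise({"vcf-index"})
separate = normalise({"vcf", "index"})
untouched = normalise({"vcf-snv-clinical"})
```

After normalisation, both tagging styles from point 1 resolve to the same set, so a workflow asking for 'vcf' and 'index' would also match a file that was tagged 'vcf-index'.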

Decision

  1. Investigate tag usage of "not yet in production" workflows. (short term, @ivadym , @ChrOertlin )
  2. Investigate workflow files, e.g. balsamic, and identify whether multiple, more general tags are sufficient to uniquely describe specific files (short term, @ivadym , @ChrOertlin )
  3. From 1, set up a framework, done by either pipeline developers or system development, that describes the tags used for files. (short term, @ivadym , @ChrOertlin )
  4. Make the framework available to others so that it can be easily found and understood
  5. Implement future tags
  6. Refactor past tags (long term)
  7. Set up project

henrikstranneheim commented 4 months ago

If we use StrEnum and go for many tags we can do:

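from enum import StrEnum, auto  # StrEnum requires Python 3.11+

class Prerequisite(StrEnum):
    # auto() gives each member its lower-cased name as value, e.g. "config"
    CONFIG = auto()
    SAMPLESHEET = auto()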

which is more MERRy (Maintainable, Extendable, Readable and Robust) and Pythonic.