Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

Refactor include rules in smk suggestion #1228

Closed mathiasbio closed 11 months ago

mathiasbio commented 1 year ago

Description

Right now the rules in PON.smk and QC.smk are included separately from the ones defined in the snakemake_rules dict in rules.py, and the way rules are included in the SMK files is a bit messy. For example:

if config["analysis"]["analysis_workflow"] == "balsamic":
    rules_to_include = [rule for rule in rules_to_include if "umi" not in rule]

Some rules are also included even though they are never used, such as dragen_dna.rule.
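
To illustrate why the substring filter quoted above is fragile, here is a minimal sketch. The rule paths are hypothetical and chosen only for illustration (they are not the actual BALSAMIC rule list); the point is that `"umi" not in rule` matches anywhere in the path, so an unrelated rule can be dropped by accident.

```python
# Hypothetical rule paths, for illustration only (not the real BALSAMIC rule list).
rules_to_include = [
    "snakemake_rules/quality_control/fastp.rule",
    "snakemake_rules/umi/umi_consensus.rule",
    "snakemake_rules/quality_control/illumina_qc.rule",  # "illumina" contains "umi"
]


def drop_umi_rules(rules):
    """Apply the same substring filter as the snippet in the issue."""
    return [rule for rule in rules if "umi" not in rule]


kept = drop_umi_rules(rules_to_include)
print(kept)
```

With these made-up paths the filter also removes the `illumina_qc` rule, which has nothing to do with UMI processing. Selecting rules by explicit tags avoids this kind of accidental match.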

I think it would be nice to clean this up and create a function that extracts the rules based on the analysis tags in the sample config. The example below is a placeholder.

Suggested solution

from typing import Dict, List


class SnakemakeRules:
    """Class to extract relevant rules for provided tags."""

    def __init__(self, snakemake_rules_dict: Dict[str, Dict[str, List[str]]]):
        self.snakemake_rules_dict = snakemake_rules_dict

    def get_rules_by_tags(self, sequencing_type, analysis_type, workflow) -> List[str]:
        rules_to_include = []
        # Placeholder: keep only the rules whose "include_in" section lists the given
        # sequencing_type, analysis_type and workflow.
        return rules_to_include

snakemake_rules_dict: Dict = {
    "concatenate": {
        "path": "snakemake_rules/concatenation/concatenation.rule",
        "include_in": {
            "sequencing_type": [SequencingType.WGS, SequencingType.TARGETED],
            "analysis_type": [AnalysisType.SINGLE],
            "workflow": [WorkflowSolution.DRAGEN]
        }
    },
    "fastp": {
        "path": "snakemake_rules/quality_control/fastp.rule",
        "include_in": {
            "sequencing_type": [SequencingType.WGS, SequencingType.TARGETED],
            "analysis_type": [AnalysisType.SINGLE, AnalysisType.PAIRED, AnalysisType.PON],
            "workflow": [AnalysisWorkflow.BALSAMIC, AnalysisWorkflow.BALSAMIC_QC, AnalysisWorkflow.BALSAMIC_UMI]
        }
    },
    "fastqc": {
        "path": "snakemake_rules/quality_control/fastqc.rule",
        "include_in": {
            "sequencing_type": [SequencingType.WGS, SequencingType.TARGETED],
            "analysis_type": [AnalysisType.SINGLE, AnalysisType.PAIRED],
            "workflow": [AnalysisWorkflow.BALSAMIC, AnalysisWorkflow.BALSAMIC_QC, AnalysisWorkflow.BALSAMIC_UMI]
        }
    },
}
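
One possible way to fill in the placeholder get_rules_by_tags is sketched below. This is only an illustration under some assumptions: the enums are stand-ins defined here so the example runs on its own (their string values are guesses, not taken from BALSAMIC), and only the fastp and fastqc entries from the dict above are reproduced.

```python
from enum import Enum
from typing import Any, Dict, List


# Stand-in enums so the sketch is self-contained; the string values are assumptions.
class SequencingType(str, Enum):
    WGS = "wgs"
    TARGETED = "targeted"


class AnalysisType(str, Enum):
    SINGLE = "single"
    PAIRED = "paired"
    PON = "pon"


class AnalysisWorkflow(str, Enum):
    BALSAMIC = "balsamic"
    BALSAMIC_QC = "balsamic-qc"
    BALSAMIC_UMI = "balsamic-umi"


ALL_WORKFLOWS = [
    AnalysisWorkflow.BALSAMIC,
    AnalysisWorkflow.BALSAMIC_QC,
    AnalysisWorkflow.BALSAMIC_UMI,
]

# Two entries copied from the suggestion in the issue.
snakemake_rules_dict: Dict[str, Dict[str, Any]] = {
    "fastp": {
        "path": "snakemake_rules/quality_control/fastp.rule",
        "include_in": {
            "sequencing_type": [SequencingType.WGS, SequencingType.TARGETED],
            "analysis_type": [AnalysisType.SINGLE, AnalysisType.PAIRED, AnalysisType.PON],
            "workflow": ALL_WORKFLOWS,
        },
    },
    "fastqc": {
        "path": "snakemake_rules/quality_control/fastqc.rule",
        "include_in": {
            "sequencing_type": [SequencingType.WGS, SequencingType.TARGETED],
            "analysis_type": [AnalysisType.SINGLE, AnalysisType.PAIRED],
            "workflow": ALL_WORKFLOWS,
        },
    },
}


class SnakemakeRules:
    """Class to extract relevant rules for provided tags."""

    def __init__(self, snakemake_rules_dict: Dict[str, Dict[str, Any]]):
        self.snakemake_rules_dict = snakemake_rules_dict

    def get_rules_by_tags(self, sequencing_type, analysis_type, workflow) -> List[str]:
        """Return the paths of rules whose include_in section lists all three tags."""
        requested = {
            "sequencing_type": sequencing_type,
            "analysis_type": analysis_type,
            "workflow": workflow,
        }
        rules_to_include = []
        for rule in self.snakemake_rules_dict.values():
            include_in = rule["include_in"]
            # A rule is kept only if every requested tag appears in its section.
            if all(tag in include_in[section] for section, tag in requested.items()):
                rules_to_include.append(rule["path"])
        return rules_to_include


rules = SnakemakeRules(snakemake_rules_dict)
pon_rules = rules.get_rules_by_tags(
    SequencingType.WGS, AnalysisType.PON, AnalysisWorkflow.BALSAMIC
)
print(pon_rules)
```

Because the stand-in enums subclass str, plain strings read from the sample config would compare equal to the enum members, so the SMK files could pass config values straight into get_rules_by_tags. Whether the real enums behave this way would need to be checked.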

ivadym commented 11 months ago

Included in: https://github.com/Clinical-Genomics/BALSAMIC/issues/1343