WayScience / CytoSnake

Orchestrating high-dimensional cell morphology data processing pipelines
https://cytosnake.readthedocs.io
Creative Commons Attribution 4.0 International

Decoupling rule modules into individual components #33

Closed axiomcura closed 1 year ago

axiomcura commented 1 year ago

About this PR

This PR separates rules into individual components. In the current version, some rule modules contain multiple rules that carry out different processes:

preprocess.smk

rule aggregate:
    input:
        sql_files=PLATE_DATA,
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        aggregate_profile=AGGREGATE_DATA,
        cell_counts=CELL_COUNTS,
    log:
        "logs/aggregate_{file_name}.log",
    conda:
        "../envs/cytominer_env.yaml"
    params:
        aggregate_config=config["config_paths"]["single_cell"],
    script:
        "../scripts/aggregate_cells.py"

rule annotate:
    input:
        aggregate_profile=AGGREGATE_DATA,
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        ANNOTATED_DATA,
    conda:
        "../envs/cytominer_env.yaml"
    log:
        "logs/annotate_{file_name}.log",
    params:
        annotate_config=config["config_paths"]["annotate"],
    script:
        "../scripts/annotate.py"

rule normalize:
    input:
        ANNOTATED_DATA,
    output:
        NORMALIZED_DATA,
    conda:
        "../envs/cytominer_env.yaml"
    log:
        "logs/normalized_{file_name}.log",
    params:
        normalize_config=config["config_paths"]["normalize"],
    script:
        "../scripts/normalize.py"

This is the preprocess.smk module currently implemented in CytoSnake. The rules aggregate, annotate, and normalize are tightly coupled because each rule expects the outputs of the previous one.

Tight coupling between rules forces developers to repeat the same code in their own modules.

For example, because the normalize rule lives inside preprocess.smk, a user who needs only normalization has to create another rule module that duplicates it.

In addition, this makes rule modules hard to reuse in larger workflows. If you’re designing a larger workflow that requires only normalization, you have to import the whole of preprocess.smk, which is not ideal. Decoupling solves this problem.

Separating modules into individual components

Separating each rule into its own independent module has its advantages: it removes repeated code and increases extensibility.

After decoupling preprocess.smk, it will look like this:

aggregate.smk

rule aggregate:
    input:
        sql_files=PLATE_DATA,
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        aggregate_profile=AGGREGATE_DATA,
        cell_counts=CELL_COUNTS,
    log:
        "logs/aggregate_{file_name}.log",
    conda:
        "../envs/cytominer_env.yaml"
    params:
        aggregate_config=config["config_paths"]["single_cell"],
    script:
        "../scripts/aggregate_cells.py"

annotate.smk

rule annotate:
    input:
        aggregate_profile=AGGREGATE_DATA,
        barcodes=BARCODES,
        metadata=METADATA_DIR,
    output:
        ANNOTATED_DATA,
    conda:
        "../envs/cytominer_env.yaml"
    log:
        "logs/annotate_{file_name}.log",
    params:
        annotate_config=config["config_paths"]["annotate"],
    script:
        "../scripts/annotate.py"

normalize.smk

rule normalize:
    input:
        ANNOTATED_DATA,
    output:
        NORMALIZED_DATA,
    conda:
        "../envs/cytominer_env.yaml"
    log:
        "logs/normalized_{file_name}.log",
    params:
        normalize_config=config["config_paths"]["normalize"],
    script:
        "../scripts/normalize.py"

Now that each component is independent, developers can import these modules into their workflows without any problem!
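
As a rough sketch of what this enables, a workflow's entry-point Snakefile could pull in only the modules it needs. The `rules/` directory, `config.yaml`, and `rules/common.smk` (standing in for wherever PLATE_DATA, ANNOTATED_DATA, and the other path variables are defined) are assumptions for illustration, not the actual CytoSnake layout.

```snakemake
# Snakefile: hypothetical entry point for a workflow that needs all three steps
configfile: "config.yaml"  # assumed location of the config read by the rules

# hypothetical module that defines PLATE_DATA, BARCODES, METADATA_DIR, ...
include: "rules/common.smk"

# each step is now its own module, so include only what the workflow needs
include: "rules/aggregate.smk"
include: "rules/annotate.smk"
include: "rules/normalize.smk"

rule all:
    input:
        # assumes NORMALIZED_DATA resolves to concrete paths
        # (no unresolved wildcards)
        NORMALIZED_DATA,
```

A workflow that starts from already annotated profiles could drop the aggregate.smk and annotate.smk includes and keep only normalize.smk.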

Another feature this PR introduces is the ability to inherit modules. Here’s an example: let’s say we’re creating a new module, but we also want it to be tightly coupled with the normalization step.

This is easily solved by including normalize.smk in your new rule module.

new_rule.smk

# let's inherit the normalize module
include: "./normalize.smk"

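rule new_rule:
    input:
        # consume the output of the included normalize rule
        NORMALIZED_DATA,
    output:
        # NEW_RULE_DATA is a hypothetical placeholder; reusing NORMALIZED_DATA
        # here would make this rule ambiguous with rule normalize
        NEW_RULE_DATA,
    script:
        "../scripts/new_script.py"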

The include directive is similar to Python’s import statement: it brings the rules defined in normalize.smk into new_rule.smk.

This is beneficial because users do not have to write a new module or repeat code within new_rule.smk.
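
To illustrate the effect, a hypothetical entry point that includes only new_rule.smk would also pick up the normalize rule through the nested include. The `rules/` directory and `rules/common.smk` are assumptions for this sketch, and `rules.new_rule.output` is used so the example does not depend on any particular output name.

```snakemake
# Snakefile: hypothetical entry point built on new_rule.smk
include: "rules/common.smk"    # hypothetical: defines the shared path variables
include: "rules/new_rule.smk"  # brings in rule normalize through its own include

# no separate include of normalize.smk should be needed here

rule all:
    input:
        # refers to whatever new_rule declares as its output;
        # assumes those paths contain no unresolved wildcards
        rules.new_rule.output,
```

Include paths are resolved relative to the file that contains the include statement, so the `./` path inside new_rule.smk points at a sibling file in the same directory.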

axiomcura commented 1 year ago

Great, merging!