CytoSnake New Pathing implementation

About
Motivation
Implementation Approach
Reviewer focus
- Greg
- Dave
Usage and assumption

About

This PR introduces a new pathing system for CytoSnake. The description is organized into the following sections:

Motivation
Implementation Approach
Reviewer Focus
Usage and assumption

The Motivation section explains the purpose of this PR: the issues we encountered with the previous version and why this PR is necessary.

The Implementation Approach section describes in detail how the issue was resolved, including the software engineering approaches used to solve it.

The Reviewer Focus section tells each main reviewer what to concentrate on, so that reviewers do not have to work out for themselves what to look at.

The Usage and assumption section describes how cytosnake is used, including the assumptions made when using it.

Motivation

CytoSnake is a CLI tool that contains multiple reproducible workflows that analyze cell morphology readouts.

Recently, CytoSnake went through its first round of usage testing by @jenna-tomkinson, who pointed out some major issues with CytoSnake's strict naming scheme.

CytoSnake's workflows are written in Snakemake, a popular workflow manager that is widely used in the bioinformatics community. Snakemake is well known for its powerful and intuitive workflow design, which allows for generating scalable, portable and reproducible workflows. However, Snakemake's declarative naming scheme is very strict: a file whose name differs from the one declared in the workflow will cause the workflow to fail.

Below is an example. In Snakemake, we need to specify an input and an output.

rule read_inputs:
    input: "path/path/metadata"
    output: "path/path/output.txt"
    script: "scripts/read_inputs.py"

The example above is known as a rule, the building block of a Snakemake workflow. A rule specifies one step within your workflow. Here, users declare an input, an output and an executable (in this example, a script) that will generate the output. A complete workflow consists of a series of rules.

In this example, we specify the paths to the input, the output and the script that generates the output.

If we look closely, this rule only works if the input exists at exactly the declared path, name included.

Therefore, if the metadata folder were renamed to Metadata, the rule would fail even though the folder is in the correct location.

This is the main issue that @jenna-tomkinson ran into. Since CytoSnake was developed using cell-health-data, the expected naming scheme is identical to the cell-health-data naming scheme.

Implementation Approach

To solve this issue, a dynamic pathing system was developed. This means that we can pre-define the names of the files before passing them to Snakemake.

The _paths.yaml file is the star of this implementation because it predefines paths before they are passed to CytoSnake workflows. This is an attempt to solve the issue with Snakemake's strict naming, since it removes the need to declare paths strictly inside the Snakemake workflows.

Below are the contents of _paths.yaml:

{
    "project_dir_path": "/home/erikserrano/Development/CytoSnake/testing",
    "project_dir": {
        "metadata": "/home/erikserrano/Development/CytoSnake/testing/metadata",
        "workflows": "/home/erikserrano/Development/CytoSnake/testing/workflows",
        ".cytosnake": "/home/erikserrano/Development/CytoSnake/testing/.cytosnake",
        "data": "/home/erikserrano/Development/CytoSnake/testing/data",
        "configs": "/home/erikserrano/Development/CytoSnake/testing/configs"
    },
    "config_dir": {
        "configuration": "/home/erikserrano/Development/CytoSnake/testing/configs/configuration.yaml",
        "analysis_configs": {
            "dp_aggregator_config": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/dp_aggregator_config.yaml",
            "dp_data_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/dp_data_configs.yaml",
            "single_cell_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/single_cell_configs.yaml",
            "consensus_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/consensus_configs.yaml",
            "normalize_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/normalize_configs.yaml",
            "feature_select_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/feature_select_configs.yaml",
            "annotate_configs": "/home/erikserrano/Development/CytoSnake/testing/configs/analysis_configs/annotate_configs.yaml"
        }
    },
    "workflow_dir": {
        "workflow": {
            "cp_process": "/home/erikserrano/Development/CytoSnake/testing/workflows/workflow/cp_process",
            "dp_process": "/home/erikserrano/Development/CytoSnake/testing/workflows/workflow/dp_process"
        },
        "envs": {
            "cytominer_env": "/home/erikserrano/Development/CytoSnake/testing/workflows/envs/cytominer_env.yaml",
            "dp_process": "/home/erikserrano/Development/CytoSnake/testing/workflows/envs/dp_process.yaml"
        },
        "scripts": {
            "dp_aggregate": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/dp_aggregate.py",
            "normalize": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/normalize.py",
            "build_dp_consensus": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/build_dp_consensus.py",
            "annotate": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/annotate.py",
            "dp_normalize": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/dp_normalize.py",
            "feature_select": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/feature_select.py",
            "dp_build_consensus": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/dp_build_consensus.py",
            "consensus": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/consensus.py",
            "aggregate_cells": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/aggregate_cells.py",
            "merge_logs": "/home/erikserrano/Development/CytoSnake/testing/workflows/scripts/merge_logs.py"
        },
        "rules": {
            "feature_select": "/home/erikserrano/Development/CytoSnake/testing/workflows/rules/feature_select.smk",
            "preprocessing": "/home/erikserrano/Development/CytoSnake/testing/workflows/rules/preprocessing.smk",
            "merge_logs": "/home/erikserrano/Development/CytoSnake/testing/workflows/rules/merge_logs.smk",
            "dp_process": "/home/erikserrano/Development/CytoSnake/testing/workflows/rules/dp_process.smk",
            "common_cp": "/home/erikserrano/Development/CytoSnake/testing/workflows/rules/common_cp.smk"
        }
    }
}
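
As a rough illustration of how a nested mapping like the one above could be produced, here is a minimal Python sketch that walks a directory tree and keys every entry by its name (directories) or stem (files). The function name `map_directory` is hypothetical and is not taken from this PR; the real layout of _paths.yaml (for example the separate project_dir, config_dir and workflow_dir sections) involves more logic than this.

```python
import tempfile
from pathlib import Path


def map_directory(root: Path) -> dict:
    """Recursively map entry names to absolute paths (hypothetical helper).

    Directories become nested dicts keyed by directory name; files are keyed
    by their stem (name without extension). Two files sharing a stem in the
    same folder would overwrite each other, which a real implementation
    would have to handle.
    """
    mapping = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            mapping[entry.name] = map_directory(entry)
        else:
            mapping[entry.stem] = str(entry.resolve())
    return mapping


if __name__ == "__main__":
    # Build a throwaway tree that loosely mimics a configs folder.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "configs" / "analysis_configs").mkdir(parents=True)
        (root / "configs" / "configuration.yaml").write_text("")
        (root / "configs" / "analysis_configs" / "normalize_configs.yaml").write_text("")
        print(map_directory(root))
```

Keying files by stem mirrors the entries shown above (for example "normalize" pointing at normalize.py), but that resemblance is the only thing this sketch is based on.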

However, predefining paths adds another layer of complexity: implementing helper functions (next PR). The sole purpose of the helper functions is to declare paths dynamically in Snakemake workflows, so users do not have to follow a specific naming scheme for the workflows to execute successfully. These functions will interact with the _paths.yaml file to make path declaration much more dynamic.

Below is an example of how the helper functions will interact with configs.yaml and _paths.yaml:
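
Since the helper functions are left for the next PR, the following is only a sketch of what a lookup against the paths mapping might look like. The name `get_path` and the shortened `/project/...` paths are hypothetical. The sketch parses the text with `json` only because the contents shown above happen to be valid JSON; a real implementation would more likely use a YAML parser.

```python
import json

# Shortened, hypothetical stand-in for the contents of _paths.yaml.
PATHS_TEXT = """
{
    "project_dir": {
        "metadata": "/project/metadata",
        "data": "/project/data"
    },
    "workflow_dir": {
        "scripts": {"normalize": "/project/workflows/scripts/normalize.py"}
    }
}
"""


def get_path(paths: dict, *keys: str) -> str:
    """Follow a chain of keys through the nested mapping and return a path."""
    node = paths
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"no entry for {'/'.join(keys)} in the paths mapping")
        node = node[key]
    if not isinstance(node, str):
        # The keys stopped at a section (a dict), not at a single path.
        raise KeyError(f"{'/'.join(keys)} is a section, not a single path")
    return node


paths = json.loads(PATHS_TEXT)
print(get_path(paths, "project_dir", "metadata"))
print(get_path(paths, "workflow_dir", "scripts", "normalize"))
```

With a lookup like this, a rule would ask for "the metadata directory" by key instead of hard-coding its name, so renaming the folder only requires regenerating the mapping.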

[Figure: Helperfunctions]

In addition, this allows output names to be set dynamically via extensions. Including extensions in output names is a best practice, because the extensions indicate which analyses were conducted within the workflows. For example:

# common skeleton
{file_name}_{ext1}_{ext2}.csv

# example if config = zscore
SQ00123_norm_zscore.csv

# example if config = scAN
SQ00123_norm_scAN.csv

The pathing implementation makes it possible to declare input/output names automatically. Users no longer have to manually rename the paths declared under the rules within the workflow.
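
The naming skeleton above can be sketched in a few lines of Python. The function name `build_output_name` and its parameters are hypothetical and only illustrate the `{file_name}_{ext1}_{ext2}.csv` pattern; they are not part of this PR.

```python
from __future__ import annotations

from pathlib import Path


def build_output_name(input_path: str, extensions: list[str], suffix: str = ".csv") -> str:
    """Join the input file's stem and the analysis extensions with underscores."""
    stem = Path(input_path).stem  # "SQ00123.sqlite" -> "SQ00123"
    return "_".join([stem, *extensions]) + suffix


# The extension list would come from the configs in practice (assumption).
print(build_output_name("SQ00123.sqlite", ["norm", "zscore"]))
print(build_output_name("SQ00123.sqlite", ["norm", "scAN"]))
```

The two calls reproduce the example names listed above, SQ00123_norm_zscore.csv and SQ00123_norm_scAN.csv.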

Reviewer focus

Greg

Mainly focus on the design of the implementation and on understanding the order of execution involved in creating a Project Directory. You will mostly be looking at the cytosnake_setup.py module. Here are the main questions to consider:

Is the current design for setting up the current working directory as a Project Directory practical?
Are there any potential assumptions that are not accounted for in this implementation?
Is there a specific case where this implementation will fail when setting up the Project Directory?
Are any parts of the design unclear?
Are there any impracticalities?

Dave

Implementation focus. Check whether the functions involved in the pathing follow software development best practices.

Mainly focus on cyto_paths.py and file_utils.py, specifically the functions that are called in cytosnake_setup.py.

Here are some expectations for your part:

Try not to focus too much on the formatting of the code. In the next PR I will manually run pre-commit to format the whole code base.
Pathing pitfalls (main focus): where the pathing may fail in specific cases. For example, creating paths that do not exist because the user is in a different directory.
Documentation: check that it reflects what each function actually does.
Potential data structure pitfalls. For example, places where a list would be better than a dict.

Usage and assumption

CLI

Init mode

The init mode allows users to turn the current directory into a project directory. It expects users to provide a metadata folder and plate datasets. If the experiment was done in replicates, a barcode file must be provided as well.

An example command of using the init mode is:

cytosnake init -d plate_data1.sqlite plate_data2.sqlite -m metadata_dir -b barcodes.txt

One can also use wildcards to declare multiple files.

cytosnake init -d *.sqlite -m metadata_dir -b barcodes.txt
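
To show why both forms of the command are equivalent, here is a minimal argparse sketch. It is not CytoSnake's actual parser: only the -d, -m and -b flags come from the commands above, while the parser name and the destination names (`data`, `metadata`, `barcode`) are hypothetical. In most Unix shells the `*.sqlite` pattern is expanded before the program starts, so the program simply receives a list of file names.

```python
import argparse


def build_init_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used by the init examples."""
    parser = argparse.ArgumentParser(prog="init-sketch")
    # One or more plate files; a shell-expanded wildcard arrives as a list.
    parser.add_argument("-d", dest="data", nargs="+", required=True)
    # Metadata directory.
    parser.add_argument("-m", dest="metadata", required=True)
    # Barcode file, only needed when plates were run in replicates.
    parser.add_argument("-b", dest="barcode", default=None)
    return parser


# Equivalent to what the program sees after the shell expands "*.sqlite".
args = build_init_parser().parse_args(
    ["-d", "plate_data1.sqlite", "plate_data2.sqlite", "-m", "metadata_dir", "-b", "barcodes.txt"]
)
print(args.data, args.metadata, args.barcode)
```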

Once a user provides the required files, the current directory is transformed into a Project Directory. A Project Directory lets CytoSnake know that the files used to initialize the current directory are being prepared for analysis. CytoSnake uses the .cytosnake directory as a landmark to know that a project is being conducted in the current directory (similar to how git recognizes a directory as a repo by using .git).

The .cytosnake folder has two purposes:

It lets CytoSnake know that the directory is a project directory.
It contains a _paths.yaml file that provides CytoSnake with pathing information.
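
The landmark idea can be sketched with pathlib. The function names below are hypothetical, and the upward search through parent directories is an assumption borrowed from how git behaves; this PR only states that the .cytosnake directory is checked in the current directory.

```python
from __future__ import annotations

from pathlib import Path

MARKER = ".cytosnake"


def is_project_dir(directory: Path) -> bool:
    """True if the directory contains the .cytosnake landmark folder."""
    return (directory / MARKER).is_dir()


def find_project_root(start: Path) -> Path | None:
    """Search `start` and its parents for the landmark (git-like, assumed)."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if is_project_dir(candidate):
            return candidate
    return None


print(is_project_dir(Path.cwd()))
```

Anchoring every path to the directory returned by a function like `find_project_root`, rather than to the caller's working directory, is one way to avoid paths that break when a command is run from a subfolder.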

More happens in the background when converting the current directory into a Project Directory. Assuming that CytoSnake has been pip installed, the init function requests a transfer of the files needed to conduct any analysis.

In the image above, at the box labeled CytoSnake Package, we see that init mode makes a call to load in the configs and workflows folders.

Run mode

Run mode allows CytoSnake to execute workflows found within the workflows/ directory.

Since CytoSnake already knows which inputs are present thanks to the _paths.yaml file, all users need to do is type:

# executing the cp_process workflow
cytosnake run cp_process

# executing the cp_process workflow with 9 cores
cytosnake run cp_process -c 9

Help mode

The help mode is executed by typing:

cytosnake help

This will print out the whole CLI documentation, covering all three modes.

If you are only interested in reading the documentation for one mode, you can simply type:

cytosnake run help

This will print out the help documentation for the run mode only.

WayScience / CytoSnake