Initial DAG implementation

axiomcura commented 2 years ago

About

This PR demonstrates the initial DAG implemented in CytoPipe. The DAG is currently named preprocessing.smk and is located in the rules/ directory.

The configuration file is included in this pull request, but it is not usable yet: the structure of the config file and the default parameter values still need to be discussed.

File structures

Each component of the workflow has its own designated directory:

rules/ --> Contains all the DAG files
configuration/ --> Contains configuration files in YAML format
scripts/ --> Contains all the scripts used for analysis
data/ --> Directory where all the data is stored (.sqlite files not included)
envs/ --> Contains YAML files defining the conda environments and their software stacks
debugging/ --> Contains notebooks that reproduce the bug and show potential solutions

The names of the directories are open to change, @gwaygenomics.

Preprocessing.smk DAG

The preprocessing DAG contains 3 rules and produces 4 output files. The expected output files are displayed within the Snakefile.

output:


include: "rules/preprocessing.smk"

# Snakefile 
rule all:
    input:
        # expected outputs from the first DAG "Preprocessing"
        expand("results/preprocessing/{plate_id}.aggregate.csv.gz", plate_id=PLATE_IDS),
        expand("results/preprocessing/{plate_id}.cell_counts.tsv", plate_id=PLATE_IDS),
        expand("results/preprocessing/{plate_id}_augmented.csv.gz", plate_id=PLATE_IDS),
        expand("results/preprocessing/{plate_id}.normalized.csv.gz", plate_id=PLATE_IDS)
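
As a rough illustration of what these expand() calls resolve to, here is a plain-Python stand-in (not Snakemake's own implementation). The plate ID is the one that shows up in the test run further down this thread, and `expected_targets` is a hypothetical helper name used only for this sketch.

```python
# Plain-Python stand-in for the expand() calls in rule all (illustration only).
PLATE_IDS = ["SQ00014614"]  # assumption: a single example plate ID

# Patterns copied from the input section of rule all.
PATTERNS = [
    "results/preprocessing/{plate_id}.aggregate.csv.gz",
    "results/preprocessing/{plate_id}.cell_counts.tsv",
    "results/preprocessing/{plate_id}_augmented.csv.gz",
    "results/preprocessing/{plate_id}.normalized.csv.gz",
]


def expected_targets(patterns, plate_ids):
    """Fill every pattern with every plate ID (hypothetical helper)."""
    return [p.format(plate_id=pid) for p in patterns for pid in plate_ids]


targets = expected_targets(PATTERNS, PLATE_IDS)
for path in targets:
    print(path)
```

With one plate ID this gives four target paths, one per expected output file; each additional plate ID adds four more.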

In the preprocessing.smk DAG, there are 3 rules, which are:

aggregate:
- Calls the aggregate_cells.py script to aggregate single-cell profiles into aggregated profiles
- Produces the cell count and aggregate output files, tagged with the cell_counts and aggregate suffixes respectively
annotate:
- Calls the annotate.py script to annotate the aggregated data with metadata
- Produces an annotated profile tagged with the augmented suffix
normalize:
- Calls the normalize.py script to normalize the features within the augmented aggregated profiles
- Produces a normalized dataset tagged with the normalized suffix

Unfortunately, some of the parameters are hard-coded within the scripts. In the future, default parameters will be set in the configuration file.
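
To make the idea concrete, here is a minimal sketch of how defaults could be overlaid with user-supplied values once the config structure is settled. Everything in it is an assumption: the keys, the values, and the `resolve_params` helper are hypothetical placeholders, since the actual layout of the configuration file is still under discussion.

```python
from copy import deepcopy

# Hypothetical defaults; the real keys and values are still to be decided.
DEFAULT_PARAMS = {
    "aggregate": {"operation": "median"},
    "normalize": {"method": "standardize"},
}


def resolve_params(user_config):
    """Overlay user-supplied values on the defaults, rule by rule."""
    params = deepcopy(DEFAULT_PARAMS)  # never mutate the shared defaults
    for rule_name, overrides in user_config.items():
        params.setdefault(rule_name, {}).update(overrides)
    return params


# Example: the user overrides only one (hypothetical) normalize setting.
resolved = resolve_params({"normalize": {"method": "robustize"}})
print(resolved)
```

The scripts would then read their parameters from the resolved dictionary instead of carrying hard-coded values.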

All the analysis executed in the preprocessing.smk DAG is conducted in its own designated conda environment, which contains the appropriate software stack and versions.

    # in preprocessing.smk
    conda:
        "../envs/cytominer_env.yaml"

This ensures that CytoPipe's analysis is reproducible and portable.

axiomcura commented 2 years ago

Hello @gwaygenomics !

I submitted a commit: https://github.com/WayScience/CytoPipe/commit/67828fb3af04336c529db4eebfd2c0f8f3958e04

It removes the files that are not a part of the workflow. They were mainly used for debugging purposes (in this case, the segfault debug).

Apologies for the confusion!

Also, I answered some of your questions in the comments section.

axiomcura commented 2 years ago

Alright @gwaygenomics I think that's all of them!

Some of them are left unresolved because I answered your questions there or am awaiting your reply, but all comments have been addressed.

Feel free to unresolve any threads if more discussion and/or changes are required!

axiomcura commented 2 years ago

Hello @gwaygenomics

I have addressed all your suggestions and comments! Feel free to unresolve any threads if more changes or discussion are required!

Here is a comment that I left on one of your suggestions: https://github.com/WayScience/CytoPipe/pull/1#discussion_r849788258

axiomcura commented 2 years ago

Hey @gwaygenomics

I ran some tests before merging and found some small bugs. The fixes have been applied.

It failed the dry run (dry run = only checking input and output files, no execution):

Building DAG of jobs...
MissingInputException in line 42 of /home/erikserrano/Projects/PR/CytoPipe/rules/preprocessing.smk:
Missing input files for rule annotate:
    output: results/preprocessing/SQ00014614_augmented.csv.gz
    affected files:
        results/preprocessing/SQ00014614_aggregated.csv.gz

It turns out that the bug was a file name problem (maybe this is what you were referring to in https://github.com/WayScience/CytoPipe/pull/1#discussion_r846648552): _aggregated should have been _aggregate.
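
To show why a one-word suffix mismatch is enough to stop the run, here is a simplified model of the check (it is not how Snakemake resolves inputs internally, which also involves wildcard matching). The `find_missing_inputs` helper and the .sqlite input path are hypothetical; the other file names come from the error message above.

```python
def find_missing_inputs(rules, existing_files):
    """Return {rule name: [inputs that no rule produces and that do not exist]}."""
    produced = {out for rule in rules.values() for out in rule["output"]}
    available = produced | set(existing_files)
    missing = {}
    for name, rule in rules.items():
        gaps = [path for path in rule["input"] if path not in available]
        if gaps:
            missing[name] = gaps
    return missing


prefix = "results/preprocessing/SQ00014614"
rules = {
    "aggregate": {
        "input": ["data/SQ00014614.sqlite"],  # assumption: hypothetical input path
        "output": [f"{prefix}_aggregate.csv.gz", f"{prefix}_cell_counts.tsv"],
    },
    "annotate": {
        # "_aggregated" is the typo: no rule produces a file with this suffix
        "input": ["data/barcode_platemap.csv", f"{prefix}_aggregated.csv.gz"],
        "output": [f"{prefix}_augmented.csv.gz"],
    },
}
existing = ["data/SQ00014614.sqlite", "data/barcode_platemap.csv"]

missing = find_missing_inputs(rules, existing)
print(missing)
```

Only the annotate rule is reported, with the _aggregated file as its missing input; renaming that input to _aggregate makes the result empty.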

Another bug came from Snakemake's expand() function. I assumed its result would arrive as a list of inputs when injected into the Python script, but in fact it arrives as one long string (example: input1.csv input2.csv input3.csv). The good news is that everything is separated by whitespace, so a simple .split() will do the job.
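
A minimal sketch of that fix, assuming the script may receive either one whitespace-separated string or an actual list; `as_path_list` is a hypothetical helper name, not something taken from the scripts in this PR.

```python
def as_path_list(raw_inputs):
    """Return a list of paths from a whitespace-separated string or a sequence."""
    if isinstance(raw_inputs, str):
        # str.split() with no arguments splits on any run of whitespace
        return raw_inputs.split()
    return list(raw_inputs)


paths_from_string = as_path_list("input1.csv input2.csv input3.csv")
paths_from_list = as_path_list(["input1.csv", "input2.csv"])
print(paths_from_string)
print(paths_from_list)
```

One caveat: splitting on whitespace would break any path that itself contains a space, so this only holds as long as the file names stay space-free.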

There are also some formatting changes. I downloaded snakefmt, which is the same as black but for Snakemake files :)

The applied changes show no errors now!


Select jobs to execute...

[Thu Apr 14 10:17:28 2022]
rule annotate:
    input: data/barcode_platemap.csv, results/preprocessing/SQ00014614_aggregate.csv.gz
    output: results/preprocessing/SQ00014614_augmented.csv.gz
    jobid: 2
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/a5ae1f72f34b503771a449672ce6c5a2
Activating conda environment: .snakemake/conda/a5ae1f72f34b503771a449672ce6c5a2
['results/preprocessing/SQ00014614_aggregate.csv.gz']
Annotating profiles ...
[Thu Apr 14 10:17:32 2022]
Finished job 2.
1 of 3 steps (33%) done
Select jobs to execute...

[Thu Apr 14 10:17:32 2022]
rule normalize:
    input: results/preprocessing/SQ00014614_augmented.csv.gz
    output: results/preprocessing/SQ00014614_normalized.csv.gz
    jobid: 3
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/a5ae1f72f34b503771a449672ce6c5a2
Activating conda environment: .snakemake/conda/a5ae1f72f34b503771a449672ce6c5a2
[Thu Apr 14 10:17:35 2022]
Finished job 3.
2 of 3 steps (67%) done
Select jobs to execute...

[Thu Apr 14 10:17:35 2022]
localrule all:
    input: results/preprocessing/SQ00014614_aggregate.csv.gz, results/preprocessing/SQ00014614_cell_counts.tsv, results/preprocessing/SQ00014614_augmented.csv.gz, results/preprocessing/SQ00014614_normalized.csv.gz
    jobid: 0
    resources: tmpdir=/tmp

[Thu Apr 14 10:17:35 2022]
Finished job 0.
3 of 3 steps (100%) done

Bug has been fixed: 33462cdd2dd2a849ff3586fd5bab30b682e5a654
More typos fixed: bea4c80afb6032b14b96966fe661fa1d5f9b2bc9

gwaybio commented 2 years ago

Great! Thanks for checking this - feel free to merge!
