Closed axiomcura closed 1 year ago
Hello @d33bs and @gwaybio
Hopefully I have covered all your comments! Ready for the next round!
Thank you for all your constructive comments!
@d33bs @gwaybio
Hopefully I have answered all your comments! Some comments brought interesting points that were added into the issues section.
I'd be happy to merge this PR right now if there aren't any objections.
you already have my approval 😄 - feel free to merge when you are happy with it.
Sounds good! Merging!
About
This PR introduces a new pathing system for CytoSnake. The PR is divided into four parts:
The Motivation section explains the purpose of this PR: the issues we encountered with the previous version and why this PR is necessary.
The Implementation Approach section describes in detail how the issue was resolved and which software engineering approaches were used to solve it.
The Reviewer focus section tells each main reviewer what to concentrate on, so that reviewers do not have to work out what to look at.
The Usage and assumptions section describes how cytosnake is used, including the assumptions made when using cytosnake.
Motivation
CytoSnake is a CLI tool that contains multiple reproducible workflows that analyze cell morphology readouts.
Recently, CytoSnake went through its first round of usage testing by @jenna-tomkinson, who pointed out some major issues with CytoSnake's strict naming scheme.

CytoSnake's workflows are written in Snakemake, a popular workflow manager that is widely used in the bioinformatics community. Snakemake is well known for its powerful and intuitive workflow design, which allows for generating scalable, portable and reproducible workflows. However, Snakemake's declarative naming scheme is very strict: a file whose name differs from the declared one will immediately cause the program to fail.

Below is an example. In Snakemake, we need to specify an input and an output.

The example above is what is known as a rule. This is the building block of a Snakemake workflow. A rule specifies a single step within your workflow. It requires users to add an input, an output and an executable (in this example, script) that will generate the output. A complete workflow consists of a series of rules.

In this example, we specify the paths to the input, the output and the script that generates the output. If we look closely, this rule will only work if the input file name matches exactly. Therefore, if the metadata folder were renamed to Metadata, the rule would fail even though the path otherwise points to the correct location.

These are the main issues that @jenna-tomkinson was having. Since CytoSnake was developed using cell-health-data, the expected naming scheme is identical to the cell-health-data naming scheme.

Implementation Approach
To solve this issue, a dynamic pathing system was developed. This means that we can pre-define the names of files before passing them to Snakemake.

The _paths.yaml file is the star of this implementation because it predefines paths before they are submitted to CytoSnake workflows. This addresses the issue with Snakemake's strict naming and removes the need to declare paths explicitly in Snakemake workflows.

Below are the contents of _paths.yaml:
However, this creates another layer of complexity: implementing helper functions (next PR). The sole purpose of the helper functions is to declare paths dynamically in Snakemake workflows. Therefore, users do not have to worry about following a specific naming scheme for the workflows to execute successfully. These functions will interact with the _paths.yaml file to make path declaration much more dynamic.

Below is an example of how the helper_functions will interact with configs.yaml and _paths.yaml:
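As a rough sketch of this interaction (not the actual helper functions, which are planned for the next PR), a lookup helper in Python might resolve a rule's input from the two files. The dictionaries stand in for the parsed YAML, and every key, value and function name here is a hypothetical chosen for illustration.

```python
from pathlib import Path

# Stand-ins for the parsed contents of _paths.yaml and configs.yaml.
# All keys and values below are hypothetical, chosen only for illustration.
PATHS = {
    "project_dir": "/home/user/my_project",
    "data": {"metadata": "Metadata", "plate_data": "data/plate1.sqlite"},
}
CONFIGS = {"aggregate": {"input": "plate_data"}}


def get_input_path(rule_name: str, paths: dict = PATHS, configs: dict = CONFIGS) -> Path:
    """Resolve the input path for a rule without hard-coding any file name."""
    data_key = configs[rule_name]["input"]  # which dataset the rule consumes
    try:
        relative = paths["data"][data_key]  # the name the user actually used
    except KeyError as err:
        raise KeyError(f"'{data_key}' is not registered in the paths file") from err
    return Path(paths["project_dir"]) / relative
```

With a helper along these lines, a rule would ask for get_input_path("aggregate") instead of spelling out a literal path, so renaming a folder only requires the paths file to reflect the new name.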
In addition, this allows names to be set dynamically via extensions. Including an extension in output names is a good practice, because the extension gives an idea of which analysis was conducted within the workflow. For example:
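One way to picture extension-based naming is the small Python sketch below. The function name, the underscore separator and the example extension are assumptions made for illustration and are not taken from CytoSnake itself.

```python
from pathlib import Path


def output_name(input_file: str, extension: str) -> str:
    """Build an output file name from an input name and a descriptive extension."""
    # Drop any directories and every suffix, keeping only the base name.
    stem = Path(input_file).name.split(".")[0]
    # The extension states which analysis produced the file.
    return f"{stem}_{extension}"
```

Under these assumptions, output_name("data/plate1.sqlite", "aggregate.csv.gz") gives "plate1_aggregate.csv.gz", so the output name follows the input name and the user never has to rename it by hand.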
The flexibility of the pathing implementation allows input and output names to be declared automatically. This makes it much easier to declare input and output file names without the user manually renaming the paths declared under the rules within the workflow.

Reviewer focus
Greg
Focus mainly on the design of the implementation, in particular the order of execution involved in creating a Project Directory. You will mainly be looking at the cytosnake_setup.py module. Here are some main things that I w... Project Directory ... practical Project Directory?
Dave
Implementation focus. Check whether the functions involved in the pathing follow best software development standards.
Mainly focus on:
cyto_paths.py
file_utils.py
typically the functions that are called in cytosnake_setup.py
Here are some expectations for your part:
pre-commits to format the whole code base.
lists rather than dict
Usage and assumption
Init mode
The init mode allows users to turn the current directory into a project directory. The init mode expects users to provide a metadata folder and plate datasets; if the experiment was done in replicates, a barcode file must be added as well.

An example command using the init mode is:

One can also use wildcards to declare multiple files.
Once a user inputs the required files, the current directory is transformed into a Project Directory. A Project Directory lets CytoSnake know that the files used for initialization in the current directory are being prepared for analysis. CytoSnake uses the .cytosnake directory as a landmark to know that a project is being conducted in the current directory (similar to how git recognizes a directory as a repo by using .git).

The .cytosnake folder has two purposes:
It lets CytoSnake know that the directory is a project directory.
It holds the _paths.yaml file that provides CytoSnake with pathing information.

There is more happening in the background when converting the current directory to a Project Directory. Assuming that CytoSnake has been pip installed, the init function makes a request to transfer the files necessary to conduct any analysis.

If you look at the image above where it says CytoSnake Package, we see that init mode makes a call to load in the configs and the workflow folders.

Run mode
Run mode allows CytoSnake to execute workflows found within the workflows/ directory.

Since CytoSnake already knows which inputs are present thanks to the _paths.yaml file, all users need to do is type:

Help mode
The help mode is executed by typing:
This will print out the whole CLI documentation with the three modes together.
If you are only interested in reading the documentation of one mode, you can simply type:
This will print the help documentation for the run mode only.