WayScience / CytoSnake

Orchestrating high-dimensional cell morphology data processing pipelines
https://cytosnake.readthedocs.io
Creative Commons Attribution 4.0 International

Add barcode logic to CytoSnake's CLI #46

Closed axiomcura closed 1 year ago

axiomcura commented 1 year ago

About this PR

This PR adds logic to CytoSnake for handling user-provided CLI inputs. Specifically, this update introduces barcode logic.

By default, a barcode is not required as an input to run CytoSnake; however, there are some exceptions when barcodes are needed.

If a user provides a dataset that has been generated from multiple experiments, then multiple plate maps are associated with the generated data. This will require a barcode file in order for CytoSnake to know which plate dataset is associated with which experiment.

What's new?

Screenshot 2023-05-03 at 10 02 41 PM

Barcode Logic Design. The image above presents a diagram of how the barcode logic is handled within CytoSnake. In this example, we have 6 plate datasets that are separated into two groups. Each group represents an experiment that was conducted to generate the data. In addition, there is a metadata folder, where each platemap file is associated with one group of plate data. A) Demonstrates a user providing both plate data groups and the metadata but no barcode file; the init mode fails and the "raise barcode error" step lights up. B) Demonstrates a successful init run where the user provides all the necessary inputs, and the "conduct init" step lights up.

What dictates the barcode logic is the number of plate maps found within the metadata folder. If CytoSnake sees that there is more than 1 plate map, then it requires a barcode. Therefore, users must provide barcodes if multiple plate maps are present.
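The rule described above can be sketched in a few lines of Python. This is only an illustration, not CytoSnake's actual implementation: the names `BarcodeRequiredError`, `barcode_required`, and `check_barcode` are hypothetical, and the sketch assumes that plate maps are the `.csv` files sitting directly inside the metadata folder.

```python
from pathlib import Path
from typing import Optional


class BarcodeRequiredError(ValueError):
    """Hypothetical error: a barcode file is needed but was not provided."""


def barcode_required(metadata_dir: Path) -> bool:
    """Return True when more than one plate map is found.

    Assumption: every .csv file directly inside metadata_dir is a plate map.
    """
    platemaps = [p for p in metadata_dir.glob("*.csv") if p.is_file()]
    return len(platemaps) > 1


def check_barcode(metadata_dir: Path, barcode: Optional[Path]) -> None:
    """Raise if multiple plate maps are present and no barcode was given."""
    if barcode_required(metadata_dir) and barcode is None:
        raise BarcodeRequiredError(
            "Multiple plate maps found; a barcode file is required."
        )
```

With a single plate map, `check_barcode(metadata_dir, None)` returns without error. With two or more, it raises unless a barcode path is passed.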

Additional notes

Changes in workflow

Update on workflows:

axiomcura commented 1 year ago

@d33bs Hopefully I have addressed all your comments. Also, thanks for the great questions!

why are barcodes only required for multiple experiments (and not for single experiments)? For example, if two experiments need to be compared but they are processed individually, would we run into issues when attempting to compare things later on?

I will use this repo to explain my current understanding.

From what I understand, the barcode file provides an assay-to-platemap pairing, where the assay plates are sqlite files (in this case) from cytominer-database. In the barcode file we see that each assay is associated with a specific plate map (Plate_Map_Name column), which contains metadata that includes well position, perturbations, etc. Looking at the barcode structure, some plate map names repeat 3 times (first 3, middle 3 and last 3), indicating that 3 experiments were each conducted in triplicate. (Different plate map name = separate experiment)

Technically, there is no need to have a barcode for a single experiment because all of its plates share the same external factors (assuming that more than one plate was used in the experiment). Barcodes are only required when several separate experiments (3 in this example) were conducted on multiple plates. The barcode then identifies which plates belong to which experiment, so the correct metadata is mapped to those plates during downstream analysis.
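To make the pairing concrete, here is a minimal sketch of reading a barcode file and grouping assay plates by plate map. The `Plate_Map_Name` column comes from the barcode file discussed above; the `Assay_Plate_Barcode` column name, the plate and plate map names, and the helper `group_plates_by_platemap` are assumptions made up for this illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Illustrative barcode contents (made-up names): six plates, two plate maps.
BARCODE_CSV = """Assay_Plate_Barcode,Plate_Map_Name
plate_1,platemap_A
plate_2,platemap_A
plate_3,platemap_A
plate_4,platemap_B
plate_5,platemap_B
plate_6,platemap_B
"""


def group_plates_by_platemap(barcode_text: str) -> Dict[str, List[str]]:
    """Map each plate map name to the assay plates that use it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(barcode_text)):
        groups[row["Plate_Map_Name"]].append(row["Assay_Plate_Barcode"])
    return dict(groups)


groups = group_plates_by_platemap(BARCODE_CSV)
# Each distinct plate map name corresponds to a separate experiment.
print(len(groups), "experiments found")
```

Each key of `groups` is one experiment, and its value lists the replicate plates that share that plate map.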

would we run into issues when attempting to compare things later on?

Since the metadata (platemaps) can be incorporated into the single-cell / aggregate morphological profiles, you can stratify them by experiment. The metadata is merged into the morphology profiles using pycytominer's annotate, which requires both the profile and the platemap as inputs.

However, one needs to map each assay to its associated platemap, which CytoSnake does when annotating multiple plate datasets (assays).
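As a rough illustration of that mapping step, the sketch below looks up the plate map for a given plate through the barcode pairing and then joins the plate map metadata onto the profile rows by well. It is a plain-Python stand-in, not pycytominer's annotate and not CytoSnake's code: the function names, the `well` key, and all of the data are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def annotate_rows(profiles: List[Row], platemap: List[Row],
                  well_key: str = "well") -> List[Row]:
    """Attach plate map metadata to each profile row, joined on the well."""
    by_well = {row[well_key]: row for row in platemap}
    return [{**by_well[profile[well_key]], **profile} for profile in profiles]


def annotate_plate(plate: str, profiles: List[Row],
                   barcode: Dict[str, str],
                   platemaps: Dict[str, List[Row]]) -> List[Row]:
    """Pick the plate map paired with this plate, then annotate its profiles."""
    platemap = platemaps[barcode[plate]]
    return annotate_rows(profiles, platemap)


# Made-up example data: two experiments with different treatments in well A01.
barcode = {"plate_1": "platemap_A", "plate_4": "platemap_B"}
platemaps = {
    "platemap_A": [{"well": "A01", "treatment": "DMSO"}],
    "platemap_B": [{"well": "A01", "treatment": "compound_x"}],
}
profiles = [{"well": "A01", "cell_area": "412.0"}]

annotated = annotate_plate("plate_4", profiles, barcode, platemaps)
```

Without the barcode lookup, nothing in the profile rows themselves says whether well A01 should receive the metadata from `platemap_A` or `platemap_B`.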

Similarly: are there ever scenarios where we don't have the barcode file but need to run analyses on the experiments? Here especially I'm thinking about previously gathered data where one may no longer have access to all data, or perhaps the data is stored in an unrecognizable format. In these scenarios could you simulate the barcode file's data (providing notation that it's simulated) to help facilitate the work involved with this PR?

The barcodes only provide information that distinguishes which plates came from which experiment. Assuming that the data you are talking about came from 3 separate experiments and no barcodes were provided, a potential solution is to contact the person who generated the dataset and ask which plates came from which experiment.

However, in the scenario where no plate maps were provided, we will not know what external treatments were added to the cells or which experiments used which treatments/cell lines. Therefore, it will be difficult to simulate due to the lack of important metadata such as treatments, well positions, and cell lines used.

axiomcura commented 1 year ago

I have applied all the changes. Merging now. If more work needs to be done, please feel free to re-open this PR.