broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License
6 stars 4 forks source link

Refactor single cell files and arguments #14

Closed gwaybio closed 4 years ago

gwaybio commented 4 years ago

Two interconnected updates in this PR

In general, the philosophy of the last bullet point is summarized by: lets move all file and directory manipulation to the config file. Because data is flowing between scripts, it makes sense to define the outputs (and subsequently, the inputs!) only once.

I make these updates only in the 1.generate-profiles module.

ErinWeisbart commented 4 years ago

Am I understanding correctly:

So "site-level specific single cell files" means that there is a separate file made for each site and "one single cell file" (aka output_one_single_cell_file_only)means that a single file is created that has the information from all sites concatenated? And the latter wouldn't work with tons of sites because it is simply too large?

And so when you set output_one_single_cell_file_only to true, in the profiling config, it causes: 0.merge-single-cells to create only a single output file (not included in this PR?) 1.aggregate to load from that single file instead of iterating over sites

gwaybio commented 4 years ago

So "site-level specific single cell files" means that there is a separate file made for each site and "one single cell file" (aka output_one_single_cell_file_only)means that a single file is created that has the information from all sites concatenated? And the latter wouldn't work with tons of sites because it is simply too large?

This is exactly right. Plus, having the sites separate (and file paths defined in config_utils()) enables us to more easily perform experiments adjusting for site-to-site variability. Note that we can definitely still do this with a single, large single-cell file.

And so when you set output_one_single_cell_file_only to true, in the profiling config, it causes: 0.merge-single-cells to create only a single output file (not included in this PR?) 1.aggregate to load from that single file instead of iterating over sites

That's the idea! We're just creating the recipe, so no actual ingredients (data) are ever to be included in this repo.

ErinWeisbart commented 4 years ago

Thanks!

By "0.merge-single-cells to create only a single output file (not included in this PR?)" I meant that we still haven't made the necessary corresponding changes to 0.merge-single-cells?

Also, can you briefly explain why we want to have it as an option to have them all in the same file? What is the benefit?

gwaybio commented 4 years ago

I meant that we still haven't made the necessary corresponding changes to 0.merge-single-cells?

Ah, I see - we did make the changes (see 90a9c6cf07ef3713b12e1601dd069dc5f9a4dec5). In general, when making these refactoring changes, we want to not merge code that we know will break into master.

Also, can you briefly explain why we want to have it as an option to have them all in the same file? What is the benefit?

Its much easier to compute on, and distribute, a single file than hundreds of other smaller but equivalent-when-aggregated files. Our current image-based profiling pipeline works with single files.

There are definitely trade-offs though. Typically, when the single file gets HUGE, it becomes impossible to compute on the whole thing without fancy software that I don't have any experience with.