Closed gwaybio closed 4 years ago
Am I understanding correctly:
So "site-level specific single cell files" means that there is a separate file made for each site and "one single cell file" (aka output_one_single_cell_file_only
)means that a single file is created that has the information from all sites concatenated? And the latter wouldn't work with tons of sites because it is simply too large?
And so when you set output_one_single_cell_file_only
to true, in the profiling config, it causes:
0.merge-single-cells to create only a single output file (not included in this PR?)
1.aggregate to load from that single file instead of iterating over sites
So "site-level specific single cell files" means that there is a separate file made for each site and "one single cell file" (aka output_one_single_cell_file_only)means that a single file is created that has the information from all sites concatenated? And the latter wouldn't work with tons of sites because it is simply too large?
This is exactly right. Plus, having the sites separate (and file paths defined in config_utils()
) enables us to more easily perform experiments adjusting for site-to-site variability. Note that we can definitely still do this with a single, large single-cell file.
And so when you set output_one_single_cell_file_only to true, in the profiling config, it causes: 0.merge-single-cells to create only a single output file (not included in this PR?) 1.aggregate to load from that single file instead of iterating over sites
That's the idea! We're just creating the recipe, so no actual ingredients (data) are ever to be included in this repo.
Thanks!
By "0.merge-single-cells to create only a single output file (not included in this PR?)" I meant that we still haven't made the necessary corresponding changes to 0.merge-single-cells?
Also, can you briefly explain why we want to have it as an option to have them all in the same file? What is the benefit?
I meant that we still haven't made the necessary corresponding changes to 0.merge-single-cells?
Ah, I see - we did make the changes (see 90a9c6cf07ef3713b12e1601dd069dc5f9a4dec5). In general, when making these refactoring changes, we want to not merge code that we know will break into master.
Also, can you briefly explain why we want to have it as an option to have them all in the same file? What is the benefit?
Its much easier to compute on, and distribute, a single file than hundreds of other smaller but equivalent-when-aggregated files. Our current image-based profiling pipeline works with single files.
There are definitely trade-offs though. Typically, when the single file gets HUGE, it becomes impossible to compute on the whole thing without fancy software that I don't have any experience with.
Two interconnected updates in this PR
In general, the philosophy of the last bullet point is summarized by: lets move all file and directory manipulation to the config file. Because data is flowing between scripts, it makes sense to define the outputs (and subsequently, the inputs!) only once.
I make these updates only in the
1.generate-profiles
module.