bethac07 closed this 3 years ago
Currently, this draft needs the addition of profile-creation-step recommendations. I still need to figure out how to handle the SQLite creation step specifically. Someone (presumably Niranj) can then add the post-aggregation step (with the recipe set to aggregate SQLites into per-well CSVs or not, based on whether or not we're keeping cytominer-scripts below).
@bethac07 Hooray! And goodbye cellpainting_scripts!
I'm very much in favor of this option:
- We can write a new non-R script that does all the valuable things that collate.R does, just not in R, i.e., handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).
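To make the scope of that script concrete, here is a minimal sketch of the per-plate pipeline it describes (download, ingest, index, upload, clean up). The bucket name, paths, and index statement are illustrative assumptions, not the actual implementation; only the `cytominer-database ingest` subcommand is the real CLI.

```python
# Hypothetical sketch of the proposed collate.py pipeline; bucket, paths,
# and the index statement are made up for illustration.
def plate_commands(plate, bucket="example-bucket", tmp="/tmp/collate"):
    """Build the shell commands one collate round would run for a plate."""
    local = f"{tmp}/{plate}"
    db = f"{local}/backend.sqlite"
    return [
        # 1. download the per-site CSVs produced by ExportToSpreadsheet
        ["aws", "s3", "sync", f"s3://{bucket}/analysis/{plate}", local],
        # 2. ingest them into one SQLite database (cytominer-database CLI)
        ["cytominer-database", "ingest", local, f"sqlite:///{db}"],
        # 3. index the database for faster downstream aggregation
        ["sqlite3", db, "CREATE INDEX IF NOT EXISTS ix ON Image(TableNumber)"],
        # 4. upload the result, then 5. clean up so the next round has disk
        ["aws", "s3", "cp", db, f"s3://{bucket}/backend/{plate}/"],
        ["rm", "-rf", local],
    ]

cmds = plate_commands("SQ00014812")
```

Each command list could then be handed to `subprocess.run`, which keeps the steps inspectable and easy to rerun individually when a round fails partway.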
At first, I thought this could live in pycytominer, perhaps in cyto_utils. A new collate.py would download the CellProfiler ExportToSpreadsheet CSV files locally and then call cytominer-database on them, respecting the folder structure.
But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead. @gwaygenomics and @niranjchandrasekaran can decide – let us know what you think, folks (and sorry for the weekend ping! Please do ignore until next week).
I'm happy to write it (or at least take an initial pass at it), if we think it should be a python script.
One thing to keep in mind, though: it is 100% mandatory that whatever this solution is can be run in parallel, since it takes 12-18 hours per plate. If the recipe currently doesn't handle running plates in parallel vs. in sequence (this is my understanding, but I'm not sure it's true), the script will need to be executed separately from the rest of the recipe, at least for now.
Good point.
By recipe, I actually meant Niranj’s gdoc.
But the script itself can live in the recipe repo, just that it won’t be run through the caller.
Please go ahead
> We can write a new non-R script that does all the valuable things that collate.R does, just not in R, i.e., handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).
>
> But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead.
Agree too bespoke for pycytominer, but I think it's a better option than the recipe. IMO the recipe shouldn't contain any processing code, just an instruction set (recipe) on how to process the ingredients (data).
Perhaps the way forward is to add it to pycytominer.cyto_utils for now, and then spin it off into a new repo once it's more mature. Another option is to write it as a new tool from the start. I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.
> it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier
I mean, CellProfiler is absolutely capable of writing to a single SQLite file in the first place, but then we would not be able to parallelize across the number of CPUs that we currently do - it's a choice in how we've decided to run the data. We could also choose to write to a central MySQL database, but a decision was made at some point not to, presumably due to hosting costs/hassle.
I've started a branch to do this in - https://github.com/cytomining/pycytominer/tree/jump
> We could also choose to write to a central MySQL database, but a decision was made at some point not to, presumably due to hosting costs/hassle.
Correct
> I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.
The details are pretty simple – collate.R is a wrapper for cytominer-database. We wouldn't need collate if cytominer-database could operate directly on S3 objects. So it comes down to the cytominer-database rewrite :)
For now, the plan Beth has sounds sensible to keep things moving. But eventually, the rewrite is what will fix this issue. When we do that, we might discover that there are some simple changes that can be made in ExportToSpreadsheet to make the new cytominer-database / cytominer-transport easier to write, e.g. storing the CSVs in a certain way so that it is easy to read them as a Parquet dataset.
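One concrete "certain way" would be hive-style `key=value` directories, which dataset readers such as pyarrow.dataset can discover as partition columns. A minimal stdlib sketch, with made-up batch/plate/well names and a hypothetical `profiles.csv` filename:

```python
# Sketch of a hive-style partitioned layout for the per-well CSVs; the
# names are illustrative. Readers like pyarrow.dataset can recover the
# `key=value` path segments as partition columns.
from pathlib import Path

def well_csv_path(root, batch, plate, well):
    """Where a per-well CSV would live under a hive-style layout."""
    return Path(root) / f"batch={batch}" / f"plate={plate}" / f"well={well}" / "profiles.csv"

def parse_partitions(path):
    """Recover the partition columns from a hive-style path."""
    return dict(part.split("=", 1) for part in Path(path).parts if "=" in part)

p = well_csv_path("profiles", "2021_06_01_Batch1", "SQ00014812", "A01")
cols = parse_partitions(p)
```

With that layout, converting the tree to a Parquet dataset (or reading the CSVs directly as one) needs no extra metadata, because the partition keys are already encoded in the paths.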
@shntnu, initial pass for all my parts is complete. We'll want to make some edits if/when collate.py and its associated changes get pulled, and once it's in a more readable format I'll have my team propose edits, but our part is complete.
We do need some public documentation of the profiling steps again, but if we're definitely not wanting to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.
> We do need some public documentation of the profiling steps again, but if we're definitely not wanting to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.
Agreed, let's merge! Please do so, just in case you still have some pending commits to push.
Tagging @niranjchandrasekaran so he is aware that we should plan to move the gdoc to this handbook at some point in the near future (but not urgent for JUMP because the gdoc exists).
Adds an overview section that even non-DCP users can follow, as well as (brief) instructions along the way for non-Phenix users.