Update for how we really run things

bethac07 commented 3 years ago

Adds an overview section that even non-DCP users can follow, as well as (brief) instructions along the way for non-Phenix users.

bethac07 commented 3 years ago

Currently, this draft still needs recommendations for the profile creation step; I still need to figure out how to handle the SQLite creation step specifically. Someone (presumably Niranj) can then add the post-aggregation step (with the recipe set to aggregate SQLites into per-well CSVs or not, depending on whether we're keeping cytominer_scripts, discussed below).

Running it with cytominer_scripts is one option, but as of now it won't work on the updated AMI (and the new pe2loaddata doesn't work on the old AMI; we tried, but it's REALLY hard to get Python 3.8 on there), so we could:
- Send the AMI to each partner (which we CAN do, it's trivial, but we should consider whether that's how we want to go, since it's kind of annoying to need a machine that does literally just that one step).
- Try to figure out why cytominer_scripts is being annoying about making backends on an updated Ubuntu 18 AMI. I have sunk a couple of hours into this without succeeding, but it's possible someone with R experience could do it faster. In that case, we can update the AMI to add R back in and then keep the instructions here more or less the same.
- Write a new non-R script that does all the valuable things that collate.R does, just not in R, i.e. handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, and deletion of the temp files so that you can do multiple rounds if you need to. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).
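
Of the steps in that last option, indexing is the easiest to illustrate in isolation. A minimal sketch using Python's built-in `sqlite3` module, assuming a backend with an `Image` table plus per-compartment tables keyed on `TableNumber` and `ImageNumber`. The table, column, and index names are assumptions for illustration, not a confirmed copy of what collate.R does:

```python
import sqlite3
from typing import List

# Assumed table names; a real backend may contain different compartments.
TABLES = ("Image", "Cells", "Cytoplasm", "Nuclei")


def index_backend(conn: sqlite3.Connection) -> List[str]:
    """Create lookup indexes on the (assumed) join keys and return their names."""
    created = []
    for table in TABLES:
        name = f"idx_{table.lower()}_table_image"  # hypothetical index name
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS {name} "
            f"ON {table} (TableNumber, ImageNumber)"
        )
        created.append(name)
    conn.commit()
    return created


def make_demo_backend() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for a per-plate SQLite backend."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} (TableNumber TEXT, ImageNumber INTEGER)"
        )
    return conn


if __name__ == "__main__":
    print(index_backend(make_demo_backend()))
```

Because of `IF NOT EXISTS`, the function can be re-run on a backend that was already indexed, which matters if a plate has to go through multiple rounds.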

shntnu commented 3 years ago

@bethac07 Hooray! And goodbye cellpainting_scripts!

I'm very much in favor of this option:

> We can write a new non-R script that does all the valuable things that collate.R does, but just not in R- ie handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, deletion of the temp files so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

At first, I thought this could live in pycytominer, perhaps in cyto_utils. A new collate.py would download the CellProfiler ExportToSpreadsheet CSV files locally and then call cytominer-database on them, respecting the folder structure.
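
A minimal sketch of what such a script might do, expressed as the commands it would issue. The bucket layout, the `build_collate_commands` helper, and the config file name are hypothetical. The `cytominer-database ingest SOURCE TARGET -c CONFIG` form is an assumption about that tool's CLI and should be checked against the installed version:

```python
from pathlib import PurePosixPath
from typing import List


def build_collate_commands(
    bucket: str,
    batch: str,
    plate: str,
    tmp_dir: str = "/tmp/collate",
    config: str = "ingest_config.ini",
) -> List[List[str]]:
    """Return the download, ingest, and upload commands for one plate."""
    work = PurePosixPath(tmp_dir) / batch / plate
    local_csv = str(work / "analysis")
    backend = str(work / f"{plate}.sqlite")
    # Hypothetical S3 layout; adjust to the project's real folder structure.
    remote_csv = f"s3://{bucket}/analysis/{batch}/{plate}"
    remote_backend = f"s3://{bucket}/backend/{batch}/{plate}/{plate}.sqlite"
    return [
        # `sync` mirrors the remote folder tree, so the structure is preserved.
        ["aws", "s3", "sync", remote_csv, local_csv],
        # An absolute path after "sqlite:///" yields four slashes in total.
        ["cytominer-database", "ingest", local_csv,
         f"sqlite:///{backend}", "-c", config],
        ["aws", "s3", "cp", backend, remote_backend],
    ]


if __name__ == "__main__":
    for cmd in build_collate_commands("my-bucket", "batch1", "PLATE1"):
        print(" ".join(cmd))
```

Each command list could then be passed to `subprocess.run(cmd, check=True)` so that a failed download or ingest stops the run before anything is uploaded.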

But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead. @gwaygenomics and @niranjchandrasekaran can decide – let us know what you think, folks (and sorry for the weekend ping! Please do ignore until next week)

bethac07 commented 3 years ago

I'm happy to write it (or at least take an initial pass at it), if we think it should be a python script.

One thing to keep in mind, though: whatever the solution is, it is 100% mandatory that it can be run in parallel, since it takes 12-18 hours per plate. If the recipe currently can't run plates in parallel rather than in sequence (this is my understanding, but I'm not sure it's true), the script needs to be executed separately from the rest of the recipe, at least for now.
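
One way to meet the parallelism requirement outside the recipe is a small driver that fans plates out to workers. A minimal sketch, in which `collate_plate` is a hypothetical stub standing in for the real per-plate work. A thread pool is used on the assumption that the real work is mostly waiting on external commands; that choice is not something decided in this thread:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collate_plate(plate):
    """Stub for the real per-plate work (hypothetical); returns an output name."""
    return f"{plate}.sqlite"


def collate_all(plates, max_workers=4, worker=collate_plate):
    """Run `worker` on every plate concurrently.

    Returns (results, failures): two dicts keyed by plate name, so one
    failing plate does not discard the work done for the others.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, plate): plate for plate in plates}
        for future in as_completed(futures):
            plate = futures[future]
            try:
                results[plate] = future.result()
            except Exception as exc:  # record the error and keep going
                failures[plate] = exc
    return results, failures


if __name__ == "__main__":
    done, failed = collate_all(["PLATE1", "PLATE2", "PLATE3"])
    print(sorted(done), sorted(failed))
```

Collecting failures per plate rather than raising on the first one suits jobs that run for many hours, because only the failed plates need to be resubmitted.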

shntnu commented 3 years ago

Good point.

By recipe, I actually meant Niranj’s gdoc.

But the script itself can live in the recipe repo, just that it won’t be run through the caller.

Please go ahead

On Sat, Jun 19, 2021 at 9:37 AM Beth Cimini wrote:

> I'm happy to write it (or at least take an initial pass at it), if we think it should be a python script.
>
> One thing to keep in mind though is that it is 100% mandatory for whatever this solution is to be able to be run in parallel since it takes 12-18 hours per plate. If the recipe currently doesn't handle running plates in parallel vs sequence (this is my understanding but not sure if it is true), the script needs to be executed separately from the rest of the recipe, at least for now.


-- -Shantanu

gwaybio commented 3 years ago

> We can write a new non-R script that does all the valuable things that collate.R does, but just not in R- ie handling automatic downloading of the files, creation of the database, indexing of the database, upload of the files, deletion of the temp files so that if you need to do multiple rounds, you can. This is fine and possibly less annoying than cytominer_scripts, but we do need to host it somewhere (unless we make it, say, part of the recipe).

> But I think this is too bespoke to live inside pycytominer. I think it should live inside profiling-recipe instead.

Agree too bespoke for pycytominer, but I think it's a better option than the recipe. IMO the recipe shouldn't contain any processing code, just an instruction set (recipe) on how to process the ingredients (data).

Perhaps the way forward is to add it to pycytominer.cyto_utils for now, and then spin it off into a new repo once it's more mature. Another option is to write it as a new tool from the start. I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

bethac07 commented 3 years ago

> it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier

I mean, CellProfiler is absolutely capable of writing to a single SQLite file in the first place, but then we would not be able to parallelize across as many CPUs as we currently do; it's a choice in how we've decided to run the data. We could also choose to write to a central MySQL database, but a decision was made at some point not to, presumably due to hosting costs/hassle.

bethac07 commented 3 years ago

I've started a branch to do this in - https://github.com/cytomining/pycytominer/tree/jump

shntnu commented 3 years ago

> We could also choose to write to a central MySQL database, but a decision was made at some point not to- presumably due to hosting costs/hassle.

Correct

shntnu commented 3 years ago

> I don't know all the details (next to none, actually), but it would be great in general if CellProfiler would make handling a lot of these downstream data hygiene tasks easier.

The details are pretty simple: collate.R is a wrapper for

- downloading CSV files locally (because they are usually on S3)
- calling cytominer-database
- cleaning
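
These three steps map naturally onto a wrapper that owns a temporary directory. A minimal sketch, in which `download` and `ingest` are hypothetical callables supplied by the caller (they are not part of collate.R or cytominer-database). The point is only that the local CSV copies are removed even when ingestion fails:

```python
import tempfile
from pathlib import Path


def collate(download, ingest, output_path):
    """Download CSVs into a temp dir, ingest them, then always clean up.

    download(csv_dir): fetches the CSV files into csv_dir.
    ingest(csv_dir, output_path): builds the backend at output_path,
    which must live outside the temp dir so that it survives cleanup.
    """
    with tempfile.TemporaryDirectory(prefix="collate_") as tmp:
        csv_dir = Path(tmp) / "csv"
        csv_dir.mkdir()
        download(csv_dir)
        ingest(csv_dir, Path(output_path))
    # Leaving the `with` block deletes tmp, including after an exception.
    return Path(output_path)
```

In a real script, `download` would wrap the S3 transfer and `ingest` would wrap the call to cytominer-database; both are left abstract here because their exact form is not settled in this discussion.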

We wouldn't need collate if

- the files are not on S3, or
- we had a way to directly call cytominer-database on S3 objects

So it comes down to the cytominer-database rewrite :)

For now, Beth's plan sounds sensible to keep things moving. But eventually, the rewrite is what will fix this issue. When we do that, we might discover that some simple changes to ExportToSpreadsheet would make the new cytominer-database / cytominer-transport easier to write, e.g. storing the CSVs in a certain way so that it is easy to read them as a Parquet dataset.

bethac07 commented 3 years ago

@shntnu, the initial pass for all my parts is complete. We'll want to make some edits if/when collate.py and its associated changes get pulled, and once it's in a more readable format I'll have my team propose edits, but our part is complete.

We do need some public documentation of the profiling steps again, but if we're definitely not wanting to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

shntnu commented 3 years ago

> We do need some public documentation of the profiling steps again, but if we're definitely not wanting to use cytominer_scripts anymore, I would say we should pull sooner rather than later and then add them as soon as we can.

Agreed, let's merge! Please do so, just in case you still have some pending commits to push.

Tagging @niranjchandrasekaran so he is aware that we should plan to move the gdoc to this handbook at some point in the near future (but not urgent for JUMP because the gdoc exists).

cytomining / profiling-handbook

Update for how we really run things #59