cytomining / profiling-recipe

Image-based Profiling Recipe
BSD 3-Clause "New" or "Revised" License

Recipe wish list #30

Open niranjchandrasekaran opened 2 years ago

niranjchandrasekaran commented 2 years ago

Since the profiling-recipe was finalized for JUMP, the number of people who have interacted with the recipe has dramatically increased (both JUMP members and image analysts). Given that the recipe was always written and rewritten to satisfy the needs of the JUMP pilot experiments, I am surprised that it is robust and has, so far, not failed catastrophically. That doesn't mean the code is perfect; it still needs a lot of work.

I will work on tidying up the code so that it is easier for others to read and contribute to the codebase.

Apart from the above, other changes also need to be made to the recipe because there have been several feature requests, both from before and after the version for JUMP was frozen. These feature requests range from requiring minor changes to the recipe to requiring major changes to both the recipe and pycytominer.

I have listed all the feature requests below, with some comments and a score for how easy or difficult it will be to implement each (1 is easy and requires the least amount of time; 5 is difficult and requires the most).

Add feature analysis Difficulty: 3

Sample images Difficulty: 3

Calculate Replicate correlation and Percent Replicating Difficulty: 2

Beth mentioned this in https://github.com/cytomining/profiling-recipe/issues/29

Rename quality_control Difficulty: 1

Adding second order features Difficulty: 5

Adding dispersion order features Difficulty: 2

Adding replicate correlation feature selection as an option Difficulty: 4

Adding git and the AWS CLI to the conda environment Difficulty: 1 Given that all the other packages are installed using conda, it makes sense to add git and the AWS CLI via conda as well. This is particularly helpful on EC2 instances that ship outdated versions of git.
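As a sketch, the additions to the recipe's conda environment file could look like the fragment below; the channel choice and the absence of version pins are assumptions, not the recipe's actual environment file:

```yaml
# Hypothetical additions to the recipe's environment.yml, so that git and
# the AWS CLI come from conda rather than the (possibly outdated) system
# package manager.
channels:
  - conda-forge
dependencies:
  - git
  - awscli
```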

Set summary -> perform to false Difficulty: 1 I realized that not all scopes generate load_data_csv files, which are required for the summary file to be generated. Hence, the default option for perform in the config file should be false.

Automatically create the plate information in the config files Difficulty: 4 One of the most cumbersome tasks while running the recipe is specifying the names of all the batches and plates in the config file. If a user wants to run all the plates using a single config file, this information is already available in the barcode_platemap.csv file and could be added to the config file automatically. The tricky part is making the script generic enough to satisfy most users' needs.
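A minimal sketch of the idea: read the plate barcodes out of a barcode_platemap.csv and hand them to whatever writes the config. The column name `Assay_Plate_Barcode` is the usual one in these files, but treat both it and the helper name as assumptions rather than recipe API:

```python
# Sketch: pull plate names from barcode_platemap.csv so they don't have to
# be typed into the config file by hand. Column name is an assumption.
import csv


def plates_from_barcode_platemap(platemap_csv):
    """Return the list of plate barcodes listed in a barcode_platemap.csv."""
    with open(platemap_csv, newline="") as f:
        return [row["Assay_Plate_Barcode"] for row in csv.DictReader(f)]
```

The hard part the comment above alludes to is not this lookup but deciding how the generated batch/plate stanzas should be merged into an existing config without clobbering user edits.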

Replace find and rsync steps Difficulty: 2 Currently, these two steps are necessary when aggregation is performed outside the recipe. These two steps compress the well-level aggregated profiles and then copy them to the profiles folder. This could be implemented in the recipe, saving the user the hassle of running these two steps.
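Folding those two shell steps into the recipe could be as simple as the sketch below: gzip each well-level aggregated CSV and place the result in the profiles folder. The directory layout, glob pattern, and function name are all assumptions:

```python
# Sketch: replace the external find + rsync steps by compressing each
# aggregated .csv and copying it into the profiles folder from within the
# recipe. Paths and pattern are assumptions about the project layout.
import gzip
import shutil
from pathlib import Path


def compress_and_copy(src_dir, profiles_dir, pattern="*.csv"):
    profiles_dir = Path(profiles_dir)
    profiles_dir.mkdir(parents=True, exist_ok=True)
    for csv_path in Path(src_dir).rglob(pattern):
        dest = profiles_dir / (csv_path.name + ".gz")
        with open(csv_path, "rb") as f_in, gzip.open(dest, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
```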

Remove features = infer from normalize and feature select Difficulty: 1 This option exists so that the user can input their own list of features instead of letting pycytominer infer the features from the profiles. I don't see any user inputting thousands of feature names in the config file. I will remove this option from the config file; if users want to use their own set of features, they can call pycytominer from their own script.

Profile annotation at the plate level Difficulty: 3 When multiple types of plates (treatment, control, etc.) are run in a single batch, each type of plate would need a different config file, because the external_metadata file is specified for all the plates in a config file. Allowing the user to set the name of the external_metadata file at the plate level will let them run multiple types of plates in multiple batches using the same config file.

Setting site name at the plate/batch level Difficulty: 3 Currently, all the fields of view to aggregate have to be the same for all plates in a config file. If set at the plate level, then multiple plates with different FoVs to aggregate can be run together.

Setting input and output file names for each block Difficulty: 4 The order of operations (aggregation, annotation, normalization and feature selection) is done in a predetermined order because the output of one operation is the input of another. By specifying the names of the input and output files, it will be possible to run the operations in any order. Until we move over to a more powerful WDL-like setup for running the workflow, this would provide the functionality of running operations in any order. This would also allow adding new annotations to profiles without running normalization and feature selection, which was requested by Anne.
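One hypothetical shape for such a config, with made-up key names (`input` and `output` are not existing recipe options; this is only to illustrate how explicit file names would decouple the block ordering):

```yaml
annotate:
  perform: true
  output: annotated        # hypothetical key: name of the files this block writes
normalize:
  perform: true
  input: annotated         # hypothetical key: consume whatever annotate produced
  output: normalized
```

With names like these, re-annotating without re-running normalization and feature selection is just a matter of turning the later blocks off.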

Greg mentions this in https://github.com/cytomining/profiling-recipe/issues/13

Here is some more context for the linear execution strategy - https://github.com/cytomining/profiling-recipe/issues/11

Make the normalize block more general Difficulty: 3 Currently, each type of normalization (whole plate and negcon) require different types of blocks (normalize and normalize_negcon). If the input and output names are allowed to be specified, then only a single type of block will be needed. The block will have a parameter to specify which type of normalization to perform (whole plate or negcon).
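A possible shape for the merged block, again with a hypothetical parameter name:

```yaml
normalize:
  perform: true
  method: negcon           # hypothetical key; the alternative would be whole_plate
  output: normalized_negcon
```

A single block with a method switch also pairs naturally with the per-block input/output names proposed above, since the two normalization variants would otherwise collide on output file names.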

Combining collate.py and the recipe Difficulty: 5 The recipe would greatly benefit from merging with collate.py because it could use collate.py's ability to run in parallel. collate.py might also benefit from the recipe, because it will have a home :) and the user will be able to interact with it using the config file instead of the command line. Also, the recipe and collate.py call the same pycytominer function, so it makes sense for the two to be merged.

Create directories as part of a recipe step Difficulty: 1 https://github.com/cytomining/profiling-recipe/issues/8

Include consensus building as a recipe step Difficulty: 2 https://github.com/cytomining/profiling-recipe/issues/14

Now that you have made it through the list, there are a few questions that need to be answered

niranjchandrasekaran commented 2 years ago

cc @shntnu

bethac07 commented 2 years ago

Logging of how profiles were run, with what config, what commit of the recipe, what version/commit of pycytominer, any errors or warnings generated, etc.

a) Logging is just good to do.

b) A true thing that happened: I was trying to match a data repo's version of the recipe exactly. I checked out the listed commit, but it turns out the recipe had been updated AFTER the data in question had been generated, so my results were coming out differently. Since batches of a given project can sometimes come months apart, it is not at all impossible that this situation would happen again (batch 1 made with recipe commit X, months pass, recipe is updated because a user needs new functionality for batch 2, batch 2 made with recipe commit Y, a future person comes along and tries to reproduce batch 1 results with commit Y, and cannot, because it was made with commit X). In this case, because there was a HUGE difference (feature selection happening at the plate vs. batch level), I caught it pretty fast, but a casual user might not catch it at all. As far as I can tell, the only way to be sure currently is to jump the data repo back in time to the commit where the batch in question was added and then check the recipe commit welded in at that version; that's certainly better than never being able to tell EVER, but it is pretty advanced.

shntnu commented 2 years ago

@bethac07 I didn't fully understand https://github.com/cytomining/profiling-recipe/issues/30#issuecomment-1099671598

Logging of how profiles were run, with what config, what commit of the recipe, what version/commit of pycytominer, any errors or warnings generated, etc.

Are you saying this ^^^ should be added to the recipe wish list?

logging is just good to do

Didn't get this

batch 1 made with recipe commit X, months pass, recipe is updated because user needs new functionality for batch 2, batch 2 made with recipe commit Y, future person comes along and tries to reproduce batch 1 results with commit Y, cannot, because it was made with commit X

This is what bothers me the most about our approach here. For the system to be completely failproof, all the data in the data repo needs to be recreated every time the recipe is updated. This is a result of the fact that there's a single recipe commit associated with all the data contained in the repo, but the data is often generated in batches.

Are you suggesting that a log file per batch would be a much better way of tracking provenance, instead of a git submodule linking to the recipe?

bethac07 commented 2 years ago

Are you saying this ^^^ should be added to the recipe wish list?

Yes; sorry, at the end of the list of to-dos there was a "are there any other feature requests?" question, and that is my current one

Are you suggesting that a log file per batch would be a much better way of tracking provenance, instead of a git submodule linking to the recipe?

I think we should do both things - I think that even though it makes it trickier to set the repos up, welding the recipe in is a good idea, and in many repos, once it's welded in there will not be a reason to update it. BUT I think when data is processed, we should drop a log into the log folder that says "on yyyy/mm/dd [XYZ] files were added to the "profiles" folder; processing was done with recipe commit {commit} and environment file version {stable link to environment file at current commit}" - so that in the cases where the recipe version weld IS updated at least once, it's easy to figure out which version of the recipe made each batch of profiles.
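The log line described above could be produced by something like the sketch below. The formatting helper and its argument names are hypothetical; the git call is the standard way to read the checked-out commit of a submodule:

```python
# Sketch of the per-run provenance log line proposed above. Helper names and
# the repo layout ("profiling-recipe" submodule directory) are assumptions.
import datetime
import subprocess


def provenance_log_entry(files_added, recipe_commit, pycytominer_version, date=None):
    """Format one log line in the style proposed in this thread."""
    date = date or datetime.date.today().isoformat()
    return (
        f"on {date} {len(files_added)} files were added to the profiles folder; "
        f"processing was done with recipe commit {recipe_commit} "
        f"and pycytominer version {pycytominer_version}"
    )


def current_recipe_commit(recipe_dir="profiling-recipe"):
    """Read the checked-out commit of the welded-in recipe via git."""
    return subprocess.check_output(
        ["git", "-C", recipe_dir, "rev-parse", "HEAD"], text=True
    ).strip()
```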

Does that clarify at all?

shntnu commented 2 years ago

Are you suggesting that a log file per batch would be a much better way of tracking provenance, instead of a git submodule linking to the recipe?

I think we should do both things - I think that even though it makes it trickier to set the repos up, welding the recipe in is a good idea, and in many repos, once it's welded in there will not be a reason to update it. BUT I think when data is processed, we should drop a log into the log folder that says "on yyyy/mm/dd [XYZ] files were added to the "profiles" folder; processing was done with recipe commit {commit} and environment file version {stable link to environment file at current commit}" - so that in the cases where the recipe version weld IS updated at least once, it's easy to figure out which version of the recipe made each batch of profiles.

Ah got it. So you are saying that we should do this manually. That sounds sensible to me.

shntnu commented 2 years ago

My addition to this wish list is https://github.com/cytomining/profiling-recipe/issues/12

shntnu commented 2 years ago

We discussed the current status of the recipe here

https://broadinstitute.slack.com/archives/C01AF25CQLT/p1666267827475639?thread_ts=1666197636.830589&cid=C01AF25CQLT

TODO someone: parse it into individual items to add to the wishlist.

I pulled out a part of it into https://github.com/cytomining/profiling-recipe/issues/38 but that might need to be split further

shntnu commented 1 year ago

Store minimal metadata in profile CSVs in favor of having a separate metadata file that we join each time. This is what we do in https://github.com/jump-cellpainting/dataset/ (see https://github.com/jump-cellpainting/jump-cellpainting/issues/141#issuecomment-1323860915 for a recent example of why this is important)
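The join-at-load-time pattern is an ordinary pandas merge on the plate/well keys; a minimal sketch, with column names assumed rather than taken from the recipe:

```python
# Sketch: keep profile CSVs lean and join the external metadata table each
# time profiles are loaded. Key column names are assumptions.
import pandas as pd


def load_with_metadata(profile_csv, metadata_csv,
                       on=("Metadata_Plate", "Metadata_Well")):
    profiles = pd.read_csv(profile_csv)
    metadata = pd.read_csv(metadata_csv)
    return metadata.merge(profiles, on=list(on), how="inner")
```

The payoff is the one in the linked issue: when an annotation turns out to be wrong, only the small metadata file is corrected and re-joined, instead of regenerating every profile file that baked the metadata in.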

bethac07 commented 1 year ago

Add drop_outliers and the associated configuration parameter (difficulty: 1)

bethac07 commented 1 year ago

Make the specification of negcons more flexible (difficulty: 2-6, depending on how much more flexibility we add)

bethac07 commented 1 year ago

In annotate, add a Metadata_Batch column (difficulty: 1 or 2, I have to think) that appends the batch name given
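In pandas terms this is a one-liner; a hypothetical helper (not the recipe's actual annotate code) showing the intended effect:

```python
# Sketch: stamp the batch name from the config onto every row during
# annotation. Helper name and column position are assumptions.
import pandas as pd


def add_batch_column(profiles: pd.DataFrame, batch_name: str) -> pd.DataFrame:
    annotated = profiles.copy()
    # Put the batch column first, alongside the other Metadata_ columns
    annotated.insert(0, "Metadata_Batch", batch_name)
    return annotated
```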

bethac07 commented 1 year ago

More print statements during feature selection - at the annotation and normalization stages, you see per batch and plate what's happening; during feature selection, you only get a message when the step has started, and then nothing until all batches are done. (difficulty: 1-2)

Also, looking at the list above:

Remove features = infer from normalize and feature select Difficulty: 1 This option exists so that the user can input their own list of features instead of letting pycytominer infer the features from the profiles. I don't see any user inputting thousands of feature names in the config file. I will remove this option from the config file; if users want to use their own set of features, they can call pycytominer from their own script.

Definitely as our data sets start getting larger, this might be nice - maybe support instead for loading it in from a file? I agree inputting it in the config file is not ideal. I don't feel terribly strongly about this, just a nice-to-have.

bethac07 commented 1 year ago

When loading in plates for feature selection and selecting at the "batch" or "all" levels, concatenation happens for each plate - first you have A, then AB, then ABC, etc. We've gotten substantial speed improvements elsewhere by putting intermediate dfs into a list, then running concat once and only once - see aggregate compartment, which follows this strategy. Should help on large batches. Difficulty - 1
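The single-concat pattern described above, as a sketch (the loader name is hypothetical): collect the per-plate frames in a list and call `pd.concat` exactly once, which copies each row once instead of re-copying the growing frame on every append:

```python
# Sketch: build the batch-level frame with one concat instead of repeated
# pairwise concatenation (A, then AB, then ABC, ...).
import pandas as pd


def load_batch(plate_files):
    # O(n) total copying, versus roughly O(n^2) for incremental concat
    frames = [pd.read_csv(path) for path in plate_files]
    return pd.concat(frames, ignore_index=True)
```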

bethac07 commented 1 year ago

Building on this

Logging of how profiles were run, with what config, what commit of the recipe, what version/commit of pycytominer, any errors or warnings generated, etc.

Add a printout of conda list to the logs generated, not just the commit of pycytominer. Difficulty: 1-2