cytomining / profiling-handbook

Image-based Profiling Handbook
https://cytomining.github.io/profiling-handbook/
Creative Commons Zero v1.0 Universal
9 stars 8 forks source link

Stated aggregation method (mean) is inconsistent with that used in other projects #53

Closed shntnu closed 2 years ago

shntnu commented 4 years ago

The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5). However, in other projects / papers / software, we decided to use median. This is a major issue and should be resolved!

Some notes

  1. In the LINCS dataset, we decided https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451 to use medians: we were following the lead of what was done in the previously-processed version of the dataset in which we used median. When this project was first executed in 2017, cytominer_scripts used median as default; this was later changed here https://github.com/broadinstitute/cytominer_scripts/issues/18 (more on this below)
  2. pycytominer uses median by default for aggregation.
  3. In our 2018 experiments, we found that median performed better (that plot doesn't show mean).
  4. This PR https://github.com/broadinstitute/cytominer_scripts/issues/18 – Change default aggregation to be mean instead of median was to make cytominer_scripts/aggregate.R consistent with cytotools/aggregate.R. But it is unclear why cytotools (the new version of cytominer_scripts) used mean! I think this was because cytominer::aggregate used mean by default.
  5. Carmen Verdugo reported in June 2022: The profiling recipe does aggregation using median by default (here and here), while the profiling handbook uses collate.py for the creation of the sqlite and the aggregation, and here is the key: collate.py hard-coded mean by default (here).
shntnu commented 2 years ago

Moving an email conversation with @bethac07 to this thread

IIUC the profiling recipe should be updated to reflect our current practice of using mean instead of median

I agree with Beth that mean makes more sense at this level

The reason for that is - a lot of things may have partial penetrance (when using genetic reagents) or cause adjustments in things like cell cycle phase, which unless it affects 50% +1 cells, you'll never see that by a median aggregation. For any strong phenotype, it shouldn't matter much which you use, but one that's subtle and/or not universal we feel mean captures it better.

However we did notice this:

  • In our 2018 experiments, we found that median performed better (that plot doesn't show mean).

Nevertheless, I think our default should just be whatever we are now doing in http://github.com/jump-cellpainting/ and whatever the Cimini lab has been using as their default. Are we using mean or median in JUMP, @bethac07. Is that consistent with whatever C-lab uses for profiling datasets? If so, let's update the recipe to reflect this

bethac07 commented 2 years ago

The handbook uses mean for aggregating (i.e. creating level 3) as well as for collapsing (i.e. creating level 5).

The current handbook does not cover collapsing at all, so it does not take a position on the latter, but you're right that it does in the former.

In our 2018 experiments, we found that median performed better (that plot doesn't show mean)

I'm curious whether that held for both the compound experiments and TA-ORF, do you recall? Because naively, due to penetrance, I would suspect that median might fare better for compounds but mean better for genetic perturbations.

Nevertheless, I think our default should just be whatever we are now doing in http://github.com/jump-cellpainting/ and whatever the Cimini lab has been using as their default. Are we using mean or median in JUMP, @bethac07. Is that consistent with whatever C-lab uses for profiling datasets?

@ErinWeisbart can confirm, but I'm pretty sure at least for all of JUMP production she was using collate.py, which means she's been doing mean aggregation. Based on the config files for CPJUMP1/the pilots, the aggregation method is not set and so would be the default, so I guess it's median. We've been using mean as our default, but if that is based on me remembering the decision to move to mean incorrectly and thinking that was what should have been our default, we can certainly pivot to using median as the aggregation.

bethac07 commented 2 years ago
image

It's annoying that just mean alone isn't in this graph but to my eye-

Mean + PCA vs Median + PCA: Mean outperforms median in the bioactives + TA-ORF; they are equivalent in CDRP

Mean + Factor vs Median + Factor: The colors are maddeningly similar for this pair, and additionally it looks like the legend has 11 conditions and each plot has no more than 10, so they may never have been run head to head (or one has inadvertantly not been plotted).

So what specific data are we basing median outperforming mean on? It def isn't published, at least not in that paper.

shntnu commented 2 years ago

So what specific data are we basing median outperforming mean on? It def isn't published, at least not in that paper.

Darn, I tried digging through https://broadinstitute.atlassian.net/wiki/spaces/IP/pages/506658917/Experiments+-+Moment+based+profiling to find it, but no luck

I've asked Greg in case we did it in LINCS https://github.com/broadinstitute/lincs-cell-painting/issues/22#issuecomment-1155604288

The current handbook does not cover collapsing at all, so it does not take a position on the latter, but you're right that it does in the former.

Noted

I'm curious whether that held for both the compound experiments and TA-ORF, do you recall? Because naively, due to penetrance, I would suspect that median might fare better for compounds but mean better for genetic perturbations.

No luck with my digging, so I don't have an answer here unfortunately

@ErinWeisbart can confirm, but I'm pretty sure at least for all of JUMP production she was using collate.py, which means she's been doing mean aggregation. Based on the config files for CPJUMP1/the pilots, the aggregation method is not set and so would be the default, so I guess it's median.

Thanks for this background. So it sounds like there is a discrepancy between jump/source_4 (mean) and jump-pilot/source_4 (median) at the least?

We've been using mean as our default, but if that is based on me remembering the decision to move to mean incorrectly and thinking that was what should have been our default, we can certainly pivot to using median as the aggregation.

  1. the majority of the recent datasets have been processed using mean
  2. mean is the current default in the Cimini lab
  3. Even though https://github.com/cytomining/profiling-handbook/issues/53#issuecomment-1155224709 does not show mean vs median head-to-head, the PCA+mean and PCA+median should be indicative. PCA+mean > PCA+median

All this together makes me recommend mean as the default.

We might come up with more conclusive answers when we dig deeper into JUMP https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57

ErinWeisbart commented 2 years ago

It is correct that for JUMP Production I used collate.py

shntnu commented 2 years ago

I'll resolve this thread now because we've concluded that mean is the aggregation method of choice.

As noted above, we might come up with more conclusive answers when we dig deeper into JUMP https://github.com/jump-cellpainting/develop-computational-pipeline/issues/57 and related efforts, and we can revisit these decisions then.

@niranjchandrasekaran note that there is a discrepancy between the CPJUMP1 and the JUMP production datasets; see https://github.com/cytomining/profiling-handbook/issues/53#issuecomment-1155612878 for details.

shntnu commented 2 years ago

Oh, and @bethac07 @ErinWeisbartL: please keep an eye out for this discrepancy; it's possible there are other places (other than the handbook+recipe) where this will need to be fixed.

I see that @carmendv has already updated this in https://github.com/broadinstitute/cellprofiler-on-Terra (/pull/40) so that's great 👍

AnneCarpenter commented 2 years ago

So do I understand correctly that we are choosing mean as our future default because that is what is used in JUMP production data? If so, that makes sense to me.

bethac07 commented 2 years ago

@AnneCarpenter Mean was used in jump-production, and we think mean is better. But all the JUMP pilots were done with median.