gwaybio commented 4 years ago

In this issue, I will discuss results of step 3 outlined in https://github.com/broadinstitute/lincs-cell-painting/issues/22#issue-610266067

Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results

Comparing Pycytominer and Cytominer Processing

We have previously processed all of the Drug Repurposing Hub Cell Painting Data using cytominer. Cytominer is an R based image-based profiling tool. In this repo, we reprocess the data with pycytominer. As the name connotes, pycytominer is a python based image-based profiling tool.

We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.

We output several metrics comparing the two approaches

Metrics

In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.

Mean, median, and sum of element-wise differencs
Per feature mean, median, and sum of element-wise differences
Feature selection procedure differences per feature (level 4b only)

In addition, we confirm alignment of the following metadata columns:

Well
Broad Sample Name
Plate

Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.

Data Levels

Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.

Data	Level	Comparison
Images	Level 1	NA
SQLite File (single cell profiles )	Level 2	NA
Aggregated Profiles with Well Information (metadata)	Level 3	Yes
Normalized Aggregated Profiles with Metadata	Level 4a	Yes
Normalized and Feature Selected Aggregated Profiles with Metadata	Level 4b	Yes
Perturbation Profiles created Summarizing Replicates	Level 5	No

gwaybio commented 4 years ago

Important Note: The plates and features are visualized in the same order in all figures.

EDIT: These comparisons were originally being made incorrectly. I was comparing cytominer DMSO normalization to pycytominer whole plate. See https://github.com/broadinstitute/lincs-cell-painting/issues/28#issuecomment-626919430 for the correct comparisons. In the edit I also add a collapsable tag to these out-of-date figures.

Summary

Click to show figures

![summary_metrics_full](https://user-images.githubusercontent.com/7340421/81506181-8ae81e00-92c2-11ea-907c-5fe8a15fe910.png) ![summary_metrics_zoom](https://user-images.githubusercontent.com/7340421/81506187-92a7c280-92c2-11ea-8fa7-2ac88ac04459.png) **Figure Legend:** The top plot above represents the complete summary within each plate. The bottom plot above zooms in to absolute difference < 10. Each point represents the mean/median/sum of absolute value differences comparing cytominer and pycytominer. I derived the metrics using every matrix entry per plate. There is one point per plate.

Interpretation

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data. This level data is only derived from aggregating single cell profiles. In level 4a and level 4b data, the mean and median per plate difference sits around 0.5. In level 4a data, the extreme outliers in mean and sum are driven largely by only a few very large outliers (we don't see this in medians). These features are removed in the level 4b data. Some extreme outliers in the level 4a outliers may be explained by the handful of plates that are oddly different in level 3 data.

Within Level Summaries

Interpretation

On the surface, it looks like there are insurmountable differences between the two tools. However, upon a closer inspection, the differences are not as large as they seem. Indeed, there are examples of extremely large differences in a subset of the plates and a subset of the features. This view highlights the strange differences observed in a small subset of level 3 plates, and these plates continue to be different in level 4a data. However these differences appear to dissolve in level 4b indicating that they are likely to be explained by a relatively small subset of specific features. Likewise, the handful of examples of extremely large level 4a differences also appear largely resolved in the level 4b data. This observation further emphasizes the need to perform an independent round of feature selection when merging level 4a plates. It also begs the question if we need to develop different normalization strategies per feature since the inherent differences in feature distributions lead to differential responses to a single normalization strategy.

Level 3 - Aggregated Profiles

Click to show figures

![level_3_metrics_per_feature](https://user-images.githubusercontent.com/7340421/81506397-fda5c900-92c3-11ea-8c28-1bdea2ab3078.png) ![level_3_metrics_per_plate](https://user-images.githubusercontent.com/7340421/81506416-2037e200-92c4-11ea-991f-5ecc7af3d994.png) **Figure Legend:** Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4a - Aggregated and Normalized Profiles

Click to show figures

![level_4a_metrics_per_feature](https://user-images.githubusercontent.com/7340421/81506452-67be6e00-92c4-11ea-8173-f612d451ef3e.png) ![level_4a_metrics_per_plate](https://user-images.githubusercontent.com/7340421/81506461-7c026b00-92c4-11ea-8af3-3b4644b8cd98.png) **Figure Legend:** Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the bottom plot shows a per plate view.

Level 4b - Aggregated and Normalized Profiles

Click to show figures

![level_4b_metrics_per_feature](https://user-images.githubusercontent.com/7340421/81506495-b10ebd80-92c4-11ea-90df-acaebec4a939.png) ![level_4b_metrics_per_plate](https://user-images.githubusercontent.com/7340421/81506498-b409ae00-92c4-11ea-8356-e9237bf7bd9b.png) ![level_4b_all_across_well_summary](https://user-images.githubusercontent.com/7340421/81506508-c4ba2400-92c4-11ea-8918-98f267706380.png) **Figure Legend:** Each point represents a per feature average within a single plate. The top plot above shows a per feature view, while the middle plot shows a per plate view. The bottom plot represents the distribution of feature differences in total. Most differences are very near zero. ![feature_select_summary](https://user-images.githubusercontent.com/7340421/81506873-5aef4980-92c7-11ea-80ea-58ae9ae5b2a3.png) **Figure Legend:** Feature selection summary. This figure shows which features were selected, by plate, by each of the two tools.

gwaybio commented 4 years ago

Recommendation

We knew at the onset that aligning the profiles derived from each tool was going to be an arduous task. We knew that some cytominer-derived profiles were normalized differently (https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451), which also implies that different plates could have been processed at different times with different methods. We do not know the extent to which different plates have been processed differently with cytominer. We do know that all plates have been uniformly processed with the same pycytominer pipeline.

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

gwaybio commented 4 years ago

The code to generate these results is provided in #29

cc'ing @niranjchandrasekaran @shntnu and @annecarpenter - any feedback/suggestions/comments are welcome!

shntnu commented 4 years ago

This represents an incredible amount of work on your part! 💯

It is odd that a handful of plates sit around mean = 2.5; median = 2 in level 3 data.

These very likely correspond to the 10 plates listed here https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451.

When creating the notebook, do leave those plates out.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

R https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mad.html Python https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html

shntnu commented 4 years ago

We knew that some cytominer-derived profiles were normalized differently (#3 (comment))

Ah – they were aggregated differently, not normalized differently

shntnu commented 4 years ago

Recommendation

Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can exist under additional development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.

I'm on board with this plan! 💯

gwaybio commented 4 years ago

These very likely correspond to the 10 plates listed here #3 (comment). When creating the notebook, do leave those plates out.

level_3_metrics_per_plate_uniformplates

level_3_metrics_per_feature_uniformplates

Confirmed that those are the 10 plates. Some features (not all) still do have a bit of a wobble.

Ah – they were aggregated differently, not normalized differently

Ah! I got it - makes sense.

In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.

Can you test if it is related to m.a.d.

Digging into this now. One thing I did quickly check is that both python and R implementations use the same scaling factor (constant = 1.4826).

Three other things to note in the analysis log linked in https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451.

The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!
Also, the default cytominer_scripts/normalize is "robustize" but the default cytominer/normalize is "standardize".
- Sidenote: The version of cytominer/normalize used in processing the cytominer profiles was likely the August 15, 2016 version, which also uses "standardize". The February 23rd, 2017 cytominer_scripts/normalize also uses robustize.
I think the current implementation of cytominer_scripts/normalize samples profiles before normalization. (Maybe I am reading the code wrong though). Without knowing the seed used in the processing, there is no way to simulate this

shntnu commented 4 years ago

The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!

My apologies for the confusion (in case I introduced it), but we did do a whole-plate normalization when creating profiles for the "pseudo" batch 2016_04_01_a549_48hr_batch1_cmap_style.

This was the script. There are two different sets of files on S3.

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control
The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate, but everything else upstream of normalization was identical to 2016_04_01_a549_48hr_batch1.

I think you are using 2016_04_01_a549_48hr_batch1. An easy way to check is to compute the median DMSO profile per plate; it should be all zeros (within some epsilon). Will do so.

For our notes: here are the two sets of files for a sample plate.

~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
2018-06-15 20:18:12 9778892027 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz
2018-06-19 16:05:52   31196865 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz

shntnu commented 4 years ago

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control

The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate,

I can confirm this

See these two plots in the attached notebook

x2016_04_01_a549_48hr_batch1_cmap_style_SQ00015233_normalized_median_full_plate
x2016_04_01_a549_48hr_batch1_SQ00015233_normalized_median_dmso

Notebook resolving_normalization_issue.nb.html.zip

So if you use the Level 4a profiles from 2016_04_01_a549_48hr_batch1_cmap_style, pretty sure it will nearly identical to your output.

gwaybio commented 4 years ago

This could be a reason for the discrepancy!!

One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control

The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate,

I can confirm this

Looks like we are on to something here

Comparing DMSO normalized plates 👇 (after removing those pesky 10 nonuniform plates)

summary_metrics_full_uniformplates

summary_metrics_zoom_uniformplates

level_4a_metrics_per_feature

level_4b_metrics_per_feature

Check out those axes! 👀

gwaybio commented 4 years ago

I am much more comfortable with these comparisons 👍 a couple of takeaways:

The feature selection comparisons are within floating point estimates, but there is some wobble. There is more wobble in the 4a profiles (before feature selection). I think this is why feature selection tends to work well in our cases.
- Our normalization procedures do not play nicely with all CellProfiler feature outputs.
- This will change experiment-to-experiment, and might even change between CellProfiler versions.
- CellProfiler features that are introduced with each analysis module could come alongside a metadata spreadsheet for data scientists that describes expected feature behavior and normalization strategies.
- I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.
I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.

shntnu commented 4 years ago

I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!

Agree

shntnu commented 4 years ago

I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.

distributions of features at the aggregate level or single cell level?

gwaybio commented 4 years ago

distributions of features at the aggregate level or single cell level?

Aggregate level, but probably would work at single cell right? I haven't played with single cell yet so I don't know these distributions

gwaybio commented 4 years ago

One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.

Comparing Cytominer 4a to Pyctyominer 4b

Results below. There are a couple discrepancies but not much more than we see in 4b intersection.

Click to show figures

![pycytominer_select_metrics_per_feature](https://user-images.githubusercontent.com/7340421/81614083-8482a080-93ad-11ea-89e3-d7d65ab77cd4.png) ![pycytominer_select_metrics_per_plate](https://user-images.githubusercontent.com/7340421/81614094-877d9100-93ad-11ea-9875-54ccc4420396.png)

gwaybio commented 4 years ago

closed by merging #29

broadinstitute / lincs-cell-painting

Comparing Cytominer and Pycytominer Profiles #28

Comparing Pycytominer and Cytominer Processing

Metrics

Data Levels

Summary

Interpretation

Within Level Summaries

Interpretation

Level 3 - Aggregated Profiles

Level 4a - Aggregated and Normalized Profiles

Level 4b - Aggregated and Normalized Profiles

Recommendation

Recommendation

Comparing Cytominer 4a to Pyctyominer 4b