Closed: gwaybio closed this issue 4 years ago
Important Note: The plates and features are visualized in the same order in all figures.
It is odd that a handful of plates sit around mean = 2.5; median = 2
in level 3 data. This data level is derived only from aggregating single-cell profiles. In level 4a and level 4b data, the mean and median per-plate difference sits around 0.5. In level 4a data, the extreme outliers in mean and sum are driven largely by only a few very large values (we don't see this in medians). These features are removed in the level 4b data. Some extreme outliers in the level 4a data may be explained by the handful of plates that are oddly different in level 3 data.
On the surface, it looks like there are insurmountable differences between the two tools. However, upon closer inspection, the differences are not as large as they seem. There are, indeed, examples of extremely large differences in a subset of the plates and a subset of the features. This view highlights the strange differences observed in a small subset of level 3 plates, and these plates continue to be different in level 4a data. However, these differences appear to dissolve in level 4b, indicating that they are likely explained by a relatively small subset of specific features. Likewise, the handful of extremely large level 4a differences also appear largely resolved in the level 4b data. This observation further emphasizes the need to perform an independent round of feature selection when merging level 4a plates. It also raises the question of whether we need to develop different normalization strategies per feature, since inherent differences in feature distributions lead to differential responses to a single normalization strategy.
We knew at the onset that aligning the profiles derived from each tool was going to be an arduous task. We knew that some cytominer-derived profiles were normalized differently (https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451), which also implies that different plates could have been processed at different times with different methods. We do not know the extent to which different plates have been processed differently with cytominer. We do know that all plates have been uniformly processed with the same pycytominer pipeline.
Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can remain under active development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.
The code to generate these results is provided in #29
cc'ing @niranjchandrasekaran @shntnu and @annecarpenter - any feedback/suggestions/comments are welcome!
This represents an incredible amount of work on your part! 💯
It is odd that a handful of plates sit around
mean = 2.5; median = 2
in level 3 data.
These very likely correspond to the 10 plates listed here https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451.
When creating the notebook, do leave those plates out.
In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.
Can you test if it is related to the m.a.d.?
R https://stat.ethz.ch/R-manual/R-devel/library/stats/html/mad.html Python https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_absolute_deviation.html
We knew that some cytominer-derived profiles were normalized differently (#3 (comment))
Ah, they were aggregated differently, not normalized differently
Recommendation
Given the consistency and documentation of the pycytominer pipeline, the cytominer processing differences, and the amount of time required to completely resolve these differences (we'd likely need to uniformly process all plates with a new cytominer pipeline as well), I think we should stick with the pycytominer profiles. This repo can remain under active development after releasing version 1 profiles. If we find future issues with this data, or update the processing pipeline, there is nothing stopping us from releasing future versions.
I'm on board with this plan! 💯
These very likely correspond to the 10 plates listed here #3 (comment). When creating the notebook, do leave those plates out.
Confirmed that those are the 10 plates. Some features (not all) still do have a bit of a wobble.
Ah, they were aggregated differently, not normalized differently
Ah! I got it - makes sense.
In level 4a and level 4b data, the mean and median per plate difference sits around 0.5.
Can you test if it is related to the m.a.d.?
Digging into this now. One thing I did quickly check is that both python and R implementations use the same scaling factor (constant = 1.4826).
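A quick sketch of that check on toy data: R's mad() multiplies the raw median absolute deviation by constant = 1.4826 (approximately 1/Φ⁻¹(3/4)), and SciPy can apply the same scaling with scale="normal" (using the current function name median_abs_deviation, which replaced the one in the linked docs):

```python
import numpy as np
from scipy.stats import median_abs_deviation  # newer name for the linked scipy function

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # toy data with one extreme outlier

# raw MAD: median of absolute deviations from the median
raw_mad = np.median(np.abs(x - np.median(x)))

# scale="normal" divides by the 0.75 quantile of the standard normal,
# which is equivalent to multiplying the raw MAD by ~1.4826
scaled_mad = median_abs_deviation(x, scale="normal")

# raw_mad * 1.4826 and scaled_mad should agree closely
```

This confirms the two implementations use the same convention, so the ~0.5 per-plate difference is unlikely to come from the scaling constant itself.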
Three other things to note in the analysis log linked in https://github.com/broadinstitute/lincs-cell-painting/issues/3#issuecomment-591994451:
- cytominer_scripts/normalize is "robustize", but the default cytominer/normalize is "standardize". The cytominer/normalize used in processing the cytominer profiles was likely the August 15, 2016 version, which also uses "standardize". The February 23rd, 2017 cytominer_scripts/normalize also uses robustize.
- cytominer_scripts/normalize samples profiles before normalization. (Maybe I am reading the code wrong though.) Without knowing the seed used in the processing, there is no way to simulate this.
- The file shows normalization by DMSO control. For some reason, I thought that the cytominer profiles were whole-plate normalized. This could be a reason for the discrepancy!!
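For reference, the difference between the two normalization strategies mentioned above ("standardize" vs. "robustize") can be sketched in minimal form. These are simplified stand-ins for illustration, not the actual cytominer or pycytominer implementations:

```python
import numpy as np

def standardize(x):
    # z-score: center by the mean, scale by the standard deviation
    return (x - x.mean()) / x.std()

def robustize(x):
    # robust z-score: center by the median, scale by the MAD
    # (scaled by 1.4826 so it estimates the standard deviation for normal data)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return (x - med) / mad

x = np.array([0.9, 1.0, 1.1, 1.0, 50.0])  # one extreme outlier
# The outlier inflates the standard deviation, compressing standardized values,
# while robustize keeps the bulk of the distribution on a stable scale.
```

With skewed or outlier-heavy features, the two methods produce very different scales, which is why mixing them across pipelines would show up as large per-feature differences.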
My apologies for the confusion (in case I introduced it), but we did do a whole-plate normalization when creating profiles for the "pseudo" batch 2016_04_01_a549_48hr_batch1_cmap_style.
This was the script. There are two different sets of files on S3:
- 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control
- 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate, but everything else upstream of normalization was identical to 2016_04_01_a549_48hr_batch1
I think you are using 2016_04_01_a549_48hr_batch1. An easy way to check is to compute the median DMSO profile per plate; it should be all zeros (within some epsilon).
Will do so.
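A minimal pandas sketch of this check; the metadata column name and DMSO label here are assumptions for illustration, not taken from the repo:

```python
import pandas as pd

def median_dmso_profile(df, pert_col="Metadata_broad_sample", dmso_label="DMSO"):
    # For a DMSO-normalized plate, the median DMSO profile should be ~0
    # for every feature (within some epsilon)
    dmso = df[df[pert_col] == dmso_label]
    features = dmso.columns[~dmso.columns.str.startswith("Metadata")]
    return dmso[features].median()

# hypothetical usage on one plate's normalized profiles:
# profiles = pd.read_csv("plate_normalized.csv")
# assert (median_dmso_profile(profiles).abs() < 1e-6).all()
```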
For our notes: here are the two sets of files for a sample plate.
~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
2018-06-15 20:18:12 9778892027 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_SQ00014812_backend.tar.gz
~$ aws s3 ls s3://imaging-platform-cold/imaging_analysis/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/plates/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz
2018-06-19 16:05:52 31196865 2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad_2016_04_01_a549_48hr_batch1_cmap_style_SQ00014812_backend.tar.gz
- One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control
- The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate
I can confirm this
See these two plots in the attached notebook:
x2016_04_01_a549_48hr_batch1_cmap_style_SQ00015233_normalized_median_full_plate
x2016_04_01_a549_48hr_batch1_SQ00015233_normalized_median_dmso
Notebook: resolving_normalization_issue.nb.html.zip
So if you use the Level 4a profiles from 2016_04_01_a549_48hr_batch1_cmap_style, I'm pretty sure it will be nearly identical to your output.
This could be a reason for the discrepancy!!
- One corresponding to 2016_04_01_a549_48hr_batch1, which was robustize by DMSO control
- The other corresponding to 2016_04_01_a549_48hr_batch1_cmap_style, which was robustize by the whole plate
I can confirm this
Looks like we are on to something here
Comparing DMSO normalized plates (after removing those pesky 10 nonuniform plates)
Check out those axes!
I am much more comfortable with these comparisons. A couple of takeaways:
One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.
- I am happy with this comparison and think that we should move forward with the pycytominer profiles for a version 1 release. Like I mentioned previously, we can always rerelease an updated version!
Agree
- I bet that a dynamic normalization scheme that automatically recognizes distributions of features and scales accordingly will give us (probably only moderate) performance boosts.
distributions of features at the aggregate level or single cell level?
distributions of features at the aggregate level or single cell level?
Aggregate level, but it would probably work at the single-cell level too, right? I haven't played with single-cell data yet, so I don't know those distributions
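A hypothetical sketch of the "dynamic" per-feature normalization floated above (an idea from this thread, not an implemented pipeline): z-score features that look roughly normal, and fall back to a robust z-score (median/MAD) otherwise. The Shapiro-Wilk test and its cutoff are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

def dynamic_normalize(x, alpha=0.05):
    # If the feature passes a normality test, plain z-scoring is reasonable
    if stats.shapiro(x).pvalue > alpha:
        return (x - x.mean()) / x.std()
    # Otherwise use a robust z-score (median / scaled MAD)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return (x - med) / mad

# e.g., each aggregate-level feature column gets normalized independently
feature = np.random.default_rng(0).normal(size=100)
normalized = dynamic_normalize(feature)
```

Either branch centers the feature near zero; the choice only changes how outlier-heavy features are scaled.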
One additional thing I should do before merging is to compare the pycytominer feature selected (4b) to the cytominer normalized (4a). This will tell us if the pycytominer feature selection is removing noisy features well enough (currently the 4b plots are based on intersections). I think this is actually the last step.
Results below. There are a couple of discrepancies, but not much more than we see in the 4b intersection.
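The intersection-based comparison described above can be sketched as follows (toy frames; in the real comparison, level 4b has fewer columns because feature selection removed some):

```python
import pandas as pd

level_4a = pd.DataFrame({"feat_1": [0.1], "feat_2": [1.0], "feat_3": [2.0]})
level_4b = pd.DataFrame({"feat_1": [0.1], "feat_3": [1.9]})  # feat_2 was removed

# restrict both data levels to the shared (intersected) feature columns,
# then compare element-wise
shared_features = level_4a.columns.intersection(level_4b.columns)
abs_diff = (level_4a[shared_features] - level_4b[shared_features]).abs()
```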
closed by merging #29
In this issue, I will discuss results of step 3 outlined in https://github.com/broadinstitute/lincs-cell-painting/issues/22#issue-610266067
Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results.
Comparing Pycytominer and Cytominer Processing
We have previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer. Cytominer is an R-based image-based profiling tool. In this repo, we reprocess the data with pycytominer. As the name connotes, pycytominer is a Python-based image-based profiling tool.
We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.
We output several metrics comparing the two approaches.
Metrics
In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.
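As a toy illustration of this metric (hypothetical frames with already-aligned wells and features, not the real profiles):

```python
import pandas as pd

# two small profile tables with matching rows (wells) and columns (features)
pycytominer_df = pd.DataFrame({"feat_1": [0.1, 0.2], "feat_2": [1.0, 2.0]})
cytominer_df = pd.DataFrame({"feat_1": [0.1, 0.25], "feat_2": [1.1, 2.0]})

# element-wise absolute value difference between the two profile sets
abs_diff = (pycytominer_df - cytominer_df).abs()
summary = abs_diff.mean()  # per-feature mean absolute difference
```

Summaries like the per-plate mean, median, and sum of abs_diff are what the figures above report.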
In addition, we confirm alignment of the following metadata columns:
Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
Data Levels
Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.