carpenter-singh-lab / 2024_vanDijk_PLoS_CytoSummaryNet


P2 02. Final model for the LINCS dataset (batch 1) #13

Open EchteRobert opened 2 years ago

EchteRobert commented 2 years ago

Here I trained a model on all data available from batch 1 in the LINCS dataset, which can be found like this: `aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/`

The model uses 1745 features, because of an issue with 10 plates (https://github.com/broadinstitute/lincs-cell-painting/issues/88#issuecomment-1249269257). In total, I trained the model on 136 plates, 5965 wells, including 1228 unique compounds using the 10 uM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotation. I used the following hyperparameters:

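| Hyperparameter | Value |
|:----------------|:-------------|
| batch size | 36 |
| epochs | 100 |
| kFilters | 0.5 |
| latent dim | 2048 |
| learning rate | 0.0005 |
| nr cells | (1500, 800) |
| nr sets | 8 |
| optimizer | AdamW |
| output dim | 2048 |
| true batch size | 288 |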
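For reference, the table above can be written down as a small config sketch. The dictionary keys are hypothetical names, not necessarily the ones used in the training script, and the relation `true batch size = batch size × nr sets` is an assumption that happens to match the reported numbers (36 × 8 = 288).

```python
# Hyperparameters copied from the table above.
# The key names are illustrative; the training script may use different ones.
hyperparameters = {
    "batch_size": 36,
    "epochs": 100,
    "kFilters": 0.5,
    "latent_dim": 2048,
    "learning_rate": 0.0005,
    "nr_cells": (1500, 800),
    "nr_sets": 8,
    "optimizer": "AdamW",
    "output_dim": 2048,
    "true_batch_size": 288,
}


def effective_batch_size(batch_size: int, nr_sets: int) -> int:
    """Assumed relation: every well in a batch contributes nr_sets sampled cell sets."""
    return batch_size * nr_sets


# Consistency check of the assumed relation against the reported value.
computed = effective_batch_size(
    hyperparameters["batch_size"], hyperparameters["nr_sets"]
)
assert computed == hyperparameters["true_batch_size"]
```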

I assess the model on the 10 uM dose point using replicate and MoA prediction and similarly on the 3.33 uM dose, which is considered the test set.
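As a rough illustration of the metric, here is a minimal sketch of mean average precision (mAP). It assumes profiles are ranked by cosine similarity and that a match means sharing the same label (the compound for replicate prediction, the MoA for MoA prediction). The exact definition used in this repository is not stated in this thread, so the function names, the similarity measure, and the toy data are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_precision(query_idx, profiles, labels):
    """AP for one query: rank all other profiles, count same-label hits."""
    others = [i for i in range(len(profiles)) if i != query_idx]
    ranked = sorted(
        others,
        key=lambda i: cosine(profiles[query_idx], profiles[i]),
        reverse=True,
    )
    hits, precisions = 0, []
    for rank, i in enumerate(ranked, start=1):
        if labels[i] == labels[query_idx]:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(profiles, labels):
    """Mean of the per-query average precisions."""
    n = len(profiles)
    return sum(average_precision(i, profiles, labels) for i in range(n)) / n


# Toy data: two well-separated groups of two profiles each.
profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]
print(mean_average_precision(profiles, labels))
```

Shuffling the labels relative to the profiles lowers the score, which is the idea behind the "shuffled" column in the results.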

Results

Results 10 uM dose point

_Replicate prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
|:---------|---------------------:|------------------:|------------------------:|
| all plates | 0.7473 | 0.269 | 0 |

_MoA prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
|:---------|------------:|---------:|---------------:|
| all plates | 0.0541 | 0.0338 | 0.0002 |

Results 3.33 uM dose point

_Replicate prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
|:----------|---------------------:|------------------:|------------------------:|
| all plates | 0.4465 | 0.1695 | 0 |

_MoA prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
|:----------|------------:|---------:|---------------:|
| all plates | 0.042 | 0.0322 | 0 |
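For readers unfamiliar with the test reported above, here is a minimal sketch of Welch's t statistic in plain Python. The `Ttest_indResult(...)` output format suggests the original numbers came from SciPy's `scipy.stats.ttest_ind` with `equal_var=False`, but that is an assumption; the function below only reproduces the statistic and the degrees of freedom, not the p-value, and the sample values are made up.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples are not assumed to share
    a common variance.
    """
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-compound AP values for the model and the benchmark.
model_ap = [0.80, 0.70, 0.75, 0.90, 0.65]
bm_ap = [0.30, 0.20, 0.25, 0.35, 0.28]
t_stat, dof = welch_t(model_ap, bm_ap)
print(t_stat, dof)
```

A positive statistic means the first sample (here the model) has the higher mean.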

Loss curves

Screen Shot 2022-10-05 at 11 19 38 AM

Screen Shot 2022-10-05 at 11 20 15 AM

All plate names
EchteRobert commented 2 years ago

As discussed during yesterday's check-in, I have computed Figure 4D as in the LINCS manuscript. I only have dose points 3.33 and 10 available. In general, we see that:

figure4 (1)

EchteRobert commented 2 years ago

Interpretability analysis rerun for LINCS data

From plate SQ00015142 I inspected images from well B13, which is 10 uM sulfafurazole, and computed the same saliencies as before in the Stain data. I chose this well randomly and the plate based on the large size of the file. I tried inspecting SQ00015106 before, but the seeding was so sparse that picking the top and bottom saliency cells resulted in only a handful of cells in total. The seeding generally seems to be less dense than in the Stain experiments.

Main takeaways

No conclusion can be drawn from these results because the high and low saliency cells are not consistent in their appearance.

Images here!

![Screen Shot 2022-10-13 at 2 16 18 PM](https://user-images.githubusercontent.com/62173977/195594887-f9e29677-31d8-419a-bc9c-e563e8df058c.png)
![Screen Shot 2022-10-13 at 2 16 36 PM](https://user-images.githubusercontent.com/62173977/195594902-28ab4071-d7f9-4cd3-b9b3-2ac1b6c87760.png)
![Screen Shot 2022-10-13 at 2 17 07 PM](https://user-images.githubusercontent.com/62173977/195594904-ca5fdb4f-6563-47d4-beaf-42fc65983a17.png)
![Screen Shot 2022-10-13 at 2 16 55 PM](https://user-images.githubusercontent.com/62173977/195594907-7cfc813d-4dea-4d0b-a17f-bea0b92052e8.png)
![Screen Shot 2022-10-13 at 2 18 05 PM](https://user-images.githubusercontent.com/62173977/195594892-1f307c49-195d-4f52-b318-77b974bb2462.png)
![Screen Shot 2022-10-13 at 2 17 57 PM](https://user-images.githubusercontent.com/62173977/195594896-96affdba-a252-4588-934c-177528985678.png)

EchteRobert commented 2 years ago

Interpretability analysis rerun for LINCS data

From plate SQ00015131 I inspected images from well E13, which is 10 uM ganetespib and has HSP inhibitor as its MoA, and computed the same saliencies as before in the Stain data. This MoA is the one that improved the most, in relative terms, at both the 3.33 and 10 uM dose points when using model profiling versus average profiling.

Main takeaways

I think now we can see that the green-outlined cells tend to be brighter/have stronger contrast in general than the red-outlined cells. We also see that (again) features that calculate the correlation between different channels are the most important for deciding which cells are the most or least important. IIUC, that means that cells which are very flat are not important and cells that are 'fat' in the depth dimension are more important. Then the question is: does it make sense that flat cells are less representative of the compound than fat cells? I wonder if you can see a similar more conclusive pattern here as well @AnneCarpenter?

Images here!

![Screen Shot 2022-10-14 at 11 54 02 AM](https://user-images.githubusercontent.com/62173977/195818711-377c7369-c336-4cd6-80d8-996c9b0ad947.png)
![Screen Shot 2022-10-14 at 11 54 43 AM](https://user-images.githubusercontent.com/62173977/195818838-beaf2c5e-114f-42a8-8058-ad559eacea36.png)
![Screen Shot 2022-10-14 at 11 54 54 AM](https://user-images.githubusercontent.com/62173977/195818872-68f118bc-685b-45eb-afa4-ef307b90160a.png)
![Screen Shot 2022-10-14 at 11 55 05 AM](https://user-images.githubusercontent.com/62173977/195818938-9534b8b0-6969-4ba5-a749-13c6f4c3eb53.png)
![Screen Shot 2022-10-14 at 11 55 43 AM](https://user-images.githubusercontent.com/62173977/195819088-42175cc5-61b0-46fa-8fae-409475896fc8.png)
![Screen Shot 2022-10-14 at 11 56 17 AM](https://user-images.githubusercontent.com/62173977/195819186-cd4baedf-de60-4220-8651-1f15cc210de1.png)
![Screen Shot 2022-10-14 at 11 56 31 AM](https://user-images.githubusercontent.com/62173977/195819225-b7377bd8-3748-44dd-b347-f096b2c01a5d.png)

bethac07 commented 2 years ago

@EchteRobert Do you happen to know what version of CellProfiler your features were made in? 3.X or 4.X? I don't know if it's in a way that's fatal to your analysis, but as we were putting 4.0 together we realized that Costes features in CP3.X are improperly calculated.

EchteRobert commented 2 years ago

Ah, that's interesting @bethac07. According to the LINCS manuscript, it was version 2.3.1 so I'm guessing they were improperly calculated there as well. I'm wondering what the model is picking up then... Do you know how they are calculated exactly then?

bethac07 commented 2 years ago

My level of understanding from memory (which I cannot stress enough may be wrong) and a bit of digging is this: Costes measurements are a special case of the Manders coefficient (which involves looking at which parts of an image are threshold-positive in each of 2 channels), where in Costes that threshold is defined in a particular way. In at least CellProfiler 3, but possibly/probably also 2.3, there was an assumption that the maximum gray level (numerical value) was 255, which is true in 8-bit images but wrong in 16-bit images (which these are), where it is 65535. So the threshold was being set basically always to 255, which most of the image is brighter than, so the calculated correlation coefficients were nearly always 1.

So basically, I think it was measuring "pixels brighter than 255"?
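To make the effect concrete, here is a toy sketch of a Manders-style coefficient with a fixed threshold. This is not CellProfiler's implementation and it does not compute the Costes threshold itself; the function name, the pixel values, and the "sensible" threshold of 20000 are all hypothetical. It only shows that a threshold stuck at 255 on 16-bit data lets nearly every pixel through, so the coefficient ends up close to 1.

```python
def manders_m1(ch1, ch2, threshold2):
    """Fraction of channel-1 intensity found in pixels where channel 2
    exceeds its threshold (a Manders-style coefficient)."""
    total = sum(ch1)
    colocalized = sum(p1 for p1, p2 in zip(ch1, ch2) if p2 > threshold2)
    return colocalized / total if total else 0.0


# Hypothetical 16-bit pixel intensities for two channels (flattened image).
ch1 = [400, 1200, 30000, 52000, 800, 150]
ch2 = [300, 900, 41000, 1000, 60000, 200]

# Threshold capped at the 8-bit maximum: almost all pixels count as positive.
capped = manders_m1(ch1, ch2, threshold2=255)

# A threshold on the 16-bit scale (value chosen arbitrarily for illustration).
sensible = manders_m1(ch1, ch2, threshold2=20000)

print(capped, sensible)
```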

EchteRobert commented 2 years ago

Great catch Beth! I checked out the values of those Costes Correlation features and they are indeed all equal (or almost equal) to 1. To get these features I just looked at which features resulted in the highest absolute saliency values. I think there are two possible explanations as to why these features popped up as 'most salient':

  1. Because a value of 1 is relatively large for a feature (I normalize all feature values within the plate). This explanation fits the L1 norm activation based saliency method the most.
  2. It also makes sense for the gradient-based saliency as this method is looking at features that when changed could influence the outcome a lot. Possibly, when most cells have a 1 for Costes Correlation and a few don't, these few cells would become very important for model prediction.

Instead, I now calculated the correlation between saliency and feature values (something I also did before) and that points to different features, which I hope do have some actual meaning 😄 . Below are the results for this particular well for the different saliency scores.

Main takeaways

Combined saliency score

| Feature name | Correlation (Pearson) |
|---------------------------------------|-----------------------|
| Cytoplasm_Texture_SumAverage_RNA_10_0 | 0.381074 |
| Cytoplasm_Texture_SumAverage_RNA_20_0 | 0.391287 |
| Cytoplasm_Correlation_Manders_AGP_RNA | 0.408331 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.435283 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.438746 |
| Cytoplasm_Texture_Entropy_RNA_5_0 | 0.442981 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.443986 |
| Cytoplasm_Texture_Entropy_RNA_10_0 | 0.445096 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.445393 |
| Cytoplasm_Texture_Entropy_RNA_20_0 | 0.462262 |

| Feature name | Correlation (Pearson) |
|------------------------------------------------|-----------------------|
| Cytoplasm_Texture_AngularSecondMoment_RNA_10_0 | -0.454882 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_5_0 | -0.451614 |
| Cytoplasm_Texture_AngularSecondMoment_RNA_20_0 | -0.448406 |
| Nuclei_Intensity_IntegratedIntensity_RNA | -0.445297 |
| Nuclei_Intensity_IntegratedIntensity_Mito | -0.436190 |
| Cells_Intensity_MaxIntensity_RNA | -0.432253 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.431139 |
| Nuclei_Intensity_IntegratedIntensity_ER | -0.422684 |
| Nuclei_Intensity_IntegratedIntensity_AGP | -0.413577 |
| Nuclei_Texture_Correlation_DNA_5_0 | -0.382188 |

L1 norm activation saliency score

| Feature name | Correlation (Pearson) |
|----------------------------------------------|-----------------------|
| Nuclei_Texture_SumAverage_DNA_5_0 | 0.461082 |
| Nuclei_Granularity_1_Mito | 0.461331 |
| Cytoplasm_Texture_SumEntropy_RNA_5_0 | 0.465899 |
| Nuclei_RadialDistribution_MeanFrac_AGP_4of4 | 0.468284 |
| Nuclei_RadialDistribution_MeanFrac_ER_4of4 | 0.468673 |
| Nuclei_RadialDistribution_MeanFrac_Mito_4of4 | 0.471352 |
| Cytoplasm_Texture_SumEntropy_RNA_20_0 | 0.474167 |
| Cytoplasm_Texture_SumEntropy_RNA_10_0 | 0.477463 |
| Nuclei_Intensity_LowerQuartileIntensity_DNA | 0.483370 |
| Nuclei_Texture_InfoMeas1_DNA_5_0 | 0.500611 |

| Feature name | Correlation (Pearson) |
|---------------------------------------------|-----------------------|
| Nuclei_Intensity_UpperQuartileIntensity_RNA | -0.567669 |
| Nuclei_Intensity_StdIntensity_RNA | -0.546395 |
| Nuclei_Intensity_MeanIntensity_RNA | -0.543263 |
| Cells_Intensity_StdIntensity_RNA | -0.542904 |
| Cells_Intensity_MaxIntensity_RNA | -0.542100 |
| Nuclei_Intensity_MADIntensity_RNA | -0.540240 |
| Nuclei_Intensity_MaxIntensity_RNA | -0.540037 |
| Nuclei_Intensity_StdIntensity_AGP | -0.535854 |
| Nuclei_Intensity_MADIntensity_AGP | -0.527600 |
| Nuclei_Intensity_UpperQuartileIntensity_ER | -0.521866 |

Gradient analysis score

| Feature name | Correlation (Pearson) |
|---------------------------------------------------|-----------------------|
| Cells_Texture_InverseDifferenceMoment_RNA_5_0 | -0.513449 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_AGP | -0.510165 |
| Cells_Intensity_IntegratedIntensityEdge_AGP | -0.507586 |
| Cytoplasm_AreaShape_MaximumRadius | -0.501875 |
| Cells_Texture_InverseDifferenceMoment_RNA_10_0 | -0.500681 |
| Cytoplasm_Intensity_IntegratedIntensityEdge_RNA | -0.497952 |
| Cytoplasm_Texture_InverseDifferenceMoment_RNA_5_0 | -0.495278 |
| Cytoplasm_AreaShape_MeanRadius | -0.493570 |
| Cells_Texture_InverseDifferenceMoment_RNA_20_0 | -0.490450 |
| Cells_AreaShape_MinorAxisLength | -0.488041 |

| Feature name | Correlation (Pearson) |
|---------------------------------------------|-----------------------|
| Cells_Texture_DifferenceEntropy_RNA_10_0 | 0.492800 |
| Cytoplasm_Texture_InfoMeas1_RNA_10_0 | 0.493645 |
| Cells_Texture_Contrast_RNA_5_0 | 0.494087 |
| Cytoplasm_Texture_DifferenceEntropy_RNA_5_0 | 0.494265 |
| Cells_Texture_DifferenceEntropy_RNA_5_0 | 0.506087 |
| Cells_Texture_InfoMeas1_DNA_5_0 | 0.513774 |
| Cells_Texture_InfoMeas1_DNA_10_0 | 0.514939 |
| Cells_Texture_InfoMeas1_RNA_10_0 | 0.519832 |
| Cytoplasm_Texture_InfoMeas1_RNA_5_0 | 0.522585 |
| Cells_Texture_InfoMeas1_RNA_5_0 | 0.541869 |

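The ranking in the tables above can be sketched roughly as follows: for each feature, compute the Pearson correlation between its per-cell values and the per-cell saliency score, then list the most negative and most positive features. This is a minimal sketch under assumptions: the function names and toy data are hypothetical, and returning 0 for a constant feature (whose correlation is undefined) is a choice made here, not necessarily what the original analysis did.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant input: correlation is undefined, report 0
    return cov / (sx * sy)


def rank_features(features, saliency, top_k=10):
    """Return (most negative, most positive) features by correlation."""
    corrs = {name: pearson(vals, saliency) for name, vals in features.items()}
    ordered = sorted(corrs.items(), key=lambda kv: kv[1])
    return ordered[:top_k], ordered[-top_k:]


# Toy per-cell data for one well: three features and one saliency score.
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [4.0, 3.0, 2.0, 1.0],
    "feature_c": [1.0, 1.0, 1.0, 1.0],  # constant, like a feature stuck at 1
}
saliency = [0.1, 0.2, 0.3, 0.4]

most_negative, most_positive = rank_features(features, saliency, top_k=1)
print(most_negative, most_positive)
```

A feature that is the same for every cell cannot rank highly under this approach, unlike when ranking by absolute saliency values.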
AnneCarpenter commented 2 years ago

Does this change the cells that would be green and red then? Seems like yes, happy to take another look.

By the way, for the colorblind you will eventually want to change to another color scheme. The wiki has a section that can help.

EchteRobert commented 2 years ago

> Does this change the cells that would be green and red then?

It could change them yes - and I think they did (but I don't have a lot of experience with analyzing cells by eye)

> By the way, for the colorblind you will eventually want to change to another color scheme.

Yes, I will change that!

EchteRobert commented 2 years ago

Model trained on 3.33 uM dose point.

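3.33 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
|:-----------|----------:|-------:|----------------:|-------------:|
| all plates | 0.0456 | 0.0324 | 0.0323 | 0 |

10 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
|:-----------|----------:|-------:|----------------:|-------------:|
| all plates | 0.0475 | 0.034 | 0.034 | 0.0002 |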