Open EchteRobert opened 2 years ago
As discussed during yesterday's check-in, I have computed Figure 4D as in the LINCS manuscript. I only have dose points 3.33 and 10 available. In general, we see that:
From plate SQ00015142 I inspected images from well B13, which is 10 uM sulfafurazole, and computed the same saliencies as before in the Stain data. I chose this well randomly and the plate based on the large size of the file. I tried inspecting SQ00015106 before, but the seeding was so sparse that picking the top and bottom saliency cells resulted in only a handful of cells in total. The seeding generally seems to be less dense than in the Stain experiments.
No conclusion can be drawn from these results because the high and low saliency cells are not consistent in their appearance.
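For reference, here is a minimal sketch of how the top and bottom saliency cells could be picked per well. It assumes one scalar saliency score per cell; the function name `top_bottom_by_saliency` and the `fraction` parameter are hypothetical and not taken from the actual pipeline. It also shows why sparse seeding leaves only a handful of cells to inspect.

```python
def top_bottom_by_saliency(saliency, fraction=0.1):
    """Return (top_indices, bottom_indices) of cells ranked by saliency.

    `fraction` is a hypothetical parameter: the share of cells taken
    from each end of the ranking.
    """
    n_cells = len(saliency)
    # Indices sorted from lowest to highest saliency
    order = sorted(range(n_cells), key=lambda i: saliency[i])
    # At least one cell per group, but never let the two groups overlap
    k = min(max(1, int(n_cells * fraction)), n_cells // 2)
    if k == 0:
        return [], []
    top = order[-k:][::-1]   # highest saliency first
    bottom = order[:k]       # lowest saliency first
    return top, bottom


scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
top_cells, bottom_cells = top_bottom_by_saliency(scores, fraction=0.2)
print(top_cells, bottom_cells)
```

With a sparsely seeded well, `n_cells` is small, so `k` collapses to one or two cells per group regardless of the chosen fraction.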
From plate SQ00015131 I inspected images from well E13, which is 10 uM ganetespib and has HSP inhibitor as its MoA, and computed the same saliencies as before in the Stain data. This MoA showed the largest relative improvement at both the 3.33 and 10 uM dose points when using model profiling versus average profiling.
I think now we can see that the green-outlined cells tend to be brighter/have stronger contrast in general than the red-outlined cells. We also see that (again) features that calculate the correlation between different channels are the most important for deciding which cells are the most or least important. IIUC, that means that cells which are very flat are not important and cells that are 'fat' in the depth dimension are more important. Then the question is: does it make sense that flat cells are less representative of the compound than fat cells? I wonder if you can see a similar more conclusive pattern here as well @AnneCarpenter?
@EchteRobert Do you happen to know what version of CellProfiler your features were made in? 3.X or 4.X? I don't know if it is fatal to your analysis, but as we were putting 4.0 together we realized that the Costes features in CP3.X are improperly calculated.
Ah, that's interesting @bethac07. According to the LINCS manuscript, it was version 2.3.1 so I'm guessing they were improperly calculated there as well. I'm wondering what the model is picking up then... Do you know how they are calculated exactly then?
My level of understanding from memory (which I cannot stress enough may be wrong) and a bit of digging is this: Costes measurements are a special case of the Manders coefficient (which involves looking at which parts of an image are above threshold in each of 2 channels), where in Costes that threshold is defined in a particular way. In at least CellProfiler 3, but possibly/probably also 2.3, there was an assumption that there were only 256 gray levels (values 0 to 255), which is true for 8-bit images but wrong for 16-bit images (which these are), which have 65,536 gray levels (values 0 to 65,535). So the threshold was basically always being set to 255, which most of the image is brighter than, so the calculated correlation coefficients were nearly always 1.
So basically, I think it was measuring "pixels brighter than 255"?
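To make the effect described above concrete, here is a toy illustration on synthetic data. It is not CellProfiler's code: `manders_m1` is a plain Manders-style coefficient written for this example, the pixel values are made up, and the threshold of 3000 is an arbitrary stand-in for whatever a correct Costes threshold would be on 16-bit data.

```python
import random


def manders_m1(ch1, ch2, thr2):
    """Fraction of channel-1 intensity in pixels where channel 2 exceeds thr2."""
    total = sum(ch1)
    colocalized = sum(a for a, b in zip(ch1, ch2) if b > thr2)
    return colocalized / total if total else 0.0


def synthetic_16bit_channel(rng, n_pixels):
    """Made-up 16-bit pixels: background near 1000, ~10% bright signal."""
    pixels = []
    for _ in range(n_pixels):
        value = rng.randint(800, 1200)
        if rng.random() < 0.1:
            value += rng.randint(5000, 40000)
        pixels.append(value)
    return pixels


rng = random.Random(0)
ch1 = synthetic_16bit_channel(rng, 10_000)
ch2 = synthetic_16bit_channel(rng, 10_000)

# Threshold stuck at the 8-bit maximum: every pixel counts as "positive"
m1_capped = manders_m1(ch1, ch2, thr2=255)
# Arbitrary threshold above the synthetic background (assumption for illustration)
m1_16bit = manders_m1(ch1, ch2, thr2=3000)
print(m1_capped, m1_16bit)
```

Because every synthetic pixel is brighter than 255, the capped threshold selects the whole image and the coefficient is exactly 1, while a threshold above background gives a much lower value for these two independent channels.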
Great catch Beth! I checked out the values of those Costes Correlation features and they are indeed all equal (or almost equal) to 1. To get these features I just looked at which features resulted in the highest saliency values (absolute). I think there are two possible explanations as to why these features popped up as 'most salient':
Instead, I now calculated the correlation between saliency and feature values (something I also did before) and that points to different features, which I hope do have some actual meaning 😄 . Below are the results for this particular well for the different saliency scores.
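A minimal sketch of this kind of ranking, assuming one scalar saliency score per cell and a table of per-cell feature values. The helper names and the feature names in the example are hypothetical; the actual analysis may compute the correlation differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature cannot co-vary with saliency
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_saliency_correlation(features, saliency):
    """features: dict of feature name -> per-cell values; saliency: per-cell scores."""
    scores = {name: pearson(values, saliency) for name, values in features.items()}
    # Strongest absolute correlation first
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical feature names and made-up values for five cells
features = {
    "constant_feature": [1.0, 1.0, 1.0, 1.0, 1.0],
    "increasing_feature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noisy_feature": [2.0, 1.0, 4.0, 3.0, 5.0],
}
saliency = [0.1, 0.2, 0.3, 0.4, 0.5]
ranked = rank_features_by_saliency_correlation(features, saliency)
print(ranked)
```

A feature that is constant across cells gets a score of 0 under this measure, so it cannot rise to the top of the ranking the way it can when ranking by absolute saliency value.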
Does this change the cells that would be green and red then? Seems like yes, happy to take another look.
By the way, for the colorblind you will eventually want to change to another color scheme. The wiki has a section that can help.
> Does this change the cells that would be green and red then?
It could change them, yes, and I think it did (but I don't have a lot of experience with analyzing cells by eye).
> By the way, for the colorblind you will eventually want to change to another color scheme.
Yes, I will change that!
Model trained on 3.33 uM dose point.
3.33 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
|---|---|---|---|---|
| all plates | 0.0456 | 0.0324 | 0.0323 | 0 |
10 uM

| plate | mAP model | mAP BM | mAP filtered BM | mAP shuffled |
|---|---|---|---|---|
| all plates | 0.0475 | 0.034 | 0.034 | 0.0002 |
Here I trained a model on all data available from batch 1 in the LINCS dataset, which can be found like this:
aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/
The model uses 1745 features, because of an issue with 10 plates (https://github.com/broadinstitute/lincs-cell-painting/issues/88#issuecomment-1249269257). In total, I trained the model on 136 plates, 5965 wells, including 1228 unique compounds using the 10 uM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotation. I used the following hyperparameters:
I assess the model on the 10 uM dose point using replicate and MoA prediction and similarly on the 3.33 uM dose, which is considered the test set.
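For readers unfamiliar with the metric, here is a generic sketch of mean average precision (mAP) for a retrieval-style evaluation. It assumes that, for each query profile, the other profiles are ranked by similarity and marked as relevant when they share the same compound (replicate prediction) or MoA (MoA prediction). The function names are illustrative, and the exact evaluation used for the numbers in this thread may differ in its details.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank of each relevant item
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two made-up queries: relevant items at ranks 1 and 3, and at rank 2
example_rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
example_map = mean_average_precision(example_rankings)
print(example_map)
```

Under this definition, a shuffled ranking with very few relevant items per query gives a value close to 0, which is consistent with the shuffled baseline columns in the tables.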
Results
Results 10 uM dose point

_Replicate prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
|:---------|---------------------:|------------------:|------------------------:|
| all plates | 0.7473 | 0.269 | 0 |

_MoA prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
|:---------|------------:|---------:|---------------:|
| all plates | 0.0541 | 0.0338 | 0.0002 |

Results 3.33 uM dose point

_Replicate prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
|:----------|---------------------:|------------------:|------------------------:|
| all plates | 0.4465 | 0.1695 | 0 |

_MoA prediction_

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
|:----------|------------:|---------:|---------------:|
| all plates | 0.042 | 0.0322 | 0 |

Loss curves
All plate names
SQ00014812_SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015116_SQ00015117_SQ00015118_SQ00015119_SQ00015120_SQ00015121_SQ00015122_SQ00015123_SQ00015124_SQ00015125_SQ00015126_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015155_SQ00015156_SQ00015157_SQ00015158_SQ00015159_SQ00015160_SQ00015162_SQ00015163_SQ00015164_SQ00015165_SQ00015166_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015197_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015229_SQ00015230_SQ00015231_SQ00015232_SQ00015233