carpenter-singh-lab / 2024_vanDijk_PLoS_CytoSummaryNet


03. Model for Stain2 #5

EchteRobert opened 2 years ago

EchteRobert commented 2 years ago

It is now clear that this feature aggregation model will only serve a certain feature set (i.e., a particular dataset), and is not designed to aggregate arbitrary feature sets (it is only invariant to the number of cells per well). I will start by creating a model that can beat the 'mean aggregation' baselines of the Stain2 batches, then move on to Stain3 and Stain4, and finally use Stain5 as the final test set.

Given that, it would be ideal if all features were the same across the Stain datasets. This is (somewhat) the case across Stain2, Stain3, and Stain4. However, Stain5 has a slightly different CellProfiler pipeline, resulting in a different and larger feature set. During preprocessing I found that the pipeline from raw single-cell features to data that can be fed directly to the model is quite slow. This is especially true when all features are used (4295 for Stain2-4 and 5794 for Stain5). Model inference and training also become increasingly slow as the number of features increases. The initial experiments on CPJUMP1 showed that not all features are needed to create a better profile than the baseline (https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/1). I have therefore chosen to use only the features common to Stain2-5. This has the advantage of speed, both in preprocessing and inference, and of compatibility, as no separate model will have to be trained to use Stain5 as the test set.

Assuming that the features are consistent within each experiment, there are 1324 features measured in all of Stain2, Stain3, Stain4, and Stain5. The features are well distributed across categories: Cells: 441 features, Cytoplasm: 433 features, and Nuclei: 450 features. Of these, 1124 are reasonably uncorrelated (absolute Pearson correlation < 0.5) [one plate tested]. From here on these are the features that will be used to train the model. A minimal sketch of this selection step is shown below.
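A sketch of that selection, assuming each Stain dataset's single-cell features are loaded as a pandas DataFrame with CellProfiler feature names as columns; function names and data layout are illustrative, not the actual pipeline code:

```python
# Sketch: intersect feature sets across Stain datasets, then drop features
# that correlate strongly (|Pearson r| > 0.5) with an already-retained one.
import numpy as np
import pandas as pd

def common_features(stain_dfs):
    """Intersect the feature (column) sets across all Stain datasets."""
    return sorted(set.intersection(*(set(df.columns) for df in stain_dfs)))

def drop_correlated(df, threshold=0.5):
    """Greedily drop features whose absolute Pearson correlation with a
    retained feature exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

# features = common_features([stain2, stain3, stain4, stain5])  # -> 1324 features
# reduced = drop_correlated(one_plate_df[features])             # -> ~1124 features
```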

EchteRobert commented 2 years ago

The Stain2 experiment (https://github.com/jump-cellpainting/pilot-analysis/issues/15) contains 14 batches, of which only one will not be used to train the model: BR00112200 (Confocal), which contains fewer features than the other batches because it is missing the RNA channel. All other batches will be used to train or validate the model. See the overview below:

Beautiful colours here!

_Note that the Percent Strong shown here is calculated with an additional sphering operation_

[screenshot: Screen Shot 2022-02-28 at 2 20 31 PM]

_The Percent Strong/Replicating with feature-selected features - no sphering_

| Description | Percent_Replicating |
|:-----------------------|----------------------:|
| BR00113818.csv | 51.1 |
| BR00113819.csv | 51.1 |
| BR00113821.csv | 51.1 |
| BR00113820.csv | 56.7 |
| BR00112198.csv | 55.6 |
| BR00112204.csv | 63.3 |
| BR00112199.csv | 58.9 |
| BR00112200.csv | 63.3 |
| BR00112201.csv | 70 |
| BR00112197repeat.csv | 63.3 |
| BR00112203.csv | 52.2 |
| BR00112202.csv | 56.7 |
| BR00112197binned.csv | 58.9 |
| BR00112197standard.csv | 66.7 |

_The Percent Strong/Replicating with the 1324 features as used by the model - **I will use this as the reference BM**_

| Description | Percent_Replicating |
|:-----------------------|----------------------:|
| BR00113818.csv | 52.2 |
| BR00113819.csv | 48.9 |
| BR00113821.csv | 47.8 |
| BR00113820.csv | 55.6 |
| BR00112198.csv | 56.7 |
| BR00112204.csv | 58.9 |
| BR00112199.csv | 57.8 |
| BR00112201.csv | 66.7 |
| BR00112197repeat.csv | 63.3 |
| BR00112203.csv | 56.7 |
| BR00112202.csv | 54.4 |
| BR00112197binned.csv | 58.9 |
| BR00112197standard.csv | 56.7 |
EchteRobert commented 2 years ago

Experiment 1

The first model is trained on BR00112197 binned, BR00112199 multiplane, and BR00112203 MitoCompare. These are the most distinct batches that could have been chosen; the feature distributions of all other batches are more similar to one another. The training and validation loss curves indicate slow but steady learning, and the model has not converged after 50 epochs. The PR is calculated for each batch as a whole, without the negative controls. The training data consists of 80% of each batch, so the model has not seen the remaining 20% during training; a rough sketch of the split is shown below. The model will also be tested on a completely unseen batch.
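A rough sketch of that per-batch 80/20 split; splitting at the well level is an assumption, since the exact unit of the split is not stated here:

```python
# Minimal sketch: hold out 20% of each batch so the model never sees those
# wells during training. Names are illustrative.
import numpy as np

def split_batch(well_ids, train_frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    well_ids = np.array(well_ids)
    rng.shuffle(well_ids)
    cut = int(train_frac * len(well_ids))
    return well_ids[:cut], well_ids[cut:]  # (training wells, held-out wells)
```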

Main Takeaways

Conclusion

The model shows promise in learning general aggregation methods that are applicable to unseen data, as long as the features remain constant. However, something unexpected is going on with the BR00112199 MultiPlane and BR00112197 binned batches. I will investigate whether these results are due to chance or something else.

Results! Wooh!

[screenshot: Screen Shot 2022-02-28 at 2 24 06 PM]

_BR00112203 MitoCompare - training data_
![Stain2_BR00112203_MitoCompare_PR](https://user-images.githubusercontent.com/62173977/156059037-cf34c8bb-472b-48a1-a041-8fccdeb3668b.png)

_BR00112203 MitoCompare RobustMAD-normalized features_
![Stain2_BR00112203_MitoCompare_normalized_PR](https://user-images.githubusercontent.com/62173977/156059021-071ca8aa-58db-42eb-a89d-474c4c8baed1.png)

_BR00112199 MultiPlane - training data_
![Stain2_BR00112199_MultiPlane_PR](https://user-images.githubusercontent.com/62173977/156059012-7e5d3a03-3858-4bae-9c7a-958b79c3a739.png)

_BR00112197 binned - training data_
![Stain2_BR00112197binned_PR](https://user-images.githubusercontent.com/62173977/156059006-52028fb2-c620-45dd-8c4c-099a060622ae.png)

_BR00113818 Redone - **not in training set**_
![Stain2_BR00113818_Redone_PR](https://user-images.githubusercontent.com/62173977/156058989-e1ba9812-87ff-4729-9945-49d249a4b3ad.png)
EchteRobert commented 2 years ago

While trying to find the cause of the possible issue described in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1054601450, I found that the model creates a feature space that places profiles from the same batch closer together than the mean aggregation method does. Whether this is a good thing is not obvious to me. Note that BR00113818 is not in the training set of the MLP.

Look at these patterns!

![UMAP_MLP](https://user-images.githubusercontent.com/62173977/156071350-cdbf6a72-907a-4ad5-b4f0-294f3ab4a337.png)
![UMAP_BM](https://user-images.githubusercontent.com/62173977/156071369-70ad2541-a4ed-45c3-a75b-0848ad16afab.png)
EchteRobert commented 2 years ago

Experiment 1 (continued)

As the model improved the PS over the baseline on all of the previous plates, I will now test it on 5 more plates from the Stain2 dataset: BR00113818Redone, BR00113819Redone, BR00113820Redone, BR00113821Redone, and BR00112197repeat. The PR/PS is reported below. I also plotted histograms of the number of cells per well for each plate.

Main takeaways

The model performs similarly to or better than the average aggregation method on 3 out of 5 plates. However, it significantly underperformed on the remaining two. I expected this to be related to the number of cells present in the plates. Looking at the histograms of these two plates (BR00113820Redone and BR00113821Redone), this may indeed be the cause, as they have a different distribution of cells per well and fewer cells overall.

Later addition: As discussed with @shntnu, I calculated the PC1 loadings per plate and the correlations between these loadings; see below, and the sketch after this paragraph. It shows that BR00112203 (training), BR00113819, BR00113820, and BR00113821 in particular do not correlate well with the other plates in terms of PC1 loadings, i.e. different features are more important for describing the profiles of these plates. Note also that BR00112203 and BR00112199 are used as 2 of the 3 training plates, while these correlate especially poorly with the two poorly performing plates. In particular, because BR00112203 (training) has the highest PR while its PC1-loadings correlation with all other plates is relatively low, the model is expected to perform worse on all other plates.
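A sketch of that computation, assuming `plate_profiles` maps plate names to well-by-feature matrices with identical feature ordering (illustrative, not the actual analysis code):

```python
# Compute the first principal component's loadings per plate, then correlate
# the loading vectors between plates. The sign of a PC is arbitrary, so the
# absolute correlation is what matters.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pc1_loadings(X):
    """One weight per feature: the leading PCA component of plate matrix X."""
    return PCA(n_components=1).fit(X).components_[0]

def pc1_loading_correlations(plate_profiles):
    loadings = pd.DataFrame({name: pc1_loadings(X)
                             for name, X in plate_profiles.items()})
    return loadings.corr().abs()  # plates x plates similarity matrix
```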

Conclusion: the plates used during training probably steer the model to pay more attention to a specific set of features, which are not as relevant for the poorly performing plates.

Are you ready for this?

_BR00112197_repeat_
![Stain2_BR00112197repeat_PR](https://user-images.githubusercontent.com/62173977/156247147-6dbcff65-1dd9-4a27-aa86-4f08d192a93c.png)

_BR00113818_Redone_
![Stain2_BR00113818_Redone_PR](https://user-images.githubusercontent.com/62173977/156247159-bc14fbb1-73c2-46bd-bd19-f2df2e16ec36.png)

_BR00113819_Redone_
![Stain2_BR00113819_Redone_PR](https://user-images.githubusercontent.com/62173977/156247171-9a06c2b5-d9a4-4e91-8347-23b4beb28253.png)

_BR00113820_Redone_
![Stain2_BR00113820_Redone_PR](https://user-images.githubusercontent.com/62173977/156247179-46e736ce-b8f9-42c7-b86a-965706042598.png)

_BR00113821_Redone_
![Stain2_BR00113821_Redone_PR](https://user-images.githubusercontent.com/62173977/156247194-890db46f-fcad-4321-b9fb-c1e406fade07.png)
Don't forget to look at these!

![BR00112197binned_hist](https://user-images.githubusercontent.com/62173977/156247475-bee185af-e3ce-4083-bf85-56990c7bc626.png)
![BR00113820_hist](https://user-images.githubusercontent.com/62173977/156247445-d7f23d68-8bfa-4143-9be1-b46ff047a564.png)
![BR00113821_hist](https://user-images.githubusercontent.com/62173977/156247454-fc44f452-b8b1-41ad-9bdf-e5c634aeefff.png)
This is additional stuff. Perhaps not as interesting as the first bit? You decide.

![BR00112197repeat_hist](https://user-images.githubusercontent.com/62173977/156247675-37cb712d-50bb-48e4-9e25-493522005112.png)
![BR00112199_hist](https://user-images.githubusercontent.com/62173977/156247693-936146ce-8816-4cde-ae5d-7fe2fa939fb1.png)
![BR00112203_hist](https://user-images.githubusercontent.com/62173977/156247698-b5b3d920-c64b-4b4f-8c3c-8611d41fe6d0.png)
![BR00113818_hist](https://user-images.githubusercontent.com/62173977/156247708-70ff09fb-5291-4f59-a051-0a69cc5bb711.png)
![BR00113819_hist](https://user-images.githubusercontent.com/62173977/156247716-71ab1618-c7c7-4c67-8f6a-d0e64f74fd34.png)
PC1 loadings per plate ![PC1_loadings_Stain2](https://user-images.githubusercontent.com/62173977/156655488-0e66f027-8618-4e31-9d53-990c93d01a6e.png)
Number of cells per well per plate summary ![Stain2_cells](https://user-images.githubusercontent.com/62173977/172942967-53123370-e221-4bb2-b37a-344a796ec044.png)
niranjchandrasekaran commented 2 years ago

> The model performs similar to or better than the average aggregation method for 3 out of 5 plates. For the remaining two it significantly underperformed however.

@EchteRobert Quick question - did you recompute Percent Replicating for the baseline using the 1324 features or are these values from the original baseline in https://github.com/jump-cellpainting/pilot-analysis/issues/15#issuecomment-670640802? If it is the latter, I would recommend doing the former so that we are comparing apples to apples.

Also, the cell count histograms surprised me. Given that the only difference between the plates is the dye concentration, I did not expect to see such a huge difference in the number of cells between plates.

EchteRobert commented 2 years ago

I did not, @niranjchandrasekaran. Good point. I will recalculate the baseline with the 1324 features.

Yes, it surprised me a bit too, although I cannot explain why this would be the case. In fact, in these two plates I encountered the first well that did not contain any cells at all.

niranjchandrasekaran commented 2 years ago

On checking the table in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1054585913, I just realized that the two plates BR00113820_Redone and BR00113821_Redone have a different cell seeding density compared to the other plates, so they are expected to have a different number of cells.

EchteRobert commented 2 years ago

Experiment (intermediate)

The previous results showed a high non-replicate correlation. Although the replicate correlation was even higher, we would prefer a lower non-replicate correlation, which would represent a cleaner profile, i.e. a sharper contrast between replicates and non-replicates. To test this, John proposed changing my current feature normalization method (zero mean, unit standard deviation) to RobustMAD; a sketch is shown below. Secondly, I doubled the batch size during training. This means there are more negative pairs per batch (the number of negative pairs grows quadratically with batch size), which may push the learned profiles further apart.
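For reference, a minimal RobustMAD sketch; the epsilon guard against a zero MAD is my assumption, and the actual implementation used may differ:

```python
# RobustMAD: per feature, subtract the median and divide by the median
# absolute deviation scaled to be consistent with the standard deviation
# under normality (factor 1.4826).
import numpy as np
from scipy.stats import median_abs_deviation

def robust_mad(X, eps=1e-18):
    med = np.median(X, axis=0)
    mad = median_abs_deviation(X, axis=0, scale="normal")  # MAD * 1.4826
    return (X - med) / (mad + eps)
```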

Main takeaways

With the increased batch size in combination with RobustMAD normalization, the model has an extremely hard time learning. Upon inspecting the model's gradients, I saw that they vanished within the first few epochs. Returning to the original normalization removed this effect and allowed for better training.

Click here!

[screenshot: Screen Shot 2022-03-02 at 3 01 07 PM]

_BR00112203 plate (previously highest PR)_
![Stain2_BR00112203_exp2_BS128_PR](https://user-images.githubusercontent.com/62173977/156440285-3028b11b-5126-4a88-874f-b085cc7d80a5.png)
EchteRobert commented 2 years ago

Experiment 2

As RobustMAD did not do what was expected and the non-replicate correlation did not decrease either, likely because the model was not learning at all, I trained another model with the previous normalization and a batch size that is still larger than the original (80, compared to the 128 used in the previous post). I also moved to 'cleaner' data (all 'green' plates as indicated in the table here: https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1054585913), which may cause the model to perform worse on the 'non-green' plates.

Main takeaways

The model is able to push the non-replicate correlation down somewhat, but this comes at the cost of overfitting: it achieves this on the training plates, but not on the validation plates. I expect that more data will be needed to achieve the best of both worlds.

Losses and PRs!

[screenshot: Screen Shot 2022-03-02 at 4 17 39 PM]

_BR00112197 standard - training data_
![Stain2_BR00112197standard_exp2_PR](https://user-images.githubusercontent.com/62173977/156450257-223b35a5-724d-4508-bc44-4dd75a7d0fd3.png)

_BR00113818 - non-training data_
![Stain2_BR00113818_PR](https://user-images.githubusercontent.com/62173977/156451037-419b0ac8-0f9d-424e-82fa-ad625a697793.png)
EchteRobert commented 2 years ago

Experiment 3

In https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1054752037 I showed that the model learns to amplify the plate-specific signal in the cell profiles. To counteract this, I trained a model that also learns from across-plate replicates. Additionally, one possible reason why the non-replicate correlation has been so high so far may be that the model learns to separate plates: by doing so, it automatically pushes all same-plate profiles together, and non-replicate profile correlation becomes higher in general. Including across-plate replicates may reduce this effect by utilizing the full latent loss space. A sketch of the pair construction is shown below.
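An illustrative sketch of the across-plate pair construction; the data layout of (plate, compound, cells) tuples is assumed, not taken from the actual training code:

```python
# Treat wells that received the same compound as replicate (positive) pairs,
# even when they come from different plates.
import itertools
from collections import defaultdict

def across_plate_pairs(wells):
    """wells: iterable of (plate_id, compound, single_cell_features) tuples."""
    by_compound = defaultdict(list)
    for plate_id, compound, features in wells:
        by_compound[compound].append((plate_id, features))
    pairs = []
    for replicates in by_compound.values():
        for (p1, f1), (p2, f2) in itertools.combinations(replicates, 2):
            if p1 != p2:  # keep only the across-plate positives
                pairs.append((f1, f2))
    return pairs
```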

Main takeaways

Non-replicate correlation does indeed appear to decrease somewhat, as expected, at least for the training plates. However, the model is overfitting very clearly, and its overall performance is much lower than the previous model's. Decreasing the batch size and increasing the number of plates used for training does not solve this problem. I suspect the model is memorizing specific compounds rather than learning an aggregation method.

UMAP patterns here!

_UMAP BM, same plates as in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1054752037_
![UMAP_BM](https://user-images.githubusercontent.com/62173977/156588903-9a33d2e3-11b1-4bdc-b3ad-d035ae24b306.png)

_UMAP MLP_
![UMAP_MLP](https://user-images.githubusercontent.com/62173977/156589035-1172b03d-9fa1-4c00-bed4-242e77f9b889.png)

_UMAP BM training plates_ ['BR00112197standard': 0, 'BR00112199': 1, 'BR00112197repeat': 2]
![UMAP_BM_train](https://user-images.githubusercontent.com/62173977/156589076-87e718c0-9a2f-4c6a-9d8a-54d336e7dab9.png)

_UMAP MLP training plates_
![UMAP_MLP_train](https://user-images.githubusercontent.com/62173977/156589134-32f76c83-cd9b-4ff7-9210-eb9b85ce6ac3.png)
Percent histograms here!

**_Training plates_**
![Stain2_BR00112197standard_PR](https://user-images.githubusercontent.com/62173977/156589374-bffdd16d-34b4-4993-a2f0-7a1c06ae9007.png)
![Stain2_BR00112197repeat_PR](https://user-images.githubusercontent.com/62173977/156590301-72bd43d1-f614-4120-8e62-f78a9f96cdbf.png)
![Stain2_BR00112199_PR copy](https://user-images.githubusercontent.com/62173977/156590024-1d65c2d7-ab35-4137-8243-372020659ade.png)

**_Test plate_**
![Stain2_BR00113818_PR](https://user-images.githubusercontent.com/62173977/156590085-20dcb957-0cb1-4d7c-b375-c3a7f572e060.png)
shntnu commented 2 years ago

> As discussed with @shntnu I calculated the PC1 loadings per plate and the correlation between these loadings.

@EchteRobert Awesome! What you essentially did here was measure the distribution similarity between all pairs of plates. The first PC is a quick way to do that.

Comparing the PC1 loadings of two multivariate distributions is a shortcut for comparing their covariance matrices. If the distributions are truly multivariate Gaussian (good luck with that, haha!), then it's actually a very good approximation (to the extent that PC1 explains a large fraction of the variance).
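In symbols, just restating the point above: the PC1 loadings are the leading eigenvectors of the per-plate covariance matrices, so correlating them compares a rank-1 summary of those covariances.

```latex
% v_1 is the leading eigenvector (PC1 loadings) of each plate's covariance.
\Sigma_A\, v_1^{(A)} = \lambda_1^{(A)} v_1^{(A)}, \qquad
\Sigma_B\, v_1^{(B)} = \lambda_1^{(B)} v_1^{(B)}, \qquad
\text{similarity}(A,B) = \left|\operatorname{corr}\!\left(v_1^{(A)}, v_1^{(B)}\right)\right|
```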

If you really want to go down this rabbit hole (⚠️ stop, don't! ⚠️), read up

EchteRobert commented 2 years ago

Experiment 3V2

Learning from previous experiments, I used the following experiment setup:

Below I will show:

Main takeaways

PC1 loadings of the model profiles ![PC1_loadings_MLP_Stain2exp3V2](https://user-images.githubusercontent.com/62173977/157282958-c919924f-b15e-49df-81aa-3c8e4f54fc5f.png)
PR but in a new latent loss space!

| **Plate** | **Percent Replicating** |
|--------------------|-------------------------|
| _Training_ | |
| BR00112197binned | 88.9 |
| BR00112199 | 91.1 |
| BR00112203 | 88.9 |
| BR00113818 | 84.4 |
| BR00113820 | 97.8 |
| _Validation_ | |
| BR00112197repeat | 72.2 |
| BR00112197standard | 72.2 |
| BR00112198 | 63.3 |
| BR00112201 | 72.2 |
| BR00112202 | 56.7 |
| BR00112204 | 61.1 |
| BR00113819 | 67.8 |
| BR00113821 | 50.0 |

![Stain2_BR00113820_PR](https://user-images.githubusercontent.com/62173977/157284712-32963efe-7c06-42a0-83a9-a7bdd0409561.png)
![Stain2_BR00113821_PR](https://user-images.githubusercontent.com/62173977/157284732-4aa314f3-d240-49e2-aa84-9a7de1f8b261.png)
A new metric approaches!

_5 plates are used to train the model (as shown in the 'Plate' column). During training, 80% of the compounds are used to train the model and 20% of the compounds (the same ones for each plate) are used as a hold-out or validation set._

| **Plate** | **training compounds MLP** | **training compounds BM** | **validation compounds MLP** | **validation compounds BM** |
|--------------------|----------------------------|---------------------------|------------------------------|-----------------------------|
| _Training_ | | | | |
| BR00112197binned | **0.44** | 0.41 | 0.20 | **0.30** |
| BR00112199 | **0.38** | 0.32 | 0.20 | **0.28** |
| BR00112203 | **0.49** | 0.30 | 0.16 | **0.27** |
| BR00113818 | **0.43** | 0.28 | 0.17 | **0.30** |
| BR00113820 | **0.59** | 0.30 | 0.18 | **0.30** |
| _Validation_ | | | | |
| BR00112197repeat | 0.29 | **0.41** | 0.25 | **0.31** |
| BR00112197standard | 0.32 | **0.40** | 0.27 | **0.28** |
| BR00112198 | 0.27 | **0.35** | 0.26 | **0.30** |
| BR00112201 | 0.26 | **0.40** | 0.22 | **0.32** |
| BR00112202 | 0.25 | **0.34** | 0.24 | **0.30** |
| BR00112204 | 0.24 | **0.35** | 0.25 | **0.29** |
| BR00113819 | 0.24 | **0.28** | 0.17 | **0.25** |
| BR00113821 | 0.19 | **0.24** | 0.12 | **0.22** |
mAP BR00112201

Plate: BR00112201. Total mean: 0.25251311463707016

_Training samples mean AP: 0.259931_

| compound | AP |
|:---------------------|----------:|
| PF-477736 | 1 |
| AMG900 | 1 |
| APY0201 | 1 |
| AZD2014 | 1 |
| GDC-0879 | 1 |
| acriflavine | 1 |
| RG7112 | 0.930556 |
| GSK-J4 | 0.897222 |
| Compound2 | 0.830556 |
| BLU9931 | 0.677167 |
| BI-78D3 | 0.668651 |
| SCH-900776 | 0.640873 |
| CPI-0610 | 0.572222 |
| SU3327 | 0.510317 |
| ABT-737 | 0.480423 |
| Compound7 | 0.472073 |
| GNF-5 | 0.469444 |
| MK-5108 | 0.447917 |
| THZ1 | 0.422808 |
| NVS-PAK1-1 | 0.347374 |
| SU-11274 | 0.32939 |
| GW-5074 | 0.246392 |
| GSK2334470 | 0.246166 |
| BX-912 | 0.24095 |
| NVP-AEW541 | 0.23775 |
| CHIR-99021 | 0.220037 |
| dosulepin | 0.202143 |
| GSK-3-inhibitor-IX | 0.172313 |
| PD-198306 | 0.148742 |
| PFI-1 | 0.14835 |
| Compound3 | 0.145067 |
| BMS-566419 | 0.12329 |
| BMS-863233 | 0.121743 |
| apratastat | 0.118872 |
| WZ4003 | 0.114163 |
| ICG-001 | 0.11288 |
| PNU-74654 | 0.0874405 |
| ML324 | 0.0822136 |
| Compound5 | 0.0819586 |
| GW-3965 | 0.0698881 |
| SGX523 | 0.0628168 |
| AZ191 | 0.0614712 |
| A-366 | 0.0492269 |
| halopemide | 0.0481211 |
| FR-180204 | 0.0474747 |
| BIX-02188 | 0.044098 |
| Compound4 | 0.0427142 |
| AZD7545 | 0.0417633 |
| SHP 99.00 | 0.0412191 |
| RGFP966 | 0.0397035 |
| IOX2 | 0.0396046 |
| CP-724714 | 0.0378228 |
| EPZ015666 | 0.037468 |
| AMG-925 | 0.0353015 |
| VX-745 | 0.0336891 |
| SGC-707 | 0.0329782 |
| P5091 | 0.0326774 |
| Compound6 | 0.0305971 |
| delta-Tocotrienol | 0.0295755 |
| Compound1 | 0.0279454 |
| PS178990 | 0.0278597 |
| carmustine | 0.0272295 |
| T-0901317 | 0.0272058 |
| andarine | 0.0257093 |
| UNC0642 | 0.0257052 |
| dimethindene-(S)-(+) | 0.0252354 |
| ML-323 | 0.0244636 |
| ML-298 | 0.0232809 |
| Compound8 | 0.0218036 |
| SAG | 0.0198054 |
| KH-CB19 | 0.0187536 |
| filgotinib | 0.0143387 |

_Validation samples mean AP: 0.222843_

| compound | AP |
|:-------------------|----------:|
| valrubicin | 0.830159 |
| sirolimus | 0.647222 |
| romidepsin | 0.614379 |
| ponatinib | 0.489386 |
| merimepodib | 0.373039 |
| ispinesib | 0.357657 |
| neratinib | 0.250216 |
| veliparib | 0.0939503 |
| orphenadrine | 0.0710256 |
| ruxolitinib | 0.0683867 |
| hydroxyzine | 0.0374705 |
| selumetinib | 0.0353887 |
| pomalidomide | 0.0339397 |
| skepinone-l | 0.0242614 |
| homochlorcyclizine | 0.0220177 |
| rheochrysidin | 0.0216262 |
| quazinone | 0.0209096 |
| purmorphamine | 0.0201343 |
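For context, a simplified sketch of how a per-compound average precision like the one above can be computed; the cosine-similarity ranking and the use of sklearn's AP are assumptions, and the actual evaluation code may differ:

```python
# For each well profile, rank all other wells by cosine similarity and score
# how highly its same-compound replicates appear in that ranking; averaging
# the per-query APs within a compound gives that compound's AP.
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.metrics.pairwise import cosine_similarity

def compound_average_precision(X, compounds):
    """X: profiles (wells x features); compounds: one label per well."""
    compounds = np.asarray(compounds)
    sims = cosine_similarity(X)
    per_compound = {}
    for i in range(len(compounds)):
        labels = (compounds == compounds[i]).astype(int)
        mask = np.arange(len(compounds)) != i  # exclude the query itself
        if labels[mask].sum() == 0:
            continue  # compound has no replicates to retrieve
        ap = average_precision_score(labels[mask], sims[i, mask])
        per_compound.setdefault(compounds[i], []).append(ap)
    return {c: float(np.mean(v)) for c, v in per_compound.items()}
```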
EchteRobert commented 2 years ago

Below is an overview of all the PRs broken down by training/validation plates and training/validation compounds, as was done for the mAP. Generally speaking, the PR values correlate highly with the mAP values reported in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1062006634.

Excel table

[screenshot: Screen Shot 2022-03-11 at 5 59 03 PM]
EchteRobert commented 2 years ago

Experiments

The model shown in previous comments is overfitting the training dataset. This means it does not beat the baseline in mean average precision when comparing the profiles it creates for validation (hold-out) compounds, validation (hold-out) plates, or both. There are two main ideas for reducing overfitting on 1. plates and 2. compounds:

  1. Consider replicates across plates
  2. Aggregate all same-compound cells from wells within a plate into a 'super well', if you will, and then sample new 'augmented wells' from this super well (see the sketch after this list). This should increase the variability of single-cell well compositions and reduce compound overfitting. (3. A possible extension of 1. and 2. is to also merge ALL compound wells across ALL plates, to form super super wells?)
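A hedged sketch of idea 2; names and array shapes are illustrative:

```python
# Pool the cells of all same-compound wells on a plate into a 'super well',
# then sample synthetic 'augmented wells' from that pool to diversify the
# single-cell compositions seen during training.
import numpy as np

def make_super_well(well_cell_arrays):
    """Stack (n_cells_i x n_features) arrays from same-compound wells."""
    return np.vstack(well_cell_arrays)

def sample_augmented_well(super_well, n_cells, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(super_well), size=n_cells,
                     replace=n_cells > len(super_well))
    return super_well[idx]
```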

Main takeaways

I will not show the results, as there are too many different experiments; instead I outline the most important findings.

Next up

A possible improvement is to reduce the data augmentation a bit: super wells will be created only 50% of the time, and for the other 50% sampling will be done from a single well. Additionally, super wells will be created by aggregating only 2 of the 4 available wells (chosen at random). Another improvement is the normalization method: I will now normalize all wells across the entire plate before training the model on them, whereas previously this normalization was done per well (a sketch contrasting the two is shown below).
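A sketch contrasting the two normalization scopes, assuming standardization as in the earlier comments; names are illustrative:

```python
# Plate-level normalization: fit mean/std on all cells of the plate, then
# apply the same statistics to every well. Previously the statistics were
# fit separately within each well.
import numpy as np

def normalize_per_plate(wells):
    """wells: list of (n_cells_i x n_features) arrays from one plate."""
    all_cells = np.vstack(wells)
    mu = all_cells.mean(axis=0)
    sigma = all_cells.std(axis=0) + 1e-18
    return [(w - mu) / sigma for w in wells]

def normalize_per_well(wells):
    """Previous approach: statistics fit within each well."""
    return [(w - w.mean(axis=0)) / (w.std(axis=0) + 1e-18) for w in wells]
```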

EchteRobert commented 2 years ago

Experiment

Results of the 'Next up' experiment described here: https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/5#issuecomment-1071401689

Main takeaways

Next up

EXCITING!

_Results in bold are the highest score_

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
|------------------------|----------------:|----------------:|---------------:|---------------:|--------:|--------:|
| _Training plates_ | | | | | | |
| **BR00112201** | **0.66** | 0.40 | **0.43** | 0.32 | **98.9** | 66.7 |
| **BR00112198** | **0.56** | 0.35 | **0.4** | 0.30 | **100** | 56.7 |
| **BR00112204** | **0.59** | 0.35 | **0.35** | 0.29 | **100** | 58.9 |
| _Validation plates_ | | | | | | |
| **BR00112202** | **0.44** | 0.34 | **0.31** | 0.30 | **93.3** | 54.4 |
| **BR00112197standard** | **0.47** | 0.40 | **0.34** | 0.28 | **94.4** | 56.7 |
| BR00112203 | 0.19 | **0.30** | 0.21 | **0.27** | 52.2 | **56.7** |
| BR00112199 | 0.3 | **0.32** | 0.23 | **0.28** | **76.7** | 57.8 |
| **BR00113818** | **0.32** | 0.28 | 0.24 | **0.30** | **77.8** | 52.2 |
| **BR00113819** | **0.32** | 0.28 | 0.21 | **0.25** | **70** | 48.9 |
| **BR00112197repeat** | **0.47** | 0.41 | **0.37** | 0.31 | **92.2** | 63.3 |
| BR00113820 | 0.27 | **0.30** | 0.24 | **0.30** | **58.9** | 55.6 |
| BR00113821 | 0.15 | **0.24** | 0.16 | **0.22** | 38.9 | **47.8** |
| **BR00112197binned** | **0.41** | 0.41 | **0.34** | 0.30 | **91.1** | 58.9 |
shntnu commented 2 years ago

👀 🎊

EchteRobert commented 2 years ago

Experiment

Building on the setup of the previous experiment, I now train and evaluate a model on across-plate compound replicates. The training set consists of the same 3 plates: BR00112201, BR00112198, and BR00112204. The validation set contains only BR00112202, BR00112197standard, BR00113818, BR00113819, BR00112197repeat, and BR00112197binned. Note that I select only the plates that are close to the training set here, because I am considering across-plate correlations and the other 4 outlier plates rely on different features. I group the outlier plates into a separate validation set and compute results for it for completeness' sake, but I do not think this last set is useful for analysis due to its different feature importances.

I compute the baseline mAP (and PR) for these two sets using the mean aggregation method with across-plate compound replicates, and do the same using the model aggregation method.

Main takeaways

Next up

CrissCross mAP🔀

_Across plate compound correlations_

I do not report the PR, because all of these are (close to) 100 percent. I expect this to be due to the high number of replicates now being considered (perhaps I need to increase the number of samples used for the non-replicate correlation calculation?).

| plate set | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM |
|----------------|-------------------:|----------------:|---------------------:|------------------:|
| Training set | **0.48** | 0.30 | **0.35** | 0.30 |
| Validation set | **0.31** | 0.23 | **0.28** | 0.21 |
| Outlier set | 0.11 | **0.15** | 0.09 | **0.13** |

_Within plate compound correlations_

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
|:-------------------|---------------------:|------------------:|-----------------------:|--------------------:|-----------:|--------:|
| _Training plates_ | | | | | | |
| BR00112201 | **0.58** | 0.4 | **0.37** | 0.32 | **98.9** | 66.7 |
| BR00112198 | **0.53** | 0.35 | **0.34** | 0.3 | **97.8** | 56.7 |
| BR00112204 | **0.53** | 0.35 | **0.35** | 0.29 | **98.9** | 58.9 |
| _Validation plates_ | | | | | | |
| **BR00112202** | **0.43** | 0.34 | **0.36** | 0.3 | **88.9** | 54.4 |
| **BR00112197standard** | **0.46** | 0.4 | **0.39** | 0.28 | **92.2** | 56.7 |
| BR00112203 | 0.18 | **0.3** | 0.16 | **0.27** | 48.9 | **56.7** |
| BR00112199 | 0.28 | **0.32** | 0.18 | **0.28** | **68.9** | 57.8 |
| BR00113818 | 0.26 | **0.28** | 0.26 | **0.3** | **70** | 52.2 |
| BR00113819 | 0.25 | **0.28** | 0.19 | **0.25** | **72.2** | 48.9 |
| **BR00112197repeat** | **0.44** | 0.41 | **0.36** | 0.31 | **86.7** | 63.3 |
| BR00113820 | 0.25 | **0.3** | 0.2 | **0.3** | **64.4** | 55.6 |
| BR00113821 | 0.17 | **0.24** | 0.18 | **0.22** | 45.6 | **47.8** |
| BR00112197binned | 0.41 | 0.41 | **0.4** | 0.3 | **88.9** | 58.9 |
EchteRobert commented 2 years ago

Experiment

To see if my hypothesis* holds, I trained a model on 2 of the outlier plates (BR00113819 and BR00113821) and calculated the same performance metrics as before. The model was trained without creating pairs across plates, only within each plate.

*Training on plates which are similar according to the PC1 loadings plot will lead to poor performance of the model on plates which are dissimilar to the training plates.

Main takeaways

Next up

Time to evaluate on Stain3.

TableTime!

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
|:-------------------|---------------------:|------------------:|-----------------------:|--------------------:|-----------:|--------:|
| _Training plates_ | | | | | | |
| BR00113819 | **0.58** | 0.28 | **0.28** | 0.25 | **97.8** | 48.9 |
| BR00113821 | **0.59** | 0.24 | 0.22 | 0.22 | **96.7** | 47.8 |
| _Validation plates_ | | | | | | |
| BR00112202 | 0.33 | **0.34** | **0.34** | 0.3 | **80** | 54.4 |
| BR00112197standard | 0.32 | 0.4 | **0.34** | 0.28 | **78.9** | 56.7 |
| BR00112203 | 0.16 | **0.3** | 0.18 | **0.27** | 38.9 | **56.7** |
| BR00112199 | 0.17 | **0.32** | 0.16 | **0.28** | 40 | **57.8** |
| BR00113818 | **0.35** | 0.28 | 0.24 | **0.3** | **76.7** | 52.2 |
| BR00112198 | 0.27 | **0.35** | 0.28 | **0.3** | **66.7** | 56.7 |
| BR00112197repeat | 0.33 | **0.41** | **0.34** | 0.31 | **70** | 63.3 |
| BR00112204 | 0.28 | **0.35** | **0.35** | 0.29 | **66.7** | 58.9 |
| BR00113820 | **0.36** | 0.3 | 0.25 | **0.3** | **84.4** | 55.6 |
| BR00112197binned | 0.28 | **0.41** | 0.3 | 0.3 | **65.6** | 58.9 |
| BR00112201 | 0.38 | **0.4** | **0.34** | 0.32 | **86.7** | 66.7 |
EchteRobert commented 2 years ago

Evaluation

As an additional compound-level evaluation, I compared the mAP between the model and the benchmark for the 'within-cluster plates' (see the PC1 loadings plot for the cluster) to see whether specific compounds consistently perform worse or better with the model than with the benchmark.

Colorful bubble graph training compounds!

[screenshot: Screen Shot 2022-03-31 at 4 47 28 PM]
Colorful bubble graph validation compounds!

[screenshot: Screen Shot 2022-03-31 at 4 49 44 PM]
EchteRobert commented 2 years ago

Evaluation: Stain3-optimized model

After tuning a number of hyperparameters using Stain3 plates, I trained a model on Stain2 plates with the same hyperparameters and training methods, to see if this new setup is compatible across datasets. I changed the data used to calculate the validation loss, so that selecting the model with the best validation loss actually yields the best performance on the validation compounds. See https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/6#issuecomment-1095241531 for the discovery of this validation-loss issue and https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/6#issuecomment-1095206104 for the hyperparameter experiment details.

Main takeaways

Results

mAP table with last epoch model here!

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
|:-------------------|---------------------:|------------------:|-----------------------:|--------------------:|-----------:|--------:|
| _Training plates_ | | | | | | |
| BR00112201 | **0.81** | 0.4 | **0.47** | 0.32 | 100 | 66.7 |
| BR00112198 | **0.78** | 0.35 | **0.49** | 0.3 | 100 | 56.7 |
| BR00112204 | **0.82** | 0.35 | **0.42** | 0.29 | 100 | 58.9 |
| _Validation plates_ | | | | | | |
| BR00112202 | **0.52** | 0.34 | **0.35** | 0.3 | 94.4 | 54.4 |
| BR00112197standard | **0.54** | 0.4 | **0.44** | 0.28 | 95.6 | 56.7 |
| BR00112197repeat | **0.55** | 0.41 | **0.4** | 0.31 | 95.6 | 63.3 |
| BR00112197binned | **0.48** | 0.41 | **0.41** | 0.3 | 91.1 | 58.9 |
mAP table with best validation loss model here!

Numbers in bold are **better** than the last epoch model. Numbers in italic are _worse_.

| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
|:-------------------|---------------------:|------------------:|-----------------------:|--------------------:|-----------:|--------:|
| _Training plates_ | | | | | | |
| BR00112201 | 0.65 | 0.4 | _0.45_ | 0.32 | 98.9 | 66.7 |
| BR00112198 | 0.59 | 0.35 | 0.49 | 0.3 | 98.9 | 56.7 |
| BR00112204 | 0.59 | 0.35 | **0.46** | 0.29 | 100 | 58.9 |
| _Validation plates_ | | | | | | |
| BR00112202 | 0.48 | 0.34 | **0.37** | 0.3 | 95.6 | 54.4 |
| BR00112197standard | 0.51 | 0.4 | 0.44 | 0.28 | 93.3 | 56.7 |
| BR00112197repeat | 0.49 | 0.41 | **0.47** | 0.31 | 93.3 | 63.3 |
| BR00112197binned | 0.46 | 0.41 | 0.41 | 0.3 | 85.6 | 58.9 |