I will start off with plates SQ00015231, SQ00015232, and SQ00015233. First I will have to remove all low dose (<3.33 µM) wells from each plate. Then I will train the model as usual, using the highest dose (10 µM) replicates for the contrastive learning objective. Finally, the model will be evaluated on MoA prediction using the highest dose (10 µM) wells. I will not use the 3.33 µM wells during training/validation and only use those during the final testing phase.
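For reference, a minimal sketch of this dose filtering, assuming per-well profiles in a pandas DataFrame; the file and column names here are assumptions, not the actual pipeline:

```python
import pandas as pd

# Hypothetical file/column names; LINCS stores the dose in µM in a
# metadata column (something like Metadata_mmoles_per_liter).
profiles = pd.read_csv("SQ00015231_normalized.csv")
dose = profiles["Metadata_mmoles_per_liter"].round(2)

train_val = profiles[dose == 10.0]     # 10 µM wells: training + validation
test_holdout = profiles[dose == 3.33]  # 3.33 µM wells: final test phase only
# Everything below 3.33 µM is discarded entirely.
```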
I have now updated the scripts to work for the LINCS data, but I still need to filter the dose points before training. This is next up on the list.
I have already trained a model using the original setup but without filtering doses. This means that replicate compound profiles are forced to attract during training, even if they were created with different doses.
The replicate mAP was very high, but no improvement was found in MoA prediction (both were also evaluated using different dose points).
I trained a model on plates SQ00015231, SQ00015232, and SQ00015233 using only the 10 µM dose point, this time forming replicate pairs between compounds across plates. The latter is new compared to previous experiments. Earlier, when experimenting on the Stain datasets, I found that training on across-plate replicates reduced the model's ability to generalize to held-out compounds; this may also hurt the model's ability to predict MoAs. It's possible that more plates are needed for generalization anyway (3 plates did not work well for the Stain data either).
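A minimal sketch of how such across-plate replicate pairs could be enumerated for the contrastive objective; the column names follow common LINCS metadata conventions but are assumptions here:

```python
from itertools import combinations
import pandas as pd

def across_plate_pairs(df: pd.DataFrame):
    """Enumerate positive pairs for the contrastive objective: two wells
    of the same compound taken from *different* plates."""
    pairs = []
    for _, group in df.groupby("pert_iname"):
        for i, j in combinations(group.index, 2):
            # Only keep pairs whose wells come from different plates.
            if group.loc[i, "Metadata_Plate"] != group.loc[j, "Metadata_Plate"]:
                pairs.append((i, j))
    return pairs
```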
The model improves replicate prediction significantly: it is nearly perfect (1.0 mAP). However, as expected, it does not improve MoA prediction. I will probably have to use more plates during training to improve generalization.
It looks like the LINCS results were generated by
From the manuscript:
"Because the negative control DMSO profiles were noisy due to technical artifacts, we applied a spherize transform (also known as whitening) to mitigate the impact of well positioning. More specifically, we used the zero-phase whitening filters (ZCA) solution calculated on the profile correlation matrix (ZCA-cor) to minimize the absolute distance between the transformed profiles and the untransformed profiles. We also formed consensus signatures (level 5) by moderated z-score (MODZ) aggregating all replicate wells across plate maps into a single signature. We applied feature selection to the consensus signatures and batch effect corrected profiles separately using the same operations as described above. We applied the same pipeline to batch 1 (A549) and batch 2 (A549, MCF7, and U2OS) Cell Painting datasets."
Using plates SQ00015224, SQ00015223, SQ00015230, SQ00015231, SQ00015229, SQ00015233, and SQ00015232 as the training data, and considering across-plate replicates, the model was trained in the same way as in Exp. 1. Using more plates during training is expected to improve MoA prediction performance compared to Exp. 1.
The model is still not generalizing to MoA prediction. This is probably due to the lack of data. In the next experiment I will try to use ~30 plates.
The model is converging more slowly than the previous models, I think because there are more compounds available for replicate training, which increases the training task's complexity. We can also see that more MoAs are taken into account in the evaluation of the MoA prediction task. The model beats benchmark performance on the replicating task, but not on MoA prediction, where performance remains near random (as in Exp. 1). Due to the filtering of lower dose points, the total number of samples is only about the size of one plate (~360 samples). This could explain the model's lack of generalization; I may need at least 5 times as many plates (~30).
I have now moved the Python environment and all of the data to John's server to accommodate the larger memory requirements of the LINCS dataset. Effectively, these 26 plates only correspond to ~4 plates in training data, which is nowhere close to the 15 I used for training the final model on the Stain datasets. I documented all the steps I had to take in the README of this repository for future reference. The model was trained using the same default hyperparameters as before. The updated pipeline uses 1781 features instead of 1783, possibly because some Image or Metadata features are now filtered out correctly.
Something went wrong during training. I expect the problem to lie with the data. I will investigate the issue and improve the training pipeline.
After investigating a bit, the issue might be related to the same compound getting different labels.
The loss curves show that the model was not able to learn the task correctly. The validation mAP does keep increasing (although on a much smaller scale than usual), which may indicate that the training procedure itself is correct. I will investigate the training data to see if something is wrong there.
I resolved the issue and reran the experiment as described in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/12#issuecomment-1236726249. I now aggregate all the wells from all the plates first, then add the perturbation and MoA information per well and remove any wells that lack perturbation information. Finally, I remove all compounds that occur only once in the entire dataset (a sketch of this filtering follows the numbers below). This results in:
Removed 318 wells due to missing annotation of pert_iname and moa.
Removed 663 unique compound wells.
Using 458 wells
Note that this is only equal to ~1.5 plates' worth of data!
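The filtering sketch mentioned above, assuming per-plate profile CSVs and a separate annotation table; all file names and the merge key are hypothetical:

```python
import pandas as pd

plate_files = ["SQ00015231.csv", "SQ00015232.csv"]         # hypothetical paths
annotations = pd.read_csv("repurposing_annotations.csv")   # hypothetical file

# Aggregate all wells from all plates into one table first.
wells = pd.concat([pd.read_csv(p) for p in plate_files], ignore_index=True)

# Add perturbation and MoA information per well, then drop unannotated wells.
wells = wells.merge(annotations[["Metadata_broad_sample", "pert_iname", "moa"]],
                    on="Metadata_broad_sample", how="left")
n0 = len(wells)
wells = wells.dropna(subset=["pert_iname", "moa"])
print(f"Removed {n0 - len(wells)} wells due to missing annotation of pert_iname and moa.")

# Remove compounds occurring only once: a single well cannot form a replicate pair.
counts = wells["pert_iname"].value_counts()
singletons = counts[counts == 1].index
print(f"Removed {wells['pert_iname'].isin(singletons).sum()} unique compound wells.")
wells = wells[~wells["pert_iname"].isin(singletons)]
print(f"Using {len(wells)} wells")
```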
The model is still no better at MoA prediction than the benchmark. In previous experiments, the model only started beating that benchmark once I used more than ~9 plates. Another option would be to add more data augmentation, i.e., random sampling from the wells.
The model now trains correctly and we can see that the loss curves converge properly. Percent replicating is nearly perfect, but the model still does not improve MoA prediction compared to the benchmark.
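For context, a rough sketch of a percent replicating-style metric: the fraction of compounds whose median pairwise replicate correlation exceeds the 95th percentile of a null built from random non-replicate groups. The fixed null group size of 4 is an assumption matching the replicate count in this dataset:

```python
import numpy as np
import pandas as pd

def percent_replicating(feats: np.ndarray, compounds: pd.Series,
                        n_null: int = 1000, q: float = 95.0, seed: int = 0) -> float:
    """feats: (n_wells, n_features); compounds: per-well compound labels."""
    corr = np.corrcoef(feats)            # well-by-well correlation matrix
    rng = np.random.default_rng(seed)

    def median_pairwise(ix):
        sub = corr[np.ix_(ix, ix)]
        return np.median(sub[np.triu_indices(len(ix), k=1)])

    groups = [np.flatnonzero(compounds.values == c) for c in compounds.unique()]
    scores = [median_pairwise(ix) for ix in groups if len(ix) > 1]
    # Null distribution: random groups of 4 wells (4 = typical replicate count here).
    null = [median_pairwise(rng.choice(len(feats), size=4, replace=False))
            for _ in range(n_null)]
    return float(np.mean(np.asarray(scores) > np.percentile(null, q)))
```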
I added more plates and repeated the experiment. The number of wells used is still only equal to ~3 plates total. I will have to add even more to make this work.
Removed 477 wells due to missing annotation of pert_iname and moa.
Removed 624 unique compound wells.
Using 1008 wells
Removed 958 wells due to missing annotation of pert_iname and moa.
Removed 226 unique compound wells.
Using 3306 wells
Loading 81 plates with 783 unique compounds.
Similar to the results from Experiment 5.
I will try some different training approaches to see if I can improve model performance further. However, it's possible that this small increase is all we can get, given the large number of different compounds in the dataset. Things I want to try:
I will do the last step last, as having to re-optimize these parameters would suggest that the current hyperparameter settings are not robust to new data.
I recalculated the results with an updated eval script. A randomly shuffled baseline is also calculated and subtracted from the mAP scores. However, here I divided the mAP by the sum of the labels (i.e., the number of positive samples in the ranking). This was incorrect, and the mAP is recalculated in a new comment below (https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/12#issuecomment-1257644268).
I now filtered the training data to keep only compounds with at least 4 replicates (though 4 is the highest replicate count I found in this dataset).
The results are worse than using all data. I will try training with minimum replicates = 3, and if that does not work I will always include all data in the future.
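A one-line sketch of this minimum-replicate filter, continuing from the `wells` table in the earlier sketch:

```python
# Keep only compounds with at least min_replicates wells (tried 4 first, then 3).
min_replicates = 3
wells = wells.groupby("pert_iname").filter(lambda g: len(g) >= min_replicates)
```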
To reduce complexity, I train on only 10% of the currently preprocessed data (one way to draw such a subset is sketched below). The idea is to mimic the complexity of the JUMP data. However, the amount of data is drastically reduced with respect to JUMP, as there are no replicate plates of the same compounds, which caps the total number of replicates of a single compound at 4 instead of 4 times the number of replicate plates.
Due to the lack of training samples, this setup is harder than training with more distinct compounds but correspondingly more samples. In the future, I will use all samples I have at my disposal.
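For reference, one hypothetical way to draw such a 10% subset is at the compound level, keeping all wells of each sampled compound:

```python
# Sample 10% of the unique compounds and keep all of their wells.
sampled = wells["pert_iname"].drop_duplicates().sample(frac=0.10, random_state=0)
train_subset = wells[wells["pert_iname"].isin(sampled)]
```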
I now calculate the mAP by correcting it with the random baseline:
mAP = mAP - sum(labels)/len(labels)
This increases the mAP again as the random baseline is very low: ~0.00086 on average.
The mAP is now closer to what would be expected from the results in the LINCS manuscript, where the majority of values fall between 0 and 0.1.
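A minimal sketch of this baseline correction for a single query, using scikit-learn's average precision:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def corrected_ap(labels: np.ndarray, scores: np.ndarray) -> float:
    # The random baseline of AP is the prevalence of positives:
    # sum(labels) / len(labels), i.e. the expected AP of a random ranking.
    return average_precision_score(labels, scores) - labels.sum() / len(labels)

# Toy example: 2 true replicates among 1000 candidates, perfectly ranked first.
labels = np.zeros(1000)
labels[:2] = 1
scores = np.linspace(1.0, 0.0, 1000)
print(corrected_ap(labels, scores))  # ~0.998 (= 1.0 - 2/1000)
```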
I reran Experiment 6, but now sampling 8 sets of cells per compound (instead of 4). To keep the effective batch size the same, I reduced the batch size from 72 to 36. The model crashed near the end, so the results are inconclusive, but they may still give an idea of where this type of training is headed.
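A minimal sketch of this sampling augmentation; the subset size and exact mechanics are assumptions, only the number of sets per compound (4 vs. 8) comes from the experiment:

```python
import numpy as np

def sample_cell_sets(cells: np.ndarray, n_sets: int = 8,
                     set_size: int = 400, seed: int = 0):
    """Draw n_sets random subsets of single-cell features for one compound;
    each subset becomes one aggregated training view (a form of augmentation).
    set_size is a hypothetical parameter."""
    rng = np.random.default_rng(seed)
    k = min(set_size, len(cells))
    return [cells[rng.choice(len(cells), size=k, replace=False)]
            for _ in range(n_sets)]
```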
Although training loss and replicate prediction performance are worse than in Experiment 6, and even though training was not completed, the mAP for MoA prediction is higher than before (0.0658 vs. 0.0617). This suggests that adding more data augmentation benefits generalization.
I reran the evaluation after completing the full training loop (100 epochs) for the model described in Experiment 9. Because I also preprocessed all remaining plates overnight, the evaluation now includes plates that were not used in training, and therefore some compounds the model has never seen. This may deflate the mAP numbers somewhat.
As in the preliminary run, training loss and replicate prediction performance are worse than in Experiment 6, but the mAP for MoA prediction is higher. This again suggests that more data augmentation benefits generalization.
Train a model on ALL LINCS data!
LINCS contains 6 dose points: 0.04 µM, 0.12 µM, 0.37 µM, 1.11 µM, 3.33 µM, and 10 µM. For my experiments, I will use the highest dose (10 µM) for the training and validation sets. The model is trained to create profiles that attract replicate compound profiles and repel non-replicate compound profiles. It is then validated by evaluating the ability of these profiles to predict MoAs (or find sister compounds). Finally, the model will be tested on the 3.33 µM dose point as a hold-out set. This data should look significantly different from the training and validation data.
I will follow the same data exclusion protocol as Michael did in his research:
Which should result in similar data numbers:
I have put some relevant quotes from Michael's work and the LINCS manuscript here: https://docs.google.com/document/d/1z2U5o91vzBwB-4xtryYn5d3kSWJi8ZSerE_MAzdLT_0/edit#heading=h.sbi6l2r6p5ec