I will start off with plates SQ00015231, SQ00015232, and SQ00015233. First I will have to remove all low dose (<3.33 µM) wells from each plate. Then I will train the model as usual, using the highest dose (10 µM) replicates for the contrastive learning objective. Finally, the model will be evaluated on MoA prediction using the highest dose (10 µM) wells. I will not use the 3.33 µM wells during training/validation and only use those during the final testing phase.
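For reference, a minimal sketch of this dose filtering, assuming per-well profiles in a pandas DataFrame; the file and column names here are assumptions, not the actual pipeline:

```python
import pandas as pd

# Hypothetical file/column names; LINCS stores the dose in µM in a
# metadata column (something like Metadata_mmoles_per_liter).
profiles = pd.read_csv("SQ00015231_normalized.csv")
dose = profiles["Metadata_mmoles_per_liter"].round(2)

train_val = profiles[dose == 10.0]     # 10 µM wells: training + validation
test_holdout = profiles[dose == 3.33]  # 3.33 µM wells: final test phase only
# Everything below 3.33 µM is discarded entirely.
```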
I have now updated the scripts to work for the LINCS data, but I still need to filter the dose points before training. This is next up on the list.
I have already trained a model using the original setup but without filtering doses. This means that replicate compound profiles are forced to attract during training, even if they were created with different doses.
The replicate mAP was very high, but no improvement was found in MoA prediction (both were also evaluated using different dose points).
I trained a model on plates SQ00015231, SQ00015232, and SQ00015233 using only the 10 µM dose point, this time forming replicate pairs between compounds across plates. The latter is new compared to previous experiments. Earlier, when experimenting on the Stain datasets, I found that training on across-plate replicates reduced the model's ability to generalize to held-out compounds; this may also hurt the model's ability to predict MoAs. It's possible that more plates are needed for generalization anyway (3 plates did not work well for the Stain data either).
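A minimal sketch of how such across-plate replicate pairs could be enumerated for the contrastive objective; the column names follow common LINCS metadata conventions but are assumptions here:

```python
from itertools import combinations
import pandas as pd

def across_plate_pairs(df: pd.DataFrame):
    """Enumerate positive pairs for the contrastive objective: two wells
    of the same compound taken from *different* plates."""
    pairs = []
    for _, group in df.groupby("pert_iname"):
        for i, j in combinations(group.index, 2):
            # Only keep pairs whose wells come from different plates.
            if group.loc[i, "Metadata_Plate"] != group.loc[j, "Metadata_Plate"]:
                pairs.append((i, j))
    return pairs
```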
The model improves replicate prediction significantly: it is nearly perfect (1.0 mAP). However, as expected, it does not improve MoA prediction. I will probably have to use more plates during training to improve generalization.
It looks like the LINCS results were generated by
From the manuscript:
"Because the negative control DMSO profiles were noisy due to technical artifacts, we applied a spherize transform (also known as whitening) to mitigate the impact of well positioning. More specifically, we used the zero-phase whitening filters (ZCA) solution calculated on the profile correlation matrix (ZCA-cor) to minimize the absolute distance between the transformed profiles and the untransformed profiles. We also formed consensus signatures (level 5) by moderated z-score (MODZ) aggregating all replicate wells across plate maps into a single signature. We applied feature selection to the consensus signatures and batch effect corrected profiles separately using the same operations as described above. We applied the same pipeline to batch 1 (A549) and batch 2 (A549, MCF7, and U2OS) Cell Painting datasets."
Using plates SQ00015224, SQ00015223, SQ00015230, SQ00015231, SQ00015229, SQ00015233, and SQ00015232 as the training data, and considering across-plate replicates, the model was trained in the same way as in Exp. 1. Using more plates during training is expected to improve MoA prediction performance compared to Exp. 1.
The model is still not generalizing to MoA prediction. This is probably due to the lack of data. In the next experiment I will try to use ~30 plates.
The model is converging more slowly than the previous models, I think because there are more compounds available for replicate training, which increases the training task's complexity. We can also see that more MoAs are taken into account in the evaluation of the MoA prediction task. The model beats benchmark performance on the replicating task, but not on MoA prediction, where performance remains near random (as in Exp. 1). Due to the filtering of lower dose points, the total number of samples is only about the size of one plate (~360 samples). This could explain the model's lack of generalization; I may need at least 5 times as many plates (~30).
I have now moved the Python environment and all of the data to John's server to accommodate the larger memory requirements of the LINCS dataset. Effectively, these 26 plates only correspond to ~4 plates in training data, which is nowhere close to the 15 I used for training the final model on the Stain datasets. I documented all the steps I had to take in the README of this repository for future reference. The model was trained using the same default hyperparameters as before. The updated pipeline uses 1781 features instead of 1783, possibly because some Image or Metadata features are now filtered out correctly.
Something went wrong during training. I expect the problem to lie with the data. I will investigate the issue and improve the training pipeline.
After investigating a bit, the issue might be related to the same compound getting different labels.
The loss curves show that the model was not able to learn the task correctly. The validation mAP does keep increasing (although on a much smaller scale than usual), which may indicate that the training procedure itself is correct. I will investigate the training data to see if something is wrong there.
I resolved the issue and reran the experiment as described in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/12#issuecomment-1236726249. I now aggregate all the wells from all the plates first, then add the perturbation and MoA information per well and remove any wells that lack perturbation information. Finally, I remove all compounds that occur only once in the entire dataset (a sketch of this filtering follows the numbers below). This results in:
Removed 318 wells due to missing annotation of pert_iname and moa.
Removed 663 unique compound wells.
Using 458 wells
Note that this is only equal to ~1.5 plates' worth of data!
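The filtering sketch mentioned above, assuming per-plate profile CSVs and a separate annotation table; all file names and the merge key are hypothetical:

```python
import pandas as pd

plate_files = ["SQ00015231.csv", "SQ00015232.csv"]         # hypothetical paths
annotations = pd.read_csv("repurposing_annotations.csv")   # hypothetical file

# Aggregate all wells from all plates into one table first.
wells = pd.concat([pd.read_csv(p) for p in plate_files], ignore_index=True)

# Add perturbation and MoA information per well, then drop unannotated wells.
wells = wells.merge(annotations[["Metadata_broad_sample", "pert_iname", "moa"]],
                    on="Metadata_broad_sample", how="left")
n0 = len(wells)
wells = wells.dropna(subset=["pert_iname", "moa"])
print(f"Removed {n0 - len(wells)} wells due to missing annotation of pert_iname and moa.")

# Remove compounds occurring only once: a single well cannot form a replicate pair.
counts = wells["pert_iname"].value_counts()
singletons = counts[counts == 1].index
print(f"Removed {wells['pert_iname'].isin(singletons).sum()} unique compound wells.")
wells = wells[~wells["pert_iname"].isin(singletons)]
print(f"Using {len(wells)} wells")
```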
The model is still no better at MoA prediction than the benchmark. In previous experiments, the model only started beating that benchmark once I used more than ~9 plates. Another option would be to add more data augmentation, i.e., random sampling from the wells.
The model now trains correctly and we can see that the loss curves converge properly. Percent replicating is nearly perfect, but the model still does not improve MoA prediction compared to the benchmark.
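For context, a rough sketch of a percent replicating-style metric: the fraction of compounds whose median pairwise replicate correlation exceeds the 95th percentile of a null built from random non-replicate groups. The fixed null group size of 4 is an assumption matching the replicate count in this dataset:

```python
import numpy as np
import pandas as pd

def percent_replicating(feats: np.ndarray, compounds: pd.Series,
                        n_null: int = 1000, q: float = 95.0, seed: int = 0) -> float:
    """feats: (n_wells, n_features); compounds: per-well compound labels."""
    corr = np.corrcoef(feats)            # well-by-well correlation matrix
    rng = np.random.default_rng(seed)

    def median_pairwise(ix):
        sub = corr[np.ix_(ix, ix)]
        return np.median(sub[np.triu_indices(len(ix), k=1)])

    groups = [np.flatnonzero(compounds.values == c) for c in compounds.unique()]
    scores = [median_pairwise(ix) for ix in groups if len(ix) > 1]
    # Null distribution: random groups of 4 wells (4 = typical replicate count here).
    null = [median_pairwise(rng.choice(len(feats), size=4, replace=False))
            for _ in range(n_null)]
    return float(np.mean(np.asarray(scores) > np.percentile(null, q)))
```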
I added more plates and repeated the experiment. The number of wells used is still only equal to ~3 plates total. I will have to add even more to make this work.
Removed 477 wells due to missing annotation of pert_iname and moa.
Removed 624 unique compound wells.
Using 1008 wells
Removed 958 wells due to missing annotation of pert_iname and moa.
Removed 226 unique compound wells.
Using 3306 wells
Loading 81 plates with 783 unique compounds.
Similar to the results from Experiment 5.
I will try some different training approaches to see if I can improve model performance further. However, it's possible that this small increase is all we can get, given the large number of different compounds in the dataset. Things I want to try:
I will do the last step last, as having to re-optimize these parameters would suggest that the current hyperparameter settings are not robust to new data.
I recalculated the results with an updated eval script. A randomly shuffled baseline is also calculated and subtracted from the mAP scores. However, here I divided the mAP by the sum of the labels (i.e., the number of positive samples in the ranking). This was incorrect, and the mAP is recalculated in a new comment below (https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/12#issuecomment-1257644268).
I now filtered the training data to keep only compounds with at least 4 replicates (though 4 is the highest replicate count I found in this dataset).
The results are worse than using all data. I will try training with minimum replicates = 3, and if that does not work I will always include all data in the future.
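A one-line sketch of this minimum-replicate filter, continuing from the `wells` table in the earlier sketch:

```python
# Keep only compounds with at least min_replicates wells (tried 4 first, then 3).
min_replicates = 3
wells = wells.groupby("pert_iname").filter(lambda g: len(g) >= min_replicates)
```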
To reduce complexity, I train on only 10% of the currently preprocessed data (one way to draw such a subset is sketched below). The idea is to mimic the complexity of the JUMP data. However, the amount of data is drastically reduced with respect to JUMP, as there are no replicate plates of the same compounds, which caps the total number of replicates of a single compound at 4 instead of 4 times the number of replicate plates.
Due to the lack of training samples, this setup is harder than training with more distinct compounds but correspondingly more samples. In the future, I will use all samples I have at my disposal.
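For reference, one hypothetical way to draw such a 10% subset is at the compound level, keeping all wells of each sampled compound:

```python
# Sample 10% of the unique compounds and keep all of their wells.
sampled = wells["pert_iname"].drop_duplicates().sample(frac=0.10, random_state=0)
train_subset = wells[wells["pert_iname"].isin(sampled)]
```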
I now calculate the mAP by correcting it with the random baseline:
mAP = mAP - sum(labels)/len(labels)
This increases the mAP again as the random baseline is very low: ~0.00086 on average.
The mAP is now closer to what would be expected from the results in the LINCS manuscript, where the majority of values fall between 0 and 0.1.
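A minimal sketch of this baseline correction for a single query, using scikit-learn's average precision:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def corrected_ap(labels: np.ndarray, scores: np.ndarray) -> float:
    # The random baseline of AP is the prevalence of positives:
    # sum(labels) / len(labels), i.e. the expected AP of a random ranking.
    return average_precision_score(labels, scores) - labels.sum() / len(labels)

# Toy example: 2 true replicates among 1000 candidates, perfectly ranked first.
labels = np.zeros(1000)
labels[:2] = 1
scores = np.linspace(1.0, 0.0, 1000)
print(corrected_ap(labels, scores))  # ~0.998 (= 1.0 - 2/1000)
```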
I reran Experiment 6, but now sampling 8 sets of cells per compound (instead of 4). To keep the effective batch size the same, I reduced the batch size from 72 to 36. The model crashed near the end, so the results are inconclusive, but they may still give an idea of where this type of training is headed.
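A minimal sketch of this sampling augmentation; the subset size and exact mechanics are assumptions, only the number of sets per compound (4 vs. 8) comes from the experiment:

```python
import numpy as np

def sample_cell_sets(cells: np.ndarray, n_sets: int = 8,
                     set_size: int = 400, seed: int = 0):
    """Draw n_sets random subsets of single-cell features for one compound;
    each subset becomes one aggregated training view (a form of augmentation).
    set_size is a hypothetical parameter."""
    rng = np.random.default_rng(seed)
    k = min(set_size, len(cells))
    return [cells[rng.choice(len(cells), size=k, replace=False)]
            for _ in range(n_sets)]
```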
Although training loss and replicate prediction performance are worse than in Experiment 6, and even though training was not completed, the mAP for MoA prediction is higher than before (0.0658 vs. 0.0617). This suggests that adding more data augmentation benefits generalization.
I reran the evaluation after completing the full training loop (100 epochs) for the model described in Experiment 9. Because I also preprocessed all remaining plates overnight, the evaluation now includes plates that were not used in training, and therefore some compounds the model has never seen. This may deflate the mAP numbers somewhat.
As in the preliminary run, training loss and replicate prediction performance are worse than in Experiment 6, but the mAP for MoA prediction is higher. This again suggests that more data augmentation benefits generalization.
Train a model on ALL LINCS data!
LINCS contains 6 dose points: 0.04 µM, 0.12 µM, 0.37 µM, 1.11 µM, 3.33 µM, and 10 µM. For my experiments, I will use the highest dose (10 µM) for the training and validation sets. The model is trained to create profiles that attract replicate compound profiles and repel non-replicate compound profiles. It is then validated by evaluating the ability of these profiles to predict MoAs (or find sister compounds). Finally, the model will be tested on the 3.33 µM dose point as a hold-out set. This data should look significantly different from the training and validation data.
I will follow the same data exclusion protocol as Michael did in his research:
Which should result in similar data numbers:
I have put some relevant quotes from Michael's work and the LINCS manuscript here: https://docs.google.com/document/d/1z2U5o91vzBwB-4xtryYn5d3kSWJi8ZSerE_MAzdLT_0/edit#heading=h.sbi6l2r6p5ec