Open EchteRobert opened 2 years ago
One major issue, as pointed out above, is that the feature dimension changes from batch to batch. Ideally, the model would be invariant to the input size altogether. Another possible solution is to perform feature selection before feeding the data to the network, but this is time-consuming and I simply do not have the storage capacity to keep creating new datasets. Instead, I will try to make the model invariant to the input size.
An input-size-invariant model is created by applying adaptive max pooling (output feature size 800) along the feature dimension. This experiment tests whether this model can still achieve a performance similar to the previous best model. Interestingly, given the feature-wise normalization of the data before it is fed to the model, a max pooling operation at the start of the model does nothing more than semi-randomly select one of the features within each pooling window.
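As a minimal sketch of the idea (NumPy stand-in for PyTorch's `nn.AdaptiveMaxPool1d`, with the same bin boundaries; the function name and sizes here are illustrative, not the repo's actual code): any input width is collapsed to the same fixed width, and because each feature is normalized beforehand, the max within a bin is effectively a quasi-random pick among that bin's features.

```python
import numpy as np

def adaptive_max_pool(features, out_size):
    """Pool a 1-D feature vector down to out_size bins, taking the max in
    each bin. Bin edges mirror PyTorch's AdaptiveMaxPool1d:
    start = floor(i*n/out_size), end = ceil((i+1)*n/out_size)."""
    n = len(features)
    pooled = np.empty(out_size)
    for i in range(out_size):
        start = (i * n) // out_size
        end = -(-((i + 1) * n) // out_size)  # ceil division
        pooled[i] = features[start:end].max()
    return pooled

# Different input widths map to the same fixed width of 800:
print(adaptive_max_pool(np.random.randn(1938), 800).shape)  # (800,)
print(adaptive_max_pool(np.random.randn(1433), 800).shape)  # (800,)
```

With feature-wise normalization, all features in a bin are on the same scale, so which one "wins" the max is essentially arbitrary; this is the semi-random selection noted above.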
Loss curves
PR on CPJUMP1 compounds data
PR on Stain2 binned
PR on Stain2 confocal
The first results here were generated by the last model presented (BS64) in https://github.com/broadinstitute/FeatureAggregation_single_cell/issues/1. Note that the Stain2 confocal data contains 1433 features instead of the 1938 in CPJUMP1, so some features are randomly selected and repeated to match the input width the model expects.
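The random select-and-repeat step could look something like the following (a hypothetical sketch, not the repo's actual implementation; the function name and the choice of sampling with replacement are assumptions):

```python
import numpy as np

def pad_features(profile, target_size, rng):
    """Pad a profile that has fewer features than the model expects
    (e.g. Stain2's 1433 vs CPJUMP1's 1938) by appending randomly
    selected repeats of its existing features."""
    n = profile.shape[-1]
    # Indices of features to duplicate, sampled with replacement.
    extra = rng.choice(n, size=target_size - n, replace=True)
    return np.concatenate([profile, profile[..., extra]], axis=-1)

rng = np.random.default_rng(0)
stain2_profile = np.random.randn(1433)
print(pad_features(stain2_profile, 1938, rng).shape)  # (1938,)
```

Because the repeated features are normalized copies of existing ones, the adaptive max pooling at the model's input should be largely insensitive to which features were duplicated.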
Main takeaways
As expected, the model does not directly translate to the Stain2 dataset. I expect the cause to be two-fold:
Next steps: find a way for the model to generalize to different features and compounds.
_Model trained on CPJUMP1 compound data, on Stain2\_Batch2Binned PR_ // BENCHMARK: 0.6 PS
_Model trained on CPJUMP1 compound data, on Stain2\_Batch2Confocal PR_ // BENCHMARK: 0.533