Data Processing and Machine Learning Approaches

This issue discusses the machine learning approach for predicting cell health labels from cell painting features.

Goal

To what extent can we predict certain cell health outcomes from cell painting features? The cell health outcomes are described in feature_mapping_annotated.

Data

We have cell painting and cell health readout data for the same three cell lines (A549, ES2, and HCC44). For each data type, collaborators used CRISPR to knock down a total of 59 genes and controls using 119 different guides.

Cell Painting

Cell painting data were acquired across these guides and cell lines, with about 6 replicates per guide. Because we cannot map wells between experiments and can only compare at the condition level, we collapsed the replicates of each guide into a median profile.

This resulted in a 357 x 247 (profiles by features) matrix, where 357 = 119 guides x 3 cell lines.
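The median collapse described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the repository's code: the record layout, the function name `collapse_replicates`, and the toy values are all assumptions made for the example. In practice a dataframe group-by with a median aggregation would likely do the same job.

```python
from collections import defaultdict
from statistics import median


def collapse_replicates(records):
    """Collapse replicate profiles into one median profile per (cell_line, guide).

    `records` is a list of (cell_line, guide, feature_values) tuples. This
    layout is hypothetical and chosen only for illustration.
    """
    groups = defaultdict(list)
    for cell_line, guide, features in records:
        groups[(cell_line, guide)].append(features)

    collapsed = {}
    for key, replicate_rows in groups.items():
        # zip(*rows) transposes replicates x features into features x replicates,
        # so each feature's median is taken across the replicates.
        collapsed[key] = [median(column) for column in zip(*replicate_rows)]
    return collapsed


# Toy data with made-up values and two features (not real measurements).
toy_records = [
    ("A549", "guide_1", [1.0, 10.0]),
    ("A549", "guide_1", [3.0, 30.0]),
    ("A549", "guide_1", [2.0, 20.0]),
    ("ES2", "guide_1", [5.0, 50.0]),
]
profiles = collapse_replicates(toy_records)
```

Each (cell line, guide) pair ends up as one row, which is how 119 guides across 3 cell lines yield 357 profiles.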

Cell Health

Cell health assay readouts were collected by other collaborators and represent 72 different cell health readouts. There were about 4 replicates per guide, and we median-collapsed these measurements for each guide in the same way.

The final cell health target matrix was 357 x 72.

Training and Testing Split

In #22, we split the data randomly into 85% training and 15% testing sets.
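A random 85/15 split of this kind can be sketched as below. The exact procedure used in #22 is not shown here, so the seed, the rounding of the test-set size, and the helper name `split_indices` are assumptions. Splitting row indices rather than the matrices themselves means the same split can be applied to both the cell painting matrix and the cell health matrix, since their 357 rows describe the same conditions.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.15, seed=0):
    """Randomly partition row indices into training and testing sets."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    # Rounding the test size up is an assumption; other conventions exist.
    n_test = math.ceil(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# 357 median profiles (119 guides x 3 cell lines).
train_idx, test_idx = split_indices(357)
```

With 357 profiles, this rounding leaves 303 profiles for training and 54 for testing.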

Machine Learning

Our goal is to assess how well the cell painting features can capture the signal of each of the 72 cell health outputs.

We approached this machine learning problem in three different ways:

1. Use the raw cell health measurements
2. Transform the cell health measurements to a scale between 0 and 1
3. Binarize the cell health labels into high/low
   - I am currently using k-means to find two clusters in each of the 72 cell health labels independently.

The first two approaches require a regression approach, while the third can be approached as a classification problem.
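The second and third target transformations can be illustrated for a single cell health readout as follows. This is a rough, dependency-free sketch under stated assumptions: the function names, the min/max initialization of the two cluster centers, and the toy values are hypothetical, and an off-the-shelf k-means implementation may initialize differently. The functions would be applied to each of the 72 readouts independently.

```python
def min_max_scale(values):
    """Rescale one readout to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant readout carries no signal; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def two_means_binarize(values, max_iter=100):
    """One-dimensional k-means with k=2; returns 1 for 'high', 0 for 'low'."""
    # Assumed initialization: start the centers at the extremes.
    low_center, high_center = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        labels = [
            int(abs(v - high_center) < abs(v - low_center)) for v in values
        ]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not high:
            break  # all values identical, nothing to separate
        # Update step: move each center to the mean of its members.
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_center, high_center):
            break  # converged
        low_center, high_center = new_low, new_high
    return labels


# Toy readout with made-up values (not real measurements).
readout = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
scaled = min_max_scale(readout)
labels = two_means_binarize(readout)
```

The scaled values stay continuous and suit a regression model, while the 0/1 labels define the classification target.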
