Populations of cells can be perturbed by various chemical and genetic treatments, and the impact on the cells’ gene expression (transcription, i.e. mRNA levels) and morphology (in an image-based assay) can be measured in high dimensions. The patterns observed in these data can be used for more than a dozen applications in drug discovery and basic biology research. We provide a collection of four datasets where both gene expression and morphological data are available; roughly a thousand features are measured for each data type, across more than 28,000 chemical and genetic perturbations. We have defined a set of biological problems that can be investigated using these two data modalities and provided baseline analyses and evaluation metrics for addressing each.
We have gathered the following five available datasets that have both Cell Painting morphological (CP) and L1000 gene expression (GE) profiles, and preprocessed the data, which came from different sources and in different formats, into a unified .csv format.
Preprocessed profiles (~9.5 GB) are available in an S3 bucket. They can be downloaded at no cost and without registration of any sort, using the command:
aws s3 sync \
--no-sign-request \
s3://cellpainting-gallery/cpg0003-rosetta/broad/workspace/preprocessed_data .
See this wiki for sample Cell Painting images and the meaning of (CellProfiler-derived) Cell Painting features.
The ETags of these files are listed here.
They were generated using:
aws s3api list-objects --bucket cellpainting-gallery --prefix rosetta/broad/workspace/preprocessed_data/
We gathered four available datasets that have both Cell Painting morphological (CP) and L1000 gene expression (GE) profiles, preprocessed the data, which came from different sources and in different formats, into a unified .csv format, and made the data publicly available. Single-cell morphological (CP) profiles were created using CellProfiler software and aggregated to the replicate and treatment levels using the R package cytominer. We made the following three types of profiles available for the Cell Painting modality of each of the four datasets:
Folder | File name | Description |
---|---|---|
CellPainting | replicate_level_cp_augmented.csv | Aggregated and metadata-annotated profiles, which are the average of single-cell profiles in each well. |
CellPainting | replicate_level_cp_normalized.csv.gz | Normalized profiles, which are the z-scored aggregated profiles, where the scores are computed using the distribution of negative controls as the reference. |
CellPainting | replicate_level_cp_normalized_variable_selected.csv.gz | Normalized, variable-selected profiles, which are normalized profiles with feature selection applied. |
L1000 | replicate_level_l1k.csv | Aggregated and metadata-annotated profiles, which are the average of single-cell profiles in each well. |
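The normalization described in the table (z-scoring each feature against the distribution of negative controls) can be illustrated with a small sketch. This is not the pipeline that produced the released files: the function name `normalize_to_negcon`, the use of the mean and sample standard deviation as the reference statistics, and the choice to set zero-variance features to 0.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def normalize_to_negcon(profiles, is_control):
    """Z-score every feature using only the negative-control wells as reference.

    profiles   : list of rows, each row a list of aggregated feature values
    is_control : list of booleans, True where the row is a negative control
    Assumption: mean / sample standard deviation are used as the reference
    statistics; the released profiles may have used a different estimator.
    """
    controls = [row for row, flag in zip(profiles, is_control) if flag]
    n_features = len(profiles[0])

    # Per-feature centre and spread, estimated from control wells only.
    centers = [mean([row[j] for row in controls]) for j in range(n_features)]
    spreads = [stdev([row[j] for row in controls]) for j in range(n_features)]

    normalized = []
    for row in profiles:
        normalized.append([
            # Assumption: a feature that is constant in controls is mapped to 0.0.
            (value - center) / spread if spread > 0 else 0.0
            for value, center, spread in zip(row, centers, spreads)
        ])
    return normalized


# Toy data: three wells, two features; the first two wells are negative controls.
wells = [[1.0, 10.0], [3.0, 10.0], [5.0, 12.0]]
control_flags = [True, True, False]
z = normalize_to_negcon(wells, control_flags)
```

In the real files, the control wells would presumably be identified by the value `negcon` in the dataset's perturbation match column rather than by a hand-written list of flags.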
This spreadsheet contains a description of all the metadata fields across all 8 datasets.
Dataset | Perturbation match column (CP) | Perturbation match column (GE) | Control perturbation value in both CP and GE columns |
---|---|---|---|
CDRP-BBBC047-Bray | Metadata_Sample_Dose | pert_sample_dose | negcon |
CDRPBIO-BBBC036-Bray | Metadata_Sample_Dose | pert_sample_dose | negcon |
TA-ORF-BBBC037-Rohban | Metadata_broad_sample | pert_id | negcon |
LUAD-BBBC041-Caicedo | x_mutation_status | allele | negcon |
LINCS-Pilot1 | Metadata_pert_id_dose | pert_id_dose | negcon |
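The match columns in the table above are what allow CP and GE profiles of the same perturbation to be paired. The sketch below, using pandas, averages replicates per perturbation in each modality and keeps the perturbations present in both. The helper `pair_treatment_profiles`, the toy feature names (`Cells_AreaShape_Area`, `200814_at`), and the use of a plain mean for treatment-level aggregation are illustrative assumptions, not the repository's own code.

```python
import pandas as pd

# Match columns copied from the table above: dataset -> (CP column, GE column).
MATCH_COLUMNS = {
    "CDRP-BBBC047-Bray": ("Metadata_Sample_Dose", "pert_sample_dose"),
    "CDRPBIO-BBBC036-Bray": ("Metadata_Sample_Dose", "pert_sample_dose"),
    "TA-ORF-BBBC037-Rohban": ("Metadata_broad_sample", "pert_id"),
    "LUAD-BBBC041-Caicedo": ("x_mutation_status", "allele"),
    "LINCS-Pilot1": ("Metadata_pert_id_dose", "pert_id_dose"),
}


def pair_treatment_profiles(cp, ge, dataset):
    """Return one row per perturbation found in both modalities.

    Replicates are averaged per perturbation (an assumed aggregation),
    then the two tables are inner-joined on the perturbation identifier.
    """
    cp_col, ge_col = MATCH_COLUMNS[dataset]
    cp_mean = cp.groupby(cp_col).mean(numeric_only=True).rename_axis("perturbation")
    ge_mean = ge.groupby(ge_col).mean(numeric_only=True).rename_axis("perturbation")
    # Suffixes only apply if a feature name happens to occur in both tables.
    return cp_mean.join(ge_mean, how="inner", lsuffix="_cp", rsuffix="_ge")


# Toy replicate-level tables with hypothetical feature columns.
cp = pd.DataFrame({
    "Metadata_pert_id_dose": ["A_1", "A_1", "negcon", "B_1"],
    "Cells_AreaShape_Area": [1.0, 3.0, 0.0, 5.0],
})
ge = pd.DataFrame({
    "pert_id_dose": ["A_1", "negcon", "negcon", "C_1"],
    "200814_at": [0.5, 0.1, 0.3, 2.0],
})
paired = pair_treatment_profiles(cp, ge, "LINCS-Pilot1")
```

Here `B_1` and `C_1` are dropped because each appears in only one modality, while `negcon` is retained because the control value is the same in both match columns.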
Dataset | GE | CP normalized | CP normalized, variable selected |
---|---|---|---|
CDRP | 977 | 1565 | 727 |
CDRP-BIO | 977 | 1570 | 601 |
LUAD | 978 | 1569 | 291 |
TA-ORF | 978 | 1677 | 63 |
LINCS | 978 | 1670 | 119 |
We license the data, results, and figures as CC0 1.0 and the source code as BSD 3-Clause.