broadinstitute / 2015_Bray_GigaScience-data


Generate treatment-level spherized profiles for 30k Cell Painting Dataset #7

Open · shntnu opened this issue 1 year ago

shntnu commented 1 year ago

Background

Several collaborators have expressed interest in having treatment-level spherized (batch-corrected) profiles for all 30k compounds in the Bray et al. Cell Painting dataset. This would help researchers working on a range of projects, particularly machine learning engineers who may not be familiar with batch correction.

Task

  1. Use the following two notebooks as a reference to generate spherized profiles:

  2. Apply sphering and mean aggregation (batch correction) to obtain treatment-level profiles (one entry per compound per dose) for all 30k compounds in the original Bray et al. dataset; a sketch follows this list.

  3. Publish the resulting dataset in a machine-learning-friendly format, including SMILES.
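
A minimal sketch of step 2 with pycytominer (the input path and the metadata column names Metadata_broad_sample and Metadata_mmoles_per_liter are assumptions, not the actual dataset schema):

import pandas as pd
from pycytominer import aggregate, normalize

# Hypothetical input: well-level profiles for one batch.
well_df = pd.read_csv("well_level_profiles.csv.gz")

# Sphere (whiten) the features, fitting the transform on DMSO control wells.
spherized = normalize(
    profiles=well_df,
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # control query is an assumption
    method="spherize",
)

# Mean-aggregate replicate wells into one profile per compound per dose.
treatment_df = aggregate(
    population_df=spherized,
    strata=["Metadata_broad_sample", "Metadata_mmoles_per_liter"],  # assumed keys
    features="infer",
    operation="mean",
)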

Related information

(gpt summary)

Nikita: We have morphology data for the paper on Zenodo (https://zenodo.org/record/7729583). The pipeline used is sphering -> mean aggregation. We did not save batch-corrected well-level profiles for this study.

Srijit: Found "mobc.npz" file in Zenodo repository (https://zenodo.org/record/7729583) but no list of Cell Painting features. Asks if there is a place to find the ordered list of features.

Nikita: The list of features is not included in Zenodo. Directs to attached file and GitHub repository for feature list (https://github.com/carpenterlab/2023_Moshkov_NatComm/blob/main/analysis/01-PUMA_feature_transform.ipynb).
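For anyone loading that archive, a minimal sketch (the array keys inside mobc.npz are unknown here, and features.txt is a hypothetical export of the ordered feature list from that notebook):

import numpy as np
import pandas as pd

arrays = np.load("mobc.npz")  # downloaded from the Zenodo record
print(arrays.files)  # inspect which arrays the archive actually contains

# Hypothetical: an ordered feature list exported from the linked notebook.
feature_names = pd.read_csv("features.txt", header=None)[0].tolist()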

Juan: Features batch-corrected with the sphering technique don't have meaningful names, as they're linear combinations of the non-corrected features.

Srijit: After applying the spherize function in pycytominer, the dataset still has the same feature names. Asks whether the feature values merely keep the original length while the names no longer apply.

Juan: Corrected features don't correspond to the original names; they behave like reweighted principal components. Recommends opening an issue in pycytominer to anonymize features after sphering.
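
For intuition, a minimal numpy sketch of ZCA whitening (a toy illustration, not pycytominer's implementation) showing why each output column mixes all input columns:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # toy data: 500 samples, 3 features
Xc = X - X.mean(axis=0)

# ZCA whitening: W = Sigma^(-1/2), built from the eigendecomposition of the covariance.
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-6)) @ vecs.T  # small epsilon for stability

Z = Xc @ W  # each column of Z is a weighted sum of ALL original columns,
            # so the original feature names no longer describe single columns
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity matrix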

David-Araripe commented 1 year ago

Hi @shntnu, I did this by following Notebook 2 (lincs-cell-painting - spherize-batch-effects) that you indicated.

However, I was wondering if you have any recommendations for the other parameters of the pycytominer.normalize method, for example spherize_epsilon: should I just leave it at the default of 1e-6? Also, regarding the interpretability of the output features: do both ZCA and ZCA-cor preserve the meaning of the input Cell Painting features?
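
For reference, a minimal normalize call that makes these parameters explicit (df here stands in for a well-level profile table; as I understand it, spherize_epsilon is a regularizer that keeps near-zero-variance directions from blowing up the whitening transform):

from pycytominer import normalize

spherized = normalize(
    profiles=df,
    features="infer",
    meta_features="infer",
    samples="Metadata_cpd_name == 'DMSO'",
    method="spherize",
    spherize_method="ZCA-cor",  # pycytominer also accepts "ZCA", "PCA", "PCA-cor"
    spherize_epsilon=1e-6,      # the default; increase it if the transform is unstable
)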

Just for the record, this is the script I'm using to generate the spherized profiles:

import json
import re
from itertools import product
from pathlib import Path

import numpy as np
import pandas as pd
from pycytominer import feature_select, normalize
from pycytominer.cyto_utils import infer_cp_features, output


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    # The original snippet calls reduce_mem_usage without defining it; this
    # minimal stand-in just downcasts float columns to save memory.
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = df[col].astype("float32")
    return df

def assignCompoundData(df: pd.DataFrame):
    # Attach plate-map layout and compound annotations (including SMILES) to the profiles
    pattern = re.compile(r"Cells|Cytoplasm|Nuclei|Image|Metadata")
    plate_map_root = Path("metadata/platemaps/CDRP/platemap")
    plate_map_dict = (
        pd.read_csv(plate_map_root.parent / "barcode_platemap.csv")
        .set_index("Assay_Plate_Barcode")["Plate_Map_Name"]
        .to_dict()
    )
    cpds_df = pd.read_csv("metadata/external_metadata/cdrp_compounds.tsv", sep="\t")
    all_layouts = []
    for plate_id in df["Metadata_Plate"].unique():
        all_layouts.append(
            pd.read_csv(
                plate_map_root / f"{plate_map_dict[plate_id]}.txt", sep="\t"
            ).rename(columns={"well_position": "Metadata_Well"})
        )
    plate_layout = pd.concat(all_layouts, ignore_index=True).drop_duplicates()
    plate_layout = plate_layout.merge(cpds_df, on="broad_sample", how="left")
    # rename columns for better readability
    plate_layout = plate_layout.rename(
        columns={
            x: f"Metadata_{x}" if not pattern.findall(x) else x
            for x in plate_layout.columns
        }
    )
    plate_layout_meta = infer_cp_features(plate_layout, metadata=True)
    df_meta = infer_cp_features(df, metadata=True)
    # If the layout carries metadata columns that df lacks, merge them in,
    # joining on the metadata columns the two frames share (joining on the
    # missing columns themselves would raise a KeyError, since they don't
    # exist in df yet)
    missing_cols = np.setdiff1d(plate_layout_meta, df_meta).tolist()
    if len(missing_cols) > 0:
        shared_cols = np.intersect1d(plate_layout_meta, df_meta).tolist()
        df = df.merge(plate_layout, on=shared_cols, how="left")
    return df

in_dir = Path("profiles/CDRP/")

suffixes = [
    "_normalized.csv.gz",
    "_normalized_feature_select_all.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz",
    "_normalized_negcon.csv.gz",
]

output_suffixes = {
    "_normalized.csv.gz": "whole_plate.csv.gz",
    "_normalized_feature_select_all.csv.gz": "whole_plate_feature_select.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz": "dmso_feature_select.csv.gz",
    "_normalized_negcon.csv.gz": "dmso.csv.gz",
}

feature_select_ops = [
    "variance_threshold",
    # "correlation_threshold",
    "drop_na_columns",
    "blocklist",
]

sphering_algorithms = [
    "ZCA",
    "ZCA-cor",
]

na_cut = 0
corr_threshold = 0.95
freq_cut = 0.03  # for low variance filter
output_dir = "sphere_profiles"
full_blocklist_file = Path("consensus_blocklist.txt")

plate_batches = (
    pd.read_csv("metadata/platemaps/CDRP/barcode_platemap.csv")
    .assign(Assay_Plate_Barcode=lambda x: x["Assay_Plate_Barcode"].astype(str))
    .groupby("Plate_Map_Name")
    .agg({"Assay_Plate_Barcode": lambda x: x.tolist()})["Assay_Plate_Barcode"]
    .to_dict()
)

for suffix, algorithm in product(suffixes, sphering_algorithms):
    # Read a sample dataframe to get the metadata and feature columns
    sample_plates = list(plate_batches.values())[0]
    sample_file = pd.read_csv(  # Just to read the column names
        in_dir / f"{sample_plates[0]}/{sample_plates[0] + suffix}", nrows=1
    )
    all_features = infer_cp_features(sample_file, metadata=False)
    standard_metadata = infer_cp_features(sample_file, metadata=True)

    # Get all the plates to process; product() above already pairs each
    # suffix with each sphering algorithm, so no inner algorithm loop is needed
    plates_to_process = []
    for plates in plate_batches.values():
        plate_paths = [in_dir / f"{p}/{p + suffix}" for p in plates]
        plates_to_process.extend(plate_paths)
    output_file = Path(
        f"{output_dir}/{algorithm}_dmso_spherized_profiles"
        f"_with_input_normalized_by_{output_suffixes[suffix]}"
    )
    if output_file.exists():
        continue
    profile_df = pd.concat(
        [
            reduce_mem_usage(
                pd.read_csv(
                    p,
                    usecols=(standard_metadata + all_features),
                    low_memory=False,
                )
            )
            for p in plates_to_process
        ],
        ignore_index=True,
    ).reset_index(drop=True)
    print(profile_df.shape)

    # Step 0: Assign compound data if it's not already there and reorder cols
    profile_df = assignCompoundData(profile_df)

    metadata_cols = infer_cp_features(profile_df, metadata=True)
    feature_cols = infer_cp_features(profile_df, metadata=False)
    profile_df = profile_df[metadata_cols + feature_cols]

    # Step 1: Perform feature selection
    profile_df = feature_select(
        profiles=profile_df,
        operation=feature_select_ops,
        na_cutoff=na_cut,
        corr_threshold=corr_threshold,
        freq_cut=freq_cut,  # low-variance filter cutoff defined above
        blocklist_file=full_blocklist_file,
    )
    # Step 2: Spherize transform, fit on the DMSO (negative control) wells
    spherize_df = normalize(
        profiles=profile_df,
        features="infer",
        meta_features="infer",
        samples="Metadata_cpd_name == 'DMSO'",
        method="spherize",
        spherize_method=algorithm,
    )
    print(spherize_df.shape)
    # Step 3: Output profiles
    output(df=spherize_df, output_filename=output_file)
    # write out batches_dict to the output dir
    if not Path(f"{output_dir}/batches_dict.json").exists():
        with open(f"{output_dir}/batches_dict.json", "w") as f:
            json.dump(plate_batches, f)
Where plate_batches is a dictionary created from metadata/platemaps/CDRP/barcode_platemap.csv. The data frame looks like this:

  Plate_Map_Name Assay_Plate_Barcode
0   H-BIOA-004-3               24277
1   H-BIOA-007-3               24278

I then organize it so that I have the Plate_Map_Name as a key and the respective assay plates as values, so I can process them together. plate_batches then looks like this: {'C-2113-01-CMP-002': ['25739', '25740', '25741', '25742'], ...}

And cpds_df within assignCompoundData looks like the following:

             broad_sample       cpd_name cpd_name_type cpd_sample_id pert_type control_type dos_library             source_name chemist_name vendor_catalog_id user_comment                                                 smiles
0  BRD-A00037023-001-05-1  BRD-A00037023  BROAD_CPD_ID     SA1855877       trt          nan         nan            ChemDiv Inc.          nan         8016-0939          nan  CCOC(=O)C1=C(C)N=C2SC(=CC(=O)N2C1c1ccc(OC)cc1)C(=O)OC
1  BRD-A00051892-001-05-0  BRD-A00051892  BROAD_CPD_ID     SA1852153       trt          nan         nan  ChemBridge Corporation          nan           7537609          nan                          N#Cc1nc(oc1NCC1CCCO1)-c1cccs1
shntnu commented 11 months ago

@David-Araripe – sorry for the long gaps between responses.

Just a heads up that @johnarevalo will update us on the final pipeline he will use in the updated version of https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection; you can then use the same pipeline to reprocess this data (aggregated profiles --> batch-corrected profiles), assuming it is still relevant to you at that time.