broadinstitute / 2015_Bray_GigaScience-data


Generate treatment-level spherized profiles for 30k Cell Painting Dataset #7

Open · shntnu opened this issue 1 year ago

shntnu commented 1 year ago

Background

Several collaborators have expressed interest in having treatment-level spherized (batch-corrected) profiles for all 30k compounds in the Bray et al. Cell Painting dataset. This would help researchers working on a range of projects, particularly machine learning engineers who may not be familiar with batch correction.

Task

  1. Use the following two notebooks as a reference to generate spherized profiles:

  2. Apply sphering and mean aggregation (batch correction) to obtain treatment-level profiles (one entry per compound per dose) for all 30k compounds in the original Bray et al. dataset; a sketch follows this list.

  3. Publish the resulting dataset in a machine-learning-friendly format, including SMILES.
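
A minimal sketch of step 2 with pycytominer (the input path and the metadata column names Metadata_broad_sample and Metadata_mmoles_per_liter are assumptions, not the actual dataset schema):

import pandas as pd
from pycytominer import aggregate, normalize

# Hypothetical input: well-level profiles for one batch.
well_df = pd.read_csv("well_level_profiles.csv.gz")

# Sphere (whiten) the features, fitting the transform on DMSO control wells.
spherized = normalize(
    profiles=well_df,
    features="infer",
    meta_features="infer",
    samples="Metadata_broad_sample == 'DMSO'",  # control query is an assumption
    method="spherize",
)

# Mean-aggregate replicate wells into one profile per compound per dose.
treatment_df = aggregate(
    population_df=spherized,
    strata=["Metadata_broad_sample", "Metadata_mmoles_per_liter"],  # assumed keys
    features="infer",
    operation="mean",
)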

Related information

(gpt summary)

Nikita: We have morphology data for the paper on Zenodo (https://zenodo.org/record/7729583). The pipeline used is sphering -> mean aggregation. We did not save batch-corrected well-level profiles for this study.

Srijit: Found "mobc.npz" file in Zenodo repository (https://zenodo.org/record/7729583) but no list of Cell Painting features. Asks if there is a place to find the ordered list of features.

Nikita: The list of features is not included in Zenodo. Directs to attached file and GitHub repository for feature list (https://github.com/carpenterlab/2023_Moshkov_NatComm/blob/main/analysis/01-PUMA_feature_transform.ipynb).
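For anyone loading that archive, a minimal sketch (the array keys inside mobc.npz are unknown here, and features.txt is a hypothetical export of the ordered feature list from that notebook):

import numpy as np
import pandas as pd

arrays = np.load("mobc.npz")  # downloaded from the Zenodo record
print(arrays.files)  # inspect which arrays the archive actually contains

# Hypothetical: an ordered feature list exported from the linked notebook.
feature_names = pd.read_csv("features.txt", header=None)[0].tolist()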

Juan: Features batch-corrected with the sphering technique don't have meaningful names, as they're linear combinations of the non-corrected features.

Srijit: After applying the spherize function in pycytominer, the dataset still has the same feature names. Asks whether the feature values merely keep the original length while the names no longer apply.

Juan: Corrected features don't correspond to the original names; they behave like reweighted principal components. Recommends opening an issue in pycytominer to anonymize features after sphering.
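
For intuition, a minimal numpy sketch of ZCA whitening (a toy illustration, not pycytominer's implementation) showing why each output column mixes all input columns:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # toy data: 500 samples, 3 features
Xc = X - X.mean(axis=0)

# ZCA whitening: W = Sigma^(-1/2), built from the eigendecomposition of the covariance.
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-6)) @ vecs.T  # small epsilon for stability

Z = Xc @ W  # each column of Z is a weighted sum of ALL original columns,
            # so the original feature names no longer describe single columns
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity matrix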

David-Araripe commented 1 year ago

Hi @shntnu, I did this by following Notebook 2 (lincs-cell-painting - spherize-batch-effects) that you indicated.

However, I was wondering if you have any recommendations for the other parameters of the pycytominer.normalize method, for example spherize_epsilon: should I just leave it at the default of 1e-6? Also, regarding the interpretability of the output features: do both ZCA and ZCA-cor preserve the meaning of the input Cell Painting features?
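
For reference, a minimal normalize call that makes these parameters explicit (df here stands in for a well-level profile table; as I understand it, spherize_epsilon is a regularizer that keeps near-zero-variance directions from blowing up the whitening transform):

from pycytominer import normalize

spherized = normalize(
    profiles=df,
    features="infer",
    meta_features="infer",
    samples="Metadata_cpd_name == 'DMSO'",
    method="spherize",
    spherize_method="ZCA-cor",  # pycytominer also accepts "ZCA", "PCA", "PCA-cor"
    spherize_epsilon=1e-6,      # the default; increase it if the transform is unstable
)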

Just for the record, this is the script I'm using to generate the spherized profiles:

import json
import re
from itertools import product
from pathlib import Path

import numpy as np
import pandas as pd
from pycytominer import feature_select, normalize
from pycytominer.cyto_utils import infer_cp_features, output


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    # The original snippet calls reduce_mem_usage without defining it; this
    # minimal stand-in just downcasts float columns to save memory.
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = df[col].astype("float32")
    return df

def assignCompoundData(df: pd.DataFrame):
    # Attach plate-map layout and compound annotations (including SMILES) to the profiles
    pattern = re.compile(r"Cells|Cytoplasm|Nuclei|Image|Metadata")
    plate_map_root = Path("metadata/platemaps/CDRP/platemap")
    plate_map_dict = (
        pd.read_csv(plate_map_root.parent / "barcode_platemap.csv")
        .set_index("Assay_Plate_Barcode")["Plate_Map_Name"]
        .to_dict()
    )
    cpds_df = pd.read_csv("metadata/external_metadata/cdrp_compounds.tsv", sep="\t")
    all_layouts = []
    for plate_id in df["Metadata_Plate"].unique():
        all_layouts.append(
            pd.read_csv(
                plate_map_root / f"{plate_map_dict[plate_id]}.txt", sep="\t"
            ).rename(columns={"well_position": "Metadata_Well"})
        )
    plate_layout = pd.concat(all_layouts, ignore_index=True).drop_duplicates()
    plate_layout = plate_layout.merge(cpds_df, on="broad_sample", how="left")
    # rename columns for better readability
    plate_layout = plate_layout.rename(
        columns={
            x: f"Metadata_{x}" if not pattern.findall(x) else x
            for x in plate_layout.columns
        }
    )
    plate_layout_meta = infer_cp_features(plate_layout, metadata=True)
    df_meta = infer_cp_features(df, metadata=True)
    # If the layout carries metadata columns that df lacks, merge them in,
    # joining on the metadata columns the two frames share (joining on the
    # missing columns themselves would raise a KeyError, since they don't
    # exist in df yet)
    missing_cols = np.setdiff1d(plate_layout_meta, df_meta).tolist()
    if len(missing_cols) > 0:
        shared_cols = np.intersect1d(plate_layout_meta, df_meta).tolist()
        df = df.merge(plate_layout, on=shared_cols, how="left")
    return df

in_dir = Path("profiles/CDRP/")

suffixes = [
    "_normalized.csv.gz",
    "_normalized_feature_select_all.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz",
    "_normalized_negcon.csv.gz",
]

output_suffixes = {
    "_normalized.csv.gz": "whole_plate.csv.gz",
    "_normalized_feature_select_all.csv.gz": "whole_plate_feature_select.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz": "dmso_feature_select.csv.gz",
    "_normalized_negcon.csv.gz": "dmso.csv.gz",
}

feature_select_ops = [
    "variance_threshold",
    # "correlation_threshold",
    "drop_na_columns",
    "blocklist",
]

sphering_algorithms = [
    "ZCA",
    "ZCA-cor",
]

na_cut = 0
corr_threshold = 0.95
freq_cut = 0.03  # for low variance filter
output_dir = "sphere_profiles"
full_blocklist_file = Path("consensus_blocklist.txt")

plate_batches = (
    pd.read_csv("metadata/platemaps/CDRP/barcode_platemap.csv")
    .assign(Assay_Plate_Barcode=lambda x: x["Assay_Plate_Barcode"].astype(str))
    .groupby("Plate_Map_Name")
    .agg({"Assay_Plate_Barcode": lambda x: x.tolist()})["Assay_Plate_Barcode"]
    .to_dict()
)

for suffix, algorithm in product(suffixes, sphering_algorithms):
    # Read a sample dataframe to get the metadata and feature columns
    sample_plates = list(plate_batches.values())[0]
    sample_file = pd.read_csv(  # Just to read the column names
        in_dir / f"{sample_plates[0]}/{sample_plates[0] + suffix}", nrows=1
    )
    all_features = infer_cp_features(sample_file, metadata=False)
    standard_metadata = infer_cp_features(sample_file, metadata=True)

    # Get all the plates to process; product() above already pairs each
    # suffix with each sphering algorithm, so no inner algorithm loop is needed
    plates_to_process = []
    for plates in plate_batches.values():
        plate_paths = [in_dir / f"{p}/{p + suffix}" for p in plates]
        plates_to_process.extend(plate_paths)
    output_file = Path(
        f"{output_dir}/{algorithm}_dmso_spherized_profiles"
        f"_with_input_normalized_by_{output_suffixes[suffix]}"
    )
    if output_file.exists():
        continue
    profile_df = pd.concat(
        [
            reduce_mem_usage(
                pd.read_csv(
                    p,
                    usecols=(standard_metadata + all_features),
                    low_memory=False,
                )
            )
            for p in plates_to_process
        ],
        ignore_index=True,
    ).reset_index(drop=True)
    print(profile_df.shape)

    # Step 0: Assign compound data if it's not already there and reorder cols
    profile_df = assignCompoundData(profile_df)

    metadata_cols = infer_cp_features(profile_df, metadata=True)
    feature_cols = infer_cp_features(profile_df, metadata=False)
    profile_df = profile_df[metadata_cols + feature_cols]

    # Step 1: Perform feature selection
    profile_df = feature_select(
        profiles=profile_df,
        operation=feature_select_ops,
        na_cutoff=na_cut,
        corr_threshold=corr_threshold,
        freq_cut=freq_cut,  # low-variance filter cutoff defined above
        blocklist_file=full_blocklist_file,
    )
    # Step 2: Spherize transform, fit on the DMSO (negative control) wells
    spherize_df = normalize(
        profiles=profile_df,
        features="infer",
        meta_features="infer",
        samples="Metadata_cpd_name == 'DMSO'",
        method="spherize",
        spherize_method=algorithm,
    )
    print(spherize_df.shape)
    # Step 3: Output profiles
    output(df=spherize_df, output_filename=output_file)
    # write out batches_dict to the output dir
    if not Path(f"{output_dir}/batches_dict.json").exists():
        with open(f"{output_dir}/batches_dict.json", "w") as f:
            json.dump(plate_batches, f)
Where plate_batches is a dictionary created from metadata/platemaps/CDRP/barcode_platemap.csv. The data frame looks like this:

  Plate_Map_Name Assay_Plate_Barcode
0   H-BIOA-004-3               24277
1   H-BIOA-007-3               24278

I then organize it so that I have the Plate_Map_Name as a key and the respective assay plates as values, so I can process them together. plate_batches then looks like this: {'C-2113-01-CMP-002': ['25739', '25740', '25741', '25742'], ...}

And cpds_df within assignCompoundData looks like the following:

             broad_sample       cpd_name cpd_name_type cpd_sample_id pert_type control_type dos_library             source_name chemist_name vendor_catalog_id user_comment                                                 smiles
0  BRD-A00037023-001-05-1  BRD-A00037023  BROAD_CPD_ID     SA1855877       trt          nan         nan            ChemDiv Inc.          nan         8016-0939          nan  CCOC(=O)C1=C(C)N=C2SC(=CC(=O)N2C1c1ccc(OC)cc1)C(=O)OC
1  BRD-A00051892-001-05-0  BRD-A00051892  BROAD_CPD_ID     SA1852153       trt          nan         nan  ChemBridge Corporation          nan           7537609          nan                          N#Cc1nc(oc1NCC1CCCO1)-c1cccs1
shntnu commented 11 months ago

@David-Araripe – sorry for the long gaps between responses.

Just a heads up that @johnarevalo will update us on the final pipeline he will use in the updated version of https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection; you can then use the same pipeline to reprocess this data (aggregated profiles --> batch-corrected profiles), assuming it is still relevant to you at that time.