shntnu opened this issue 1 year ago
Hi @shntnu, I did this by following the notebook you pointed to (Notebook 2: lincs-cell-painting - spherize-batch-effects). However, I was wondering whether you have any recommendations for the other parameters of `pycytominer.normalize`, for example `spherize_epsilon`: should I just leave it at the default of `1e-6`? Also, regarding the interpretability of the output features: do both `ZCA-cor` and `ZCA` preserve the meaning of the input Cell Painting features?
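For reference, this is where that parameter would plug in (a minimal sketch; `1e-6` is the documented default, and the `samples` query matches the one in the full script below):

```python
from pycytominer import normalize

# Minimal sketch: explicit sphering parameters for pycytominer.normalize.
# spherize_epsilon regularizes the eigenvalues before inversion; 1e-6 is
# the documented default. profile_df is assumed to be a profiles DataFrame.
spherized = normalize(
    profiles=profile_df,
    features="infer",
    meta_features="infer",
    samples="Metadata_cpd_name == 'DMSO'",  # fit the transform on DMSO wells only
    method="spherize",
    spherize_method="ZCA-cor",  # or "ZCA", "PCA", "PCA-cor"
    spherize_epsilon=1e-6,
)
```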
Just for the record, this is the script I'm using to generate the spherized profiles:
```python
import json
import re
from itertools import product
from pathlib import Path

import numpy as np
import pandas as pd
from pycytominer import feature_select, normalize
from pycytominer.cyto_utils import infer_cp_features, output


def assignCompoundData(df: pd.DataFrame) -> pd.DataFrame:
    """Merge plate-map and compound metadata into the profiles dataframe."""
    pattern = re.compile(r"Cells|Cytoplasm|Nuclei|Image|Metadata")
    plate_map_root = Path("metadata/platemaps/CDRP/platemap")
    plate_map_dict = (
        pd.read_csv(plate_map_root.parent / "barcode_platemap.csv")
        .set_index("Assay_Plate_Barcode")["Plate_Map_Name"]
        .to_dict()
    )
    cpds_df = pd.read_csv("metadata/external_metadata/cdrp_compounds.tsv", sep="\t")
    all_layouts = []
    for plate_id in df["Metadata_Plate"].unique():
        all_layouts.append(
            pd.read_csv(
                plate_map_root / f"{plate_map_dict[plate_id]}.txt", sep="\t"
            ).rename(columns={"well_position": "Metadata_Well"})
        )
    plate_layout = pd.concat(all_layouts, ignore_index=True).drop_duplicates()
    plate_layout = plate_layout.merge(cpds_df, on="broad_sample", how="left")
    # Prefix non-feature columns with "Metadata_" for better readability
    plate_layout = plate_layout.rename(
        columns={
            x: f"Metadata_{x}" if not pattern.findall(x) else x
            for x in plate_layout.columns
        }
    )
    plate_layout_meta = infer_cp_features(plate_layout, metadata=True)
    df_meta = infer_cp_features(df, metadata=True)
    # Metadata columns present in the plate layout but missing from the profiles
    diff_cols = np.setdiff1d(plate_layout_meta, df_meta).tolist()
    if len(diff_cols) > 0:
        # Merge on the shared metadata columns to pull in the missing ones
        common_cols = np.intersect1d(plate_layout_meta, df_meta).tolist()
        df = df.merge(
            plate_layout[common_cols + diff_cols], on=common_cols, how="left"
        )
    return df


in_dir = Path("profiles/CDRP/")
suffixes = [
    "_normalized.csv.gz",
    "_normalized_feature_select_all.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz",
    "_normalized_negcon.csv.gz",
]
output_suffixes = {
    "_normalized.csv.gz": "whole_plate.csv.gz",
    "_normalized_feature_select_all.csv.gz": "whole_plate_feature_select.csv.gz",
    "_normalized_feature_select_negcon_all.csv.gz": "dmso_feature_select.csv.gz",
    "_normalized_negcon.csv.gz": "dmso.csv.gz",
}
feature_select_ops = [
    "variance_threshold",
    # "correlation_threshold",
    "drop_na_columns",
    "blocklist",
]
sphering_algorithms = [
    "ZCA",
    "ZCA-cor",
]
na_cut = 0
corr_threshold = 0.95
freq_cut = 0.03  # for the low-variance filter
output_dir = "sphere_profiles"
full_blocklist_file = Path("consensus_blocklist.txt")

# Map each plate-map name to the list of assay plates that share it
plate_batches = (
    pd.read_csv("metadata/platemaps/CDRP/barcode_platemap.csv")
    .assign(Assay_Plate_Barcode=lambda x: x["Assay_Plate_Barcode"].astype(str))
    .groupby("Plate_Map_Name")
    .agg({"Assay_Plate_Barcode": lambda x: x.tolist()})["Assay_Plate_Barcode"]
    .to_dict()
)

for suffix, algorithm in product(suffixes, sphering_algorithms):
    # Read a sample file just to get the metadata and feature column names
    sample_plates = list(plate_batches.values())[0]
    sample_file = pd.read_csv(
        in_dir / f"{sample_plates[0]}/{sample_plates[0] + suffix}", nrows=1
    )
    all_features = infer_cp_features(sample_file, metadata=False)
    standard_metadata = infer_cp_features(sample_file, metadata=True)

    # Gather all the plates to process for this suffix
    plates_to_process = []
    for plates in plate_batches.values():
        plates_to_process.extend(in_dir / f"{p}/{p + suffix}" for p in plates)

    output_file = Path(
        f"{output_dir}/{algorithm}_dmso_spherized_profiles"
        f"_with_input_normalized_by_{output_suffixes[suffix]}"
    )
    if output_file.exists():
        continue

    profile_df = pd.concat(
        [
            # reduce_mem_usage: helper that downcasts numeric dtypes (see note below)
            reduce_mem_usage(
                pd.read_csv(
                    p,
                    usecols=(standard_metadata + all_features),
                    low_memory=False,
                )
            )
            for p in plates_to_process
        ],
        ignore_index=True,
    )
    print(profile_df.shape)

    # Step 0: Assign compound data if it's not already there and reorder columns
    profile_df = assignCompoundData(profile_df)
    metadata_cols = infer_cp_features(profile_df, metadata=True)
    feature_cols = infer_cp_features(profile_df, metadata=False)
    profile_df = profile_df[metadata_cols + feature_cols]

    # Step 1: Perform feature selection
    profile_df = feature_select(
        profiles=profile_df,
        operation=feature_select_ops,
        na_cutoff=na_cut,
        corr_threshold=corr_threshold,
        freq_cut=freq_cut,
        blocklist_file=full_blocklist_file,
    )

    # Step 2: Spherize transform, fit on the DMSO wells only
    spherize_df = normalize(
        profiles=profile_df,
        features="infer",
        meta_features="infer",
        samples="Metadata_cpd_name == 'DMSO'",
        method="spherize",
        spherize_method=algorithm,
    )
    print(spherize_df.shape)

    # Step 3: Output profiles
    output(df=spherize_df, output_filename=output_file)

# Write out the batches dict to the output dir
if not Path(f"{output_dir}/batches_dict.json").exists():
    with open(f"{output_dir}/batches_dict.json", "w") as f:
        json.dump(plate_batches, f)
```
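(`reduce_mem_usage` is not defined in the snippet above; a minimal stand-in, assuming it only downcasts numeric columns to smaller dtypes to save memory, could look like this. The actual helper may differ.)

```python
import numpy as np
import pandas as pd


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their values."""
    for col in df.select_dtypes(include=[np.number]).columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        else:
            # Note: downcasting floats to float32 trades precision for memory
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df
```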
Where `plate_batches` is a dictionary created from `metadata/platemaps/CDRP/barcode_platemap.csv`. The data frame looks like this:

|   | Plate_Map_Name | Assay_Plate_Barcode |
|---|---|---|
| 0 | H-BIOA-004-3 | 24277 |
| 1 | H-BIOA-007-3 | 24278 |

I then organize it so that `Plate_Map_Name` is the key and the respective assay plates are the values, so that plates sharing a layout are processed together. `plate_batches` then looks like this: `{'C-2113-01-CMP-002': ['25739', '25740', '25741', '25742']}`
And `cpds_df` within `assignCompoundData` looks like the following:

|   | broad_sample | cpd_name | cpd_name_type | cpd_sample_id | pert_type | control_type | dos_library | source_name | chemist_name | vendor_catalog_id | user_comment | smiles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BRD-A00037023-001-05-1 | BRD-A00037023 | BROAD_CPD_ID | SA1855877 | trt | nan | nan | ChemDiv Inc. | nan | 8016-0939 | nan | CCOC(=O)C1=C(C)N=C2SC(=CC(=O)N2C1c1ccc(OC)cc1)C(=O)OC |
| 1 | BRD-A00051892-001-05-0 | BRD-A00051892 | BROAD_CPD_ID | SA1852153 | trt | nan | nan | ChemBridge Corporation | nan | 7537609 | nan | N#Cc1nc(oc1NCC1CCCO1)-c1cccs1 |
@David-Araripe – sorry for the long gaps between responses.
Just a heads up that @johnarevalo will update us on the final pipeline he will use in the updated version of https://github.com/carpenter-singh-lab/2023_Arevalo_BatchCorrection; you can then use the same pipeline to reprocess this data (aggregated profiles --> batch-corrected profiles), assuming it is still relevant to you at that time.
Background
Several collaborators have expressed interest in having treatment-level spherized (batch-corrected) profiles for all 30k compounds in the Bray et al. Cell Painting dataset. This would help researchers working on a variety of projects, particularly machine learning engineers who may not be familiar with batch correction.
Task
Use the following two notebooks as a reference to generate spherized profiles:

1. Apply sphering and mean aggregation (batch correction) to obtain treatment-level profiles (one entry per compound per dose) for all 30k compounds in the original Bray et al. dataset; a sketch of the aggregation step follows this list.
2. Publish the resulting dataset in a machine learning friendly format, including SMILES.
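A rough sketch of the aggregation step (the input name `spherized_well_profiles` and the metadata columns `Metadata_broad_sample` and `Metadata_mmoles_per_liter` are assumptions about the Bray et al. profiles, not confirmed here):

```python
from pycytominer import aggregate

# Collapse batch-corrected well-level profiles to one row per compound per dose.
# spherized_well_profiles and the two Metadata_* column names are assumptions.
treatment_profiles = aggregate(
    population_df=spherized_well_profiles,
    strata=["Metadata_broad_sample", "Metadata_mmoles_per_liter"],
    features="infer",
    operation="mean",  # the thread below reports sphering -> mean aggregation
)
```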
Related information
(GPT summary)

- Nikita: We have the morphology data for the paper on Zenodo (https://zenodo.org/record/7729583). The pipeline used is sphering -> mean aggregation. We did not save batch-corrected well-level profiles for this study.
- Srijit: Found the "mobc.npz" file in the Zenodo repository (https://zenodo.org/record/7729583) but no list of Cell Painting features. Asks if there is a place to find the ordered list of features.
- Nikita: The list of features is not included on Zenodo. Points to an attached file and the GitHub repository for the feature list (https://github.com/carpenterlab/2023_Moshkov_NatComm/blob/main/analysis/01-PUMA_feature_transform.ipynb).
- Juan: Features batch-corrected with the sphering technique don't have meaningful names, since they are linear combinations of the non-corrected features.
- Srijit: After applying the spherize function in pycytominer, the dataset keeps the same feature names. Asks whether the corrected feature vectors only match the originals in length, not in meaning.
- Juan: Corrected features don't correspond to the original names; they are like reweighted principal components. Recommends opening an issue in pycytominer to anonymize feature names after sphering.
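A small NumPy illustration of Juan's point (plain ZCA whitening, not pycytominer's implementation): the whitening matrix is dense, so every output column mixes all input features and the original column names no longer describe the values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X = X @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.5],
                  [0.0, 0.0, 1.0]])  # introduce correlations between features

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
epsilon = 1e-6  # plays the same role as spherize_epsilon in pycytominer
W = vecs @ np.diag(1.0 / np.sqrt(vals + epsilon)) @ vecs.T  # ZCA whitening matrix

X_sphered = Xc @ W
print(np.round(np.cov(X_sphered, rowvar=False), 3))  # ~identity covariance
print(np.round(W, 2))  # dense: each output is a mix of all inputs
```

ZCA keeps the output as close as possible to the original feature space among whitening transforms, which is why the columns line up positionally, but the values are still linear mixtures, so the feature names should be treated as positional labels only.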