Developer: Elizabeth Aslinger (easlinger)
Correspondence: elizabeth.aslinger@aya.yale.edu
Open a Unix terminal (often Ctrl + Alt + T on Linux).
Install conda environment from .yml file (replace "env-corescpy"
with desired environment name):
conda create -n corescpy python=3.10.4 # create python environment
Activate the conda environment with conda activate corescpy
.
Clone the repository to your local computer:
git clone git@github.com:ChoBioLab/corescpy.git
,
git clone https://github.com/ChoBioLab/corescpy.git
, or
look above for the green "Code" button and press it for instructions.
Navigate to the repository directory (replace cd <DIRECTORY>
Install
Install the package with pip. (Ensure you have pip installed.)
pip install .
If you have issues with resolving/finding the most up-to-date version of the spatialdata
and/or spatialdata-io
packages, try running:
pip install git+https://github.com/scverse/spatialdata
pip install git+https://github.com/scverse/spatialdata-io
in your terminal while in your conda environment, then re-try step (6). If you have an M1 Mac, see this thread about known compatibility issues with pertpy
if you have issues with the install.
If you're planning to use this environment with Jupyter notebooks, run conda install nb_conda_kernels
, then pip install ipykernel
.
If you have issues importing modules or functions (particularly if it only happens if you don't run the import after launching python
while you are in the corescpy
directory), try mv <CONDA_ENV_PATH>/site-packages/_corescpy.pth <CONDA_ENV_PATH>/_corescpy.pth.bak
(replacing /home/elizabeth/elizabeth/miniconda3/envs/corescpy/lib/python3.10/site-packages
), then pip uninstall corescpy
then cd corescpy
(replace "corescpy" with path to your corescpy top-level directory if needed) then pip install -e .
. Then try cd
to return to your home directory, then python -c "import corescpy; print(dir(corescpy))"
from your terminal. Make sure it prints out submodules (e.g., analysis
), and not just the base attributes (e.g., __doc__
).
** Note: To use GPU resources, use conda install -c rapidsai -c nvidia -c conda-forge cugraph cuml cudf
and install the gpu version of coreSCpy (which should pip install scanpy[rapids]
).
** Tip: If you run out of space, run:
pip cache purge
conda clean -i
conda clean -t
If you have issues seeing the environment when choosing the kernel for your Jupyter notebook:
conda install nb_conda_kernels
You can now load corescpy
like any other distributed Python package.
Open a Python terminal and type:
import corescpy as cr
You can now call functions from the analysis module using
cr.ax.<FUNCTION>()
, from the preprocessing using cr.ax.pp...
, etc.
in Python; however, you are most likely to interact with the Omics
class object, or specialized classes that inherit from it, such as Crispr
and Spatial
.
class object. Here is example code you might run (replacing things in < > brackets with your specifications):
self = cr.Omics(<data_object_or_directory>, <...>)
or
self = cr.Crispr(<data_object_or_directory>, <...>)
or
self = cr.Spatial(<data_object_or_directory>, <...>)
and then run workflows, such as
self.preprocess(<...>)
self.cluster(<...>)
self.annotate_clusters("<CellTypist model.pkl>")
self.plot(kind=["heat", "matrix", "umap"])
etc.
Here are the methods (applicable to scRNA-seq generally, not just perturbations) in order of a typical workflow (replace ... with argument specifications):
self.preprocess(...)
: Perform filtering, normalization, scaling, quality control analysis, selection for highly variable genes, regressing out confounds, and intial exploration of the data (e.g., cell counts, mitochondrial counts, etc.).self.cluster(...)
: Perform dimensionality reduction (e.g., PCA) and clustering (Louvain or Leiden), as well as related visualization.self.plot(...)
: Create additional basic plots (e.g., dot, matrix, and violin gene expression plots). Use self.plot_compare(...)
to make plots comparing gene expression across groups.The following perturbation-specific methods can be executed optionally and in any order:
Here is an example workflow to analyze spatial data (after preprocessing and clustering as described above):
self.calculate_centrality(n_jobs=4)
self.find_cooccurrence(figsize=(60, 20), kws_plot=dict(wspace=3))
self.find_svgs(genes=genes, method="moran", n_perms=10, kws_plot=dict(
legend_fontsize="large"), figsize=(15, 15))
self.calculate_receptor_ligand(col_condition=False, p_threshold=0.001,
remove_ns=True, figsize=(20, 20))
self.run_augur(...)
: Score and plot how strongly different cell types responded to perturbation(s). This score is operationalized as the accuracy with which a machine learning model can use gene expression data to predict the perturbation condition to which cells of a given type belong. Augur provides scores aggregated across cells of a given type rather than for individual cells.self.run_mixscape(...)
: Quantify and plot the extent to which individual cells responded to CRISPR perturbation(s), and identify which perturbation condition cells were not detectibly perturbed in terms of their gene expression.self.compute_distance(...)
: Calculate and visualize various distance metrics that quantify the similarity in gene expression profiles across perturbation conditions.self.run_composition_analysis(...)
: Analyze and visualize shifts in the cell type composition across perturbation conditions.self.run_dialogue(...)
: Create plots showing multi-cellular programs.Argument Conventions:
Certain arguments used throughout the corescpy
package (including outside the corescpy.crispr_class.Crispr()
class), hold to conventions intended to foster readability, maintain consistency, and promote clarity, both for end-users and future developers.
col_
and key_
.adata.obs
or .adata.var
), while the "key" prefix means you're meant to specify a type of entry in a column. For instance, assume the column "condition" contains the names of different experimental conditions (drug A, drug B, drug C). In a function where you want to compare, for instance, drug A vs. control, you would specify key_treatment="drug A"
and k.adata
attribute immediately upon creating the AnnData object from the data file) orcol_control
corresponding to cells that have control guide RNA IDs in col_guide_rna
), for aesthetics, customizability to your design/interpretability, and/or to avoid duplicating pre-existing names.._columns
and ._keys
, respectively.run_dialogue()
method, you can specify in that method (run_dialogue(col_cell_type="majority_voting")
) without changing the attribute (self._columns
) that contains your original specification here.col_perturbed
(binary) vs. col_condition
(can have >= 3 categories)
Crispr
class object, col_perturbed
is meant to be a binary column that has key_control
as the entry for control rows and key_treatment
for all other experimental conditions.
col_perturbed
would contain only key_treatment
(i.e., all perturbed cells, regardless of targeted gene) while col_condition
would contain entries specifying the particular gene(s) targeted (or key_control
).col_perturbed
would contain only key_treatment
(i.e., all perturbed cells, regardless of targeted gene) while col_condition
would contain entries specifying the particular gene(s) targeted (or key_control
).Crispr
class object, it is created during object initialization as a column (named after your specification of col_perturbed
) in .obs
. All rows in .obs[col_condition]
that do not = key_control
will be set as key_treatment
.corescpy
package more broadly, if a function calls for a col_perturbed
argument, that indicates that it works with binary categories only. If it is fed a column with three or more categories, it will either subset the data to include only rows where that column = key_treatment
or key_control
(desirable behavior if you want to compare only a subset of the existing conditions, but undesirable if you want to look at, say, any drug vs. control, where the desired "drug" category consists of rows where the column = "drug A" and "drug B"), or it will throw an error.col_guide_rna
file_path
(str, AnnData, or dictionary): Path or object containing data. Used in initialization to create the initial self.adata
attribute (an AnnData or MuData object). Either
AnnData
, MuData
, or SpatialData
object (e.g., already loaded with the appropriate scverse
packages, or by using corescpy.pp.create_object(file_path)
), orcorescpy.pp.combine_matrix_protospacer()
(in order to load information about perturbations from other file(s); press the arrow to expand details here),or
col_sample_id
) consisting of whatever objects you would pass to create_object()
's file
argument for the individual objects. You must also specify col_sample
(a tuple as described in the documentation below). The other arguments passed to the corescpy.pp.create_object()
function (e.g., col_gene_symbols
) can be specified as normal if they are common across samples; otherwise, specify them as lists in the same order as the file
dictionary.assay
(str, optional): Name of the gene expression assay if loading a multi-modal data object (e.g., "rna"). Defaults to None (i.e., self.adata
is single-modal).
assay_protein
(str, optional): Name of the assay containing the protein expression modality, if available. For instance, if "adt", self.adata["adt"]
would be expected to contain the AnnData object for the protein modality. ONLY FOR MULTI-MODAL DATA for certain bonus visualization methods. Defaults to None.
col_gene_symbols
(str, optional): Column name in .var
for gene symbols. Defaults to "gene_symbols".
col_cell_type
(str, optional): Column name in .obs
for cell type. Defaults to "leiden" (anticipating that you will run self.cluster(...)
with method_cluster="leiden"
). This column may be
Crispr.cluster()
.col_sample_id
(str or tuple, optional): Column in .obs
with sample IDs. Defaults to "standard_sample_id". If this column does not yet exist in your data and needs to be created by concatenating datasets, you must provide file_path
as a dictionary keyed by desired col_sample_id
values as well as signal that this needs to happen by specifying col_sample_id as a tuple, with the second element containing a dictionary of keyword arguments to pass to AnnData.concatenate()
or None (to use defaults).
col_batch
(str, optional): Column in .obs
with batch IDs. Defaults to None.
col_condition
(str, optional): Either the name of an existing column in .obs
indicating the experimental condition to which each cell belongs or (for CRISPR designs) the desired name of the column that will be created from col_guide_rna
to indicate the gene(s) targeted in each cell.
col_perturbed
(a binary treated/perturbed vs. control indicator).col_guide_rna
is specified), this column will be where each guide RNA's target gene will be stored, whether pre-existing (copied directly from col_guide_rna
if kws_process_guide_rna
is None) or created during the Crispr object initialization by passing col_guide_rna
and kws_process_guide_rna
to corescpy.pp.filter_by_guide_counts()
in order to convert particular guide RNA IDs to their target(s) (e.g., STAT1-1|CCL8-2-1|NegCtrl32a => STAT1|CCL8|Control). Defaults to None.key_control
, but it can have multiple categories of other entries that all translate to key_treatment
in col_perturbed
.file_path
argument of the Crispr
class initialization method after adding a separate column with a name different from those specified in any of the other column arguments. You can then pass that column name manually to certain functions' col_control
arguments and specify the particular control condition in key_control
.key_control
and key_treatment
, as well as col_perturbed
or col_condition
(for methods that don't require binary labeling), can be newly-specified so that you can compare different conditions within this column. If the argument is named col_perturbed
, passing a column with more than two categories usually results in subsetting the data to compare only the two conditions specified in key_treatment
and key_control
. The exception is where there is a key_treatment_list
or similarly-named argument.col_perturbed
(str, optional): Column in .obs
where class methods will be able to find the binary experimental condition variable. It will be created during Crispr
object initialization as a binary version of col_condition
. Defaults to "perturbation". For CRISPR designs, all entries containing the patterns specified in kws_process_guide_rna["key_control_patterns"]
will be changed to key_control
, and all cells with targeting guides will be changed to key_treatment
.col_guide_rna
(str, optional): Column in .obs
with guide RNA IDs. Defaults to "guide_ids". This column should always be specified for CRISPR designs and should NOT be specified for other designs.
key_treatment
should be the name of the gene targeted. Then, col_condition
will be a direct copy of this column.self.adata.var_names
(or key_control
or one of the patterns in kws_process_guide_rna["key_control_patterns"]
), plus, optionally, suffixes separating guide #s (e.g., STAT1-1-2, CTRL-1) and/or with a character that splits separate guide RNAs within that cell (if multiply-transfected cells are present). These characters should be specified in kws_process_guide_rna["guide_split"]
and kws_process_guide_rna["feature_split"]
, respectively. For instance, they would be "-" and "|", if col_guide_rna
entries for a cell multiply transfected by two sgRNAs targeting STAT1, two control guide RNAs, and a guide targeting CCL5 would look like "STAT1-1-1|STAT1-1-2|CNTRL-1-1|CCL5-1".col_target_genes
, so keep that in mind.<col_guide_rna>_original
if kws_process_guide_rna
is not None, as that will result in a processed version of this column being stored under self.adata.obs[<col_guide_rna>]
.col_num_umis
(str, optional): Name of column in .obs
with the UMI counts. This should be specified if kws_process_guide_rna
is not None. For designs with multiply-transfected cells, it should follow the same convention established in kws_process_guide_rna["feature_split"]
. Defaults to "num_umis".
key_control
(str, optional): The label that is or will be in col_condition
, col_guide_rna
, and col_perturbed
indicating control rows. Defaults to "NT". Either
.obs[<col_guide_rna>]
and kws_process_guide_rna["key_control_patterns"]
) to be categorized as control rows under the new version(s) of .obs[<col_guide_rna>]
, .obs[<col_target_genes>]
, and/or .obs[<col_perturbed>]
. For instance, entries like "CNTRL-1", "NEGCNTRL", "Control", etc. in col_guide_rna
would all be keyed as "Control" in (the new versions of) col_target_genes
, col_guide_rna
, and col_perturbed
if you specify key_control="Control
and kws_process_guide_rna=dict(key_control_patterns=["CTRL", "Control"])
.key_treatment
(str, optional): What entries in col_perturbed
indicate a treatment condition (e.g., drug administration, CRISPR knock-out/down) as opposed to a control condition? This name will also be used for Mixscape classification labeling. Defaults to "KO".key_nonperturbed
(str, optional): What will be stored in the mixscape_class_global
and related columns/labels after running Mixscape methods. Indicates cells without a detectible perturbation. Defaults to "NP".
kws_process_guide_rna
(dict, optional): Dictionary of keyword arguments to pass to corescpy.pp.filter_by_guide_counts()
. (See below and corescpy.processing.preprocessing documentation). Defaults to None (no processing will take place, in which case BE SURE THAT col_target_genes
already exists in the data once loaded and contains the already-filtered, summed up, generic gene-named, etc. versions of the guide RNA column). Keys of this dictionary should be:
col_guide_rna
column include NEGCNTRL
and Control.D
, you should specify ['Control', 'CNTRL'] (assuming no non-control sgRNA names contain those patterns). If blank entries should be interpreted as control guides, then include np.nan/numpy.nan in this list. Defaults to None, which turns to [np.nan].max_percent_umis_control_drop
(int, optional): If control UMI counts are $<=$ this percentage of the total counts for that cell, and if a non-control sgRNA is also present and meets other filtering criteria, then consider that cell pseudo-single-transfected (non-control gene). Defaults to 75.min_percent_umis
(int, optional): sgRNAs with counts below this percentage will be considered noise for that guide. Defaults to 40.feature_split
(str, optional): For designs with multiple guides, the character that splits guide names in col_guide_rna
. For instance, "|" for STAT1-1|CNTRL-1|CDKN1A
. Defaults to "|". If only single guides, you should set to None.guide_split
(str, optional): The character that separates guide (rather than gene target)-specific IDs within gene. For instance, guides targeting STAT1 may include STAT1-1, STAT1-2-1, etc.; the argument would be "-" so the function can identify all of those as targeting STAT1. Defaults to "-".remove_multi_transfected
(bool, optional): In designs with multiple guides per cell, remove multiply-transfected cells (i.e., cells where more than one target guide survived
application of any filtering criteria set in kws_process_guide_rna
). If kws_process_guide_rna["max_percent_umis_control_drop"]
is greater than 0, then cells with one target guide and control guides which together make up less than max_percent_umis_control_drop
% of total UMI counts will be considered pseudo-single-transfected for the target guide. Defaults to True. Some functionality may be limited and/or problems occur if set to False and if multiply-transfected cells remain in data.The corescpy.crispr_object.Crispr()
class object is an end user's main way of interacting with the package as a whole. (See above for an overview of the workflow.) See the notebooks in /examples for additional help.
.adata
: AnnData object. Columns or other objects created in the course of running certain methods may also be stored in its various attributes. Below are listed some of the major attributes of .adata
. Note that for multi-modal data (self._assay is not None), some of these attributes may need to be accessed by .adata[self._assay].<attribute>
, but for brevity, we'll refer to .adata
here. Not all will/have to be present, except .X
, .obs
, and .var
.
.X
: Sparse matrix of data originally passed to the function to create an AnnData object (e.g., from CellRanger output)..layers
: Contains different versions of adata.X
, such as scaled data (adata.layers["scaled"]
) or that created by calculating a
Mixscape perturbation signature (adata.layers["X_pert"]
, by default)..obs
: pandas.DataFrame of observations (rows=cells). You can store additional data about individual cells/samples/etc. here by assigning a new column..obsm
: xxxxxxxxxxxxxxxxxxxxxxxxxx.obsm
: xxxxxxxxxxxxxxxxxxxxxxxxxx.var
: pandas.DataFrame of observations (rows=cells). You can store additional data about individual cells/samples/etc. here by assigning a new column.
Often contains the gene symbols and EnsemblIDs (either of which is often the index/.var_names
),
"feature_types" (e.g., "Gene Expression"), and, after preprocessing, may contain columns such as the number of cells expressing that feature ("n_cells"),
whether that feature is a mitochonrial ("mt") and/or highly variable ("highly_variable") gene, mean and total counts, percent dropout, means, dispersions,
and normalized versions of these metrics..obs_names
: Row indices of .obs
(e.g., cell barcodes). Changing this attribute changes this index, and has other potential benefits/consequences..var_names
: Row indices of .var
(i.e., gene names). Changing this attribute changes this index, and has other potential benefits/consequences..n_obs
: Number of observations (i.e., cells)..n_vars
: Number of features (i.e., genes/proteins/etc.)..results
: A dictionary with many of the dataframes, arrays, and other non-plot-based output of analyses..figures
: A dictionary with figures from output of your analyses..info
: Contains miscellaneous information about your data, the Crispr object (e.g., its analysis history), argument specifications, etc.The AnnData object is stored in the attribute adata
, so if your object is called self
, you can access it using self.adata
. (For examples going forward, we will assume the object is called self
, but you can substitute any name you want by assigning the Crispr object to some other name instead.)
If you have multiple modalities, you can access the gene expression modality using either self.adata[self._assay]
(having specified assay=the name of the RNA modality in your AnnData, which is usually "rna," in the Cripsr()
initialization method call when you first create your object) or using the alias self.rna
.
Thus, if you have multi-modal data in self.adata
, it's convenient to access the AnnData attributes specifically of your AnnData's gene expression modality using, for instance, the alias self.rna.obs
instead of the long-form self.adata[self._assay].obs
.
These aliases are not only convenient for their brevity, but also allow for a more generalizable way to call specific objects. For instance, if you wanted to write a script that frequently calls the .obs
attribute of the RNA data, and you want it to work for both uni- and multi-modal data, instead of repeatedly writing, for example:
if self._assay is None:
custom_function(self.adata[self._assay].obs)
else:
custom_function(self.adata[self._assay].obs)
you may simply say self.rna.obs
, knowing it will work whether or not multiple assays exist in the object's AnnData attribute.
Finally, this approach saves memory: All these versions of the attribute are stored in a single place in memory so you can call the attributes in various ways without duplicating them and taking up more space.