ChoBioLab / corescpy

A wrapper to simplify the execution of Single Cell analysis with the sc verse

coreSCpy Pipeline

Developer: Elizabeth Aslinger (easlinger)

Correspondence: elizabeth.aslinger@aya.yale.edu

Jira Epic


Installation

  1. Open a Unix terminal (often Ctrl + Alt + T on Linux).

  2. Create the conda environment (replace "corescpy" with your desired environment name): conda create -n corescpy python=3.10.4

  3. Activate the conda environment with conda activate corescpy.

  4. Clone the repository to your local computer: git clone git@github.com:ChoBioLab/corescpy.git, git clone https://github.com/ChoBioLab/corescpy.git, or look above for the green "Code" button and press it for instructions.

  5. Navigate to the repository directory (replace with your path): cd <DIRECTORY>

  6. Install the package with pip (ensure you have pip installed): pip install .

  7. If you have issues with resolving/finding the most up-to-date version of the spatialdata and/or spatialdata-io packages, try running:

    pip install git+https://github.com/scverse/spatialdata
    pip install git+https://github.com/scverse/spatialdata-io

    in your terminal while in your conda environment, then retry step 6. If you have an M1 Mac and the install fails, see this thread about known compatibility issues with pertpy.

  8. If you're planning to use this environment with Jupyter notebooks, run conda install nb_conda_kernels, then pip install ipykernel.

If you have issues importing modules or functions (particularly if the import only works when you launch python from inside the corescpy directory), try the following:

  1. Run mv <CONDA_ENV_PATH>/site-packages/_corescpy.pth <CONDA_ENV_PATH>/_corescpy.pth.bak (replacing the placeholder with your conda site-packages path, e.g., /home/elizabeth/elizabeth/miniconda3/envs/corescpy/lib/python3.10/site-packages).

  2. Run pip uninstall corescpy.

  3. Run cd corescpy (replace "corescpy" with the path to your corescpy top-level directory if needed), then pip install -e .

  4. Run cd to return to your home directory, then python -c "import corescpy; print(dir(corescpy))" from your terminal. Make sure it prints out submodules (e.g., analysis), and not just the base attributes (e.g., __doc__).
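The final check above can also be run from a Python session. The following is a minimal sketch, not part of corescpy: `describe_package` is a hypothetical helper that reports which file a package was imported from and which public attributes it exposes, which helps spot an import that resolves to the wrong location.

```python
import importlib


def describe_package(name):
    """Return where a package was imported from and its public attributes."""
    module = importlib.import_module(name)
    # Keep only names that do not start with an underscore (drops __doc__ etc.).
    public = sorted(a for a in dir(module) if not a.startswith("_"))
    return {"file": getattr(module, "__file__", None), "public": public}


if __name__ == "__main__":
    try:
        info = describe_package("corescpy")
    except ImportError as err:
        print(f"corescpy is not importable: {err}")
    else:
        print("Imported from:", info["file"])
        print("Public attributes:", info["public"])
```

If the list of public attributes is empty or lacks submodules such as analysis, the package is probably being resolved from a stale path.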

** Note: To use GPU resources, run conda install -c rapidsai -c nvidia -c conda-forge cugraph cuml cudf and install the GPU version of coreSCpy (which should install scanpy[rapids] via pip).

** Tip: If you run out of space, run:

pip cache purge
conda clean -i
conda clean -t

If you have issues seeing the environment when choosing the kernel for your Jupyter notebook:

conda install nb_conda_kernels

Usage

  1. You can now load corescpy like any other distributed Python package. Open a Python terminal and type: import corescpy as cr

  2. You can now call functions from the analysis module using cr.ax.<FUNCTION>(), from the preprocessing module using cr.pp..., etc. in Python; however, you are most likely to interact with the Omics class object or with specialized classes that inherit from it, such as Crispr and Spatial. Here is example code you might run (replacing things in < > brackets with your specifications):

    self = cr.Omics(<data_object_or_directory>, <...>)

    or

    self = cr.Crispr(<data_object_or_directory>, <...>)

    or

    self = cr.Spatial(<data_object_or_directory>, <...>)

and then run workflows, such as

self.preprocess(<...>)
self.cluster(<...>)
self.annotate_clusters("<CellTypist model.pkl>")
self.plot(kind=["heat", "matrix", "umap"])

etc.
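For scripted use, the calls above can be wrapped in one function. This is only a sketch: `run_basic_workflow` and its parameter names are hypothetical, and it assumes the object exposes the four methods shown above and that `preprocess` and `cluster` accept whatever keywords you pass through.

```python
def run_basic_workflow(obj, celltypist_model, kws_preprocess=None,
                       kws_cluster=None, plot_kinds=("heat", "matrix", "umap")):
    """Run the workflow methods named above, in order, on an Omics-like object.

    `obj` is expected to expose preprocess, cluster, annotate_clusters, and
    plot. The keyword dictionaries are passed through unchanged; which
    keywords each method accepts is not covered by this sketch.
    """
    obj.preprocess(**(kws_preprocess or {}))
    obj.cluster(**(kws_cluster or {}))
    obj.annotate_clusters(celltypist_model)
    obj.plot(kind=list(plot_kinds))
    return obj
```

With an object created as shown earlier, this would be called as run_basic_workflow(self, "<CellTypist model.pkl>").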

Here are the methods (applicable to scRNA-seq generally, not just perturbations) in order of a typical workflow (replace ... with argument specifications):

The following perturbation-specific methods can be executed optionally and in any order:

Spatial Data

Here is an example workflow to analyze spatial data (after preprocessing and clustering as described above):

self.calculate_centrality(n_jobs=4)
self.find_cooccurrence(figsize=(60, 20), kws_plot=dict(wspace=3))
self.find_svgs(genes=genes, method="moran", n_perms=10, kws_plot=dict(
    legend_fontsize="large"), figsize=(15, 15))
self.calculate_receptor_ligand(col_condition=False, p_threshold=0.001,
                               remove_ns=True, figsize=(20, 20))

Perturbation Data

Package Overview

Argument Conventions:

Certain arguments used throughout the corescpy package (including outside the corescpy.crispr_class.Crispr() class) follow conventions intended to foster readability, maintain consistency, and promote clarity, both for end-users and future developers.

Initialization Method Arguments

or - a dictionary, keyed by sample name, containing multiple `file_path`-compatible arguments for each sample (for integration).

```
crd = ""  # e.g., "/home/asline01/projects/corescpy/examples/data/crispr-screening/HH03"
subd = ""  # e.g., "filtered_feature_bc_matrix"
proto = ""  # e.g., "crispr_analysis/protospacer_calls_per_cell.csv"
file_path = dict(directory=crd, subdirectory_mtx=subd, file_protospacer=proto)
```

If you have the typical/default file tree/naming (e.g., "filtered_feature_bc_matrix" and "crispr_analysis/protospacer_calls_per_cell.csv" are contained in the directory defined in `file_path["directory"]`), you should be able to specify just `file_path=dict(directory=)` (e.g., `file_path=dict(directory="/home/projects/crispr-screening/crispr-screening/analysis/cellranger/cr_count_2023-05-15_1837/HH02/outs")`).
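Before passing `file_path` to the class, it can help to confirm that the expected entries exist on disk. The sketch below is illustrative only: `make_file_path` and `missing_entries` are hypothetical helpers (not corescpy functions), and the defaults are simply the typical names quoted above.

```python
from pathlib import Path

# Typical names quoted in the text above; adjust if your file tree differs.
DEFAULT_MTX = "filtered_feature_bc_matrix"
DEFAULT_PROTOSPACER = "crispr_analysis/protospacer_calls_per_cell.csv"


def make_file_path(directory, subdirectory_mtx=DEFAULT_MTX,
                   file_protospacer=DEFAULT_PROTOSPACER):
    """Build a file_path dictionary with the three keys shown above."""
    return dict(directory=str(directory), subdirectory_mtx=subdirectory_mtx,
                file_protospacer=file_protospacer)


def missing_entries(file_path):
    """List the expected sub-paths that do not exist under the directory."""
    root = Path(file_path["directory"])
    expected = [file_path["subdirectory_mtx"], file_path["file_protospacer"]]
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_entries` means both the matrix folder and the protospacer file were found. For multiple samples, one such dictionary per sample name could be collected into an outer dictionary.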

or

Crispr Object Properties

The corescpy.crispr_object.Crispr() class object is an end user's main way of interacting with the package as a whole. (See above for an overview of the workflow.) See the notebooks in /examples for additional help.

Major Attributes Descriptions

Accessing AnnData and Attributes Directly and Using Aliases

The AnnData object is stored in the attribute adata, so if your object is called self, you can access it using self.adata. (For examples going forward, we will assume the object is called self, but you can substitute any name you want by assigning the Crispr object to some other name instead.)

If you have multiple modalities, you can access the gene expression modality using either self.adata[self._assay] (having specified assay=the name of the RNA modality in your AnnData, which is usually "rna", in the Crispr() initialization method call when you first create your object) or using the alias self.rna.

Thus, if you have multi-modal data in self.adata, it's convenient to access the AnnData attributes specifically of your AnnData's gene expression modality using, for instance, the alias self.rna.obs instead of the long-form self.adata[self._assay].obs.

These aliases are not only convenient for their brevity, but also allow for a more generalizable way to call specific objects. For instance, if you wanted to write a script that frequently calls the .obs attribute of the RNA data, and you want it to work for both uni- and multi-modal data, instead of repeatedly writing, for example:

if self._assay is None:
    custom_function(self.adata.obs)
else:
    custom_function(self.adata[self._assay].obs)

you may simply say self.rna.obs, knowing it will work whether or not multiple assays exist in the object's AnnData attribute.

Finally, this approach saves memory: All these versions of the attribute are stored in a single place in memory so you can call the attributes in various ways without duplicating them and taking up more space.
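To illustrate how such an alias can work without copying data, here is a toy sketch. It is not the package's actual implementation: `AliasDemo` is a hypothetical stand-in, and plain Python objects are used in place of AnnData.

```python
from types import SimpleNamespace


class AliasDemo:
    """Toy stand-in for an object with `adata`, `_assay`, and an `rna` alias."""

    def __init__(self, adata, assay=None):
        self.adata = adata
        self._assay = assay

    @property
    def rna(self):
        # Uni-modal: return the data object itself.
        # Multi-modal: return the entry for the chosen assay.
        if self._assay is None:
            return self.adata
        return self.adata[self._assay]


# Uni-modal case: the alias is the very same object as adata.
uni = AliasDemo(SimpleNamespace(obs=["cell_1", "cell_2"]))
print(uni.rna is uni.adata)

# Multi-modal case: the alias points at the "rna" entry, with no copy made.
multi = AliasDemo({"rna": SimpleNamespace(obs=["cell_1"])}, assay="rna")
print(multi.rna.obs is multi.adata["rna"].obs)
```

Because the property returns a reference rather than a copy, both spellings reach the same object in memory.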


Resources for Background Knowledge

Pertpy (Perturbation/Conditions Analysis) Tutorials

Squidpy (Spatial) Tutorials

Single Cell Best Practices

Augur

Mixscape (Seurat)