bhavaygg / InSTAnT

Intracellular Transcriptomic Analysis Toolkit (InSTAnT)
MIT License
4 stars 0 forks source link

InSTAnT

DOI License: MIT PyPI - Version Downloads

InSTAnT is a toolkit to identify gene pairs which are d-colocalized from single molecule measurement data e.g. MERFISH or SeqFISH. A gene pair is d-colocalized when their transcripts are within distance d across many cells.

This repository contains implementation of PP Test and CPB test and demo on a U2OS dataset. The dataset can be downloaded from here (Moffit et al., 2016, PNAS ) - http://zhuang.harvard.edu/MERFISHData/data_for_release.zip

Paper Preprint Link - https://pubmed.ncbi.nlm.nih.gov/36747718/

UPDATE: Added support for AnnData input/output. Added Frequent Subgraph Mining.

We recommend using our environment.yml file to create a new conda environment to avoid issues with package incompatibility.

conda env create -f environment.yml

This will create a new conda environment with the name instant and has all dependencies installed.

Alternatively, the package can be installed using pip.

pip install sc-instant

First, we will initialize the Instant class object. This object will allow us to calculate the proximal pairs and find global colocalized genes. The primary argument is threads which controls the number of threads the program uses. If you run into memory issues with the default settings, we suggest setting the precision_mode to low. The arguments min_intensity and min_area are used only for MERFISH data preprocessing and can be skipped otherwise.

from InSTAnT import Instant
obj = Instant(threads = threads, precision_mode = 'high', min_intensity = 10**0.75, min_area = 3)

To load MERFISH data, we use the function preprocess_and_load_data(). preprocess_and_load_data() is used to preprocess and load MERFISH data only. The final data is stored as a pandas DataFrame and the following columns ['gene', 'uID', 'absX', 'absY']. Below is an example table for a 2D data (if the data is 3D, absZ column is also expected) -

gene uID absX absY
AKAP11 2 -1401.666 -2956.618
SIPA1L3 3 -1411.692 -2936.609
THBS1 925 -764.6989 -1604.828
obj.preprocess_and_load_data(expression_data = f'data/u2os/rep3/data.csv', barcode_data = 'data/u2os/codebook.csv')

If the data has been preprocessed, we can load it like below.

obj.load_preprocessed_data(data = f'data/u2os_new/data_processed.csv')

Since, subcellular spatial transcriptomic data is generally present as .csv file containing all the transcripts, the primary input format is .csv. However, we have included the ability to format the data into an AnnData object and save the results of the subsequent analysis in that object. We convert the input file to an AnnData object and save it with the same name into the same directory. All the subsequent analysis will be updated and saved into the same file. We also provide the functionality to save any of the results seperately as individual files.

Note - The function also supports loading an AnnData object. Currently, we only accept files in .h5ad format. Since, Anndata objects are not natively designed for subcellular datasets, we expect the .csv file containing the transcript information to be present in adata.uns['transcripts']. If you wish to run differential colocalization, cell type labels are required, which are expected in adata.obs while for spatial modulation, cell locations are required, which are expected in adata.uns['cell_locations']. If these files are not present in the AnnData object, they must be supplied to the specific funtions seperately. If an AnnData object is provided during this loading, all the subsequent outputs are saved and updated in the specified file.

obj.load_preprocessed_data(data = f'data/u2os_new/data.h5ad')

The dataframe is loaded in the object variable df and can be accessed through obj.df. After the data has been loaded, we can calculate each cell's proximal gene pairs using the run_ProximalPairs() function. The following arguments are used by run_ProximalPairs()

All subsequent analysis require run_ProximalPairs() to be run first to generate the p-value matrix for all cells.

Next, we can use the output to find which gene pairs are significantly colocalized globally using the run_GlobalColocalization() function. The following arguments are used by run_GlobalColocalization()

The final outputs are 3 CSV files -

(Note - For AnnData objects, only the unstacked file is saved in adata.uns['cpb_results'].)

Next, we will use InSTAnT's spatial modulation analyses to find spatially modulated gene pairs. We use the run_spatial_modulation() function for this. The following arguments are used by run_spatial_modulation()

uID x_centroid y_centroid
1054 5785.578 5699.618
1059 5757.25 5770.57
1067 5781.67 5238.706
1068 5784.023 5177.076
1069 5772.406 5173.247
1071 5757.652 5815.921

The output is an Excel file containing the gene pairs and the log-likelihood ratio of their spatial modulation. Below is an example -

g1g2 gene_id1 gene_id2 llr w_h1 p_g_h1 p_g_h0
MALAT1, MALAT1 MALAT1 MALAT1 113.2089 0.524039 0.803285 0.805375
FASN, TLN1 FASN TLN1 61.48302 0.404623 0.808779 0.808774
COL5A1, THBS1 COL5A1 THBS1 58.72194 0.391671 0.729187 0.733704
MALAT1, SRRM2 MALAT1 SRRM2 47.82571 0.353254 0.410644 0.409021
COL5A1, FBN2 COL5A1 FBN2 47.62671 0.348113 0.574713 0.576769

(Note - For AnnData objects. These results are stored in adata.uns['spatial_modulation'].)

Lastly, we will calculate the cell type specificity of InSTAnT categorized d-colocalized gene pair. We call it Differential Colocalization and the function run_differentialcolocalization() is used for it. There are 3 different modes in which it can be run -

Arguments:

uID cell_type
83434 27
83431 27
83447 27
7469 27
91749 27
83131 27

The function will create a folder named folder_name in the file_location location. The folder will contain a folder for each of the cell types selected to be analyzed and for each cell type contains a single file will all relevant outputs.

(Note - For AnnData objects. These results are stored in adata.uns['differential_colocalization']. adata.uns['differential_colocalization'] is a dictionary with keys based on the analysis done. For 1va and ava, the key is {cell_type} and for 1v1, the key is {cell_type)_vs_{cell_type_2}).

UPDATE: Frequent Subgraph Mining.

Finds networks of genes colocalized in many cells using gSpan. The networks found are potential candidates for groups of cells and genes exhibiting interesting subcellular patterns, in this case colocalization. Such colocalization networks could potentially be underlying factors for specific biological processes.

obj.run_fsm(n_vertices = 4, alpha = 0.001)

Arguments:

THe function will create a dataframe in adata.uns with all the gene pair networks found. Ny default the key is nV{x}_cliques where x is the number of vertices but using the attribute fsm_name it can be changed.

Citation

@article{kumar2023intracellular,
  title={Intracellular Spatial Transcriptomic Analysis Toolkit (InSTAnT)},
  author={Kumar, Anurendra and Schrader, Alex W and Boroojeny, Ali Ebrahimpour and Asadian, Marisa and Lee, Juyeon and Song, You Jin and Zhao, Sihai Dave and Han, Hee-Sun and Sinha, Saurabh},
  journal={Research Square},
  year={2023},
  publisher={American Journal Experts}
}