cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.
52 stars 7 forks source link

How can peaks associated with a given gene be extracted? #10

Closed shaistamadad closed 2 years ago

shaistamadad commented 2 years ago

Hi, I want to extract the peaks associated with a given gene based on your RP model. How can I do that?

Many thanks for your help!

AllenWLynch commented 2 years ago

This is a feature we are currently working on. I will provide some code for you in the interim later today

shaistamadad commented 2 years ago

I will really appreciate that, thanks so much!!

AllenWLynch commented 2 years ago

Okay here's the code:

"get_normalized_params" gets the parameters of the LITE model, so you can get the upstream decay rate of influence of nearby cis-sites:

litemodel_params = litemodel.get_model('LEF1').get_normalized_params()

Use the "distance" key to get the decay distances (in KB, so multiply by 1000 to get bp). Upstream decay is [0], downstream is [1]. Note, upstream and downstream is relative to the gene's strand:

upstream_range = litemodel_params['distance'][0] * 1000

Next we find the idx the gene we wish to analyze to lookup TSS-peak distances:

gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]

The distance between every peak and every gene TSS is stored in adata.varm['distance_to_TSS'], which is a sparse matrix of (n_peaks x n_genes). We index on our gene of interest to get a vector of all the distances between peaks and that gene's TSS:

relevant_peaks = atac_data.varm['distance_to_TSS'].T[gene_idx] #

Peaks which are upstream have distance > 0, downstream are < 0. This line finds all peaks which are within 3*decay rate upstream of the TSS:

upstream_peaks = (relevant_peaks.data < 3*upstream_range) & (relevant_peaks.data > 0)

Then, we use the mask from the step above to index on the atac_data matrix to get the peak locations:

atac_data.var.iloc[ relevant_peaks.indices[upstream_peaks] ]

All together:

litemodel_params = litemodel.get_model('LEF1').get_normalized_params()
upstream_range = litemodel_params['distance'][0] * 1000
gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]
relevant_peaks = atac_data.varm['distance_to_TSS'].T[gene_idx]
upstream_peaks = (relevant_peaks.data < 3*upstream_range) & (relevant_peaks.data > 0)
atac_data.var.iloc[
    relevant_peaks.indices[upstream_peaks]
]
shaistamadad commented 2 years ago

thanks very much for this.

I get an error when I try to find index for the given gene in the atac_data.uns["distance_to_TSS_genes"]

gene_idx = np.argwhere(atac_data.uns['distance_to_TSS_genes'] == 'LEF1')[0,0]

IndexError: index 0 is out of bounds for axis 0 with size 0

This gene is present in the gene list in both rna and atac TSS:

if "LEF1" in atac_data.uns['distance_to_TSS_genes']:
    print("gene is present")
AllenWLynch commented 2 years ago

You may be using different names for you genes? Could LEF1 be lowercased in your TSS annotations?