Compute gene scores without GC content annotation

marie3003 commented 2 years ago

Hey, thanks a lot for your package. It is very useful:).

I'm currently working with a multiome dataset for which I want to calculate gene scores for the ATAC data to compare them to the gene expression data. I'm using the data set from the NeurIPS Competition Open Problems in Single Cell Analysis (https://openproblems.bio/neurips_docs/data/dataset/). However, in my dataset there is no GC annotation. That's why the method genescore.prepare_multiome_anndata() crashes and I'm not able to use the follow up methods to compute the gene scores. Do you provide a method to calculate this or do you have another easy way to add this annotation to the data? Or is there another way to calculate the gene scores without the GC annotation.

Thanks a lot for your help. Best regards, Marie Becker

ManuSetty commented 2 years ago

At this point, we do not yet provide a function of computing GC content but we will look to add this feature.

GC content represents the fraction of Gs or Cs in the peak sequence. This can be computed using Bioconda or Bioconductor packages. We have been using ArchR for ATAC preprocessing which generates this information automatically. GC content is unfortunately a requirement for computing gene scores since it is necessary for background peak identification for expression-accessibility correlations

marie3003 commented 2 years ago

Thanks a lot for the suggestions. This feature would be very useful.

I was able to calculate the gene score using genomepy and biopython. I extracted the peak sequences with genomepy and calculated the gc content with the GC function from Bio.SeqUtils. It was pretty fast and easy actually:).

ManuSetty commented 2 years ago

Awesome ! Do you mind sharing the code snippet ? We will add that into the functionality in the repo.

On Fri, May 20, 2022 at 02:39 Marie Becker @.***> wrote:

Thanks a lot for the suggestions. This feature would be very useful.

I was able to calculate the gene score using genomepy and biopython. I extracted the peak sequences with genomepy and calculated the gc content with the GC function from Bio.SeqUtils. It was pretty fast and easy actually:).

— Reply to this email directly, view it on GitHub https://github.com/dpeerlab/SEACells/issues/12#issuecomment-1132694592, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQNH3BG7MO3YBJGDYHPLLTVK5MW3ANCNFSM5WMMXZCA . You are receiving this because you commented.Message ID: @.***>

-- Manu

marie3003 commented 2 years ago

Hi Manu,

sure, no problem. Thank you for adding this! Here is the code I used to download the right genome and to add the GC annotation.

import numpy as np
import anndata as ad

import genomepy
from Bio.SeqUtils import GC

adata_atac = ad.read_h5ad('../write/filtered_data_atac.h5ad')

# download genome from NCBI
genomepy.install_genome(name='GRCh38', provider='NCBI', genomes_dir = ''../data') # took about 9 min
genome = genomepy.Genome(name = 'GRCh38', genomes_dir = '../data')

GC_content_list = []

for i, region in enumerate(adata_atac.var_names):
    chromosome, start, end = region.split('-')
    chromosome = chromosome[3:]

    # get sequence
    sequence = str(genome.get_seq(name = chromosome, start = int(start), end = int(end)))

    # calculate GC content
    GC_content = GC(str(sequence))
    GC_content_list.append(GC_content)

# GC content ranges from 0% - 100%, should be 0 to 1
adata_atac.var['GC'] = GC_content_list
adata_atac.var.GC = adata_atac.var.GC/100

To find the right genome name and provider, I used

for provided_genome in genomepy.search('GRCh38', provider=None):
    print(provided_genome)

Best regards, Marie

ManuSetty commented 2 years ago

Thank you! @sitarapersad Can you please add this to the utils.

dpeerlab / SEACells

Compute gene scores without GC content annotation #12