aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
Other
59 stars 12 forks source link

Combining bulk ATAC-seq with scATAC-seq #35

Closed ivanferrreira closed 2 years ago

ivanferrreira commented 2 years ago

I have a high-resolution scATAC dataset for one tissue, but I would like to combine this data with bulk ATAC-seq from different tissues/embryonic stages for topic modeling as a means of identifying pleiotropic/shared topics between different cells types and developmental stages.

Is there a way that I can do this by adding my bulk ATAC-seq BAM/BigWigs files to pycisTopic after the pseudobulk step, and then jointly run the downstream steps?

Thanks - any help on this would be greatly appreaciated!

-Ivan

cbravo93 commented 2 years ago

Hi @ivanferrreira !

Yes, this is technically possible! Our approach to obtain topic models from bulk data is to simulate single cells, as topic modelling likes sparse data. We have used this approach in Fig S3 in the first cisTopic paper, Fig 3 in the SCENIC+ paper and Minnoye et al. However, in all cases the data was simulated from bulk data from very homogenous samples (cell lines); is this the case for your bulks? We also haven't tested combining scATAC-seq and simulated cells in the same analysis, but it could work; you may need to apply batch correction as in here.

Here goes the code to simulate fragments files from bulk, then you can treat them as regular scATAC-seq for the rest of the pipeline. You can also use peaks called in the original bulk sample instead of the pseudobulk of the simulated cells to create the consensus peaks:

"""
Simulate fragments
"""
import os
import numpy
import pandas as pd
import pysam

def makeSimulatedCell(numReads, bulkFile, name, directory, debug=False):
    """
    Create a simulated cell.

    Make a simulated cell, with a (roughly) equal number of reads as another
    cell. Sort the resulting bam file. RNG seed is a concatination of the
    number of reads and the name of the cell, therefore using the same
    parameters should return the same result.

    Parameters
    ----------
    numReads : int or list of int
        Number of reads to generate for the simulated cell.
    bulkFile : string
        String of path to BAM/SAM file of bulk sample to generate simulated
        cell from.
    name : string or list of strings
        Name to give simulated cell. Automatically recieves ".sorted.bam"
        suffix and will generate index.
    directory : string
        Directory to put simulated cell BAM file in.
    """
    import random
    bulk = pysam.AlignmentFile(bulkFile, "rb")
    bulkReads = [read for read in bulk]
    if type(numReads) == int:
        numReads = [numReads]
    if type(name) == str:
        name = [name]
    for num, sample in enumerate(name):
        random.seed(str(numReads[num]) + sample)
        with pysam.AlignmentFile(os.path.join(directory,
                                 sample + ".bam"),
                                 "wb", header=bulk.header) as simCell:
            for read in random.sample(bulkReads, numReads[num]):
                simCell.write(read)
        pysam.sort(os.path.join(directory, sample + ".bam"),
                   '-T tmp',
                   '-o',
                   os.path.join(directory, sample + ".sorted.bam"),
                   catch_stdout=False)
        os.remove(os.path.join(directory, sample + ".bam"))
        pysam.index(os.path.join(directory, sample + ".sorted.bam"),
                    catch_stdout=False)

path_to_data='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/data/bam/atac/'
sample_dict={}
sample_dict['GM12878']=path_to_data+'ATAC_GM12878_1_ENCFF415FEC.bam'
sample_dict['HCT116']=path_to_data+'ATAC_HCT116_1_ENCFF724QHH.bam'
sample_dict['HepG2']=path_to_data+'ATAC_HepG2_1_ENCFF239RGZ.bam'
sample_dict['IMR90']=path_to_data+'ATAC_IMR-90_1_ENCFF848XMR.bam'
sample_dict['K562']=path_to_data+'ATAC_K562_1_ENCFF512VEZ.bam'
sample_dict['MCF7']=path_to_data+'ATAC_MCF-7_1_ENCFF772EFK.bam'
sample_dict['PC3']=path_to_data+'ATAC_PC-3_1_ENCFF516GDK.bam'
sample_dict['Panc2']=path_to_data+'ATAC_Panc2_1_ENCFF836WDC.bam'

# This will generate a single bam per simulated cell
out_path='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/data/single-cell_bam/atac'
for key in sample_dict.keys():
    numReads=numpy.random.randint(20000, high=20001, size=501)
    name=list()
    for i in range(1,501):
            cell_name= key + '_' + str(i)
            name.append(cell_name)
    makeSimulatedCell(numReads, sample_dict[key], name, out_path, debug=False)

# Make fragments files
export PATH="/staging/leuven/stg_00002/lcb/cbravo/software/bedops/bin/:$PATH"
dir='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/DPCL/data/single-cell_bam/atac'
cd ${dir}
for file in ${dir}/*.bam
do
    fileName=`basename ${file}`
    fileName=${fileName%.sorted.bam*}
    echo ${fileName}
    bam2bed < $file | cut -f1,2,3 | awk '{print $0, "'$fileName'"}' | sort | uniq -c | awk 'BEGIN {OFS="\t"}; {print $2,$3,$4,$5,$1}' > ${fileName}.tsv
done

Let us know if you have further questions!

C