YiPeng-Gao / scDaPars

Dynamic Analysis of Alternative Polyadenylation from single-cell RNA-seq (scDaPars)
10 stars 1 forks source link

generated one BAM file for each single cell #7

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hello, can you provide an operation for this step? Write a script to operate this step. The 1G file is split into files for each cell. This process takes 10h. Each bam file will be split into thousands of files

image

YiPeng-Gao commented 3 years ago

You can use "pysam.AlignmentFile" to open one bam file for each single cell and then extract and combine reads based on the cell barcode. For example: pysam.AlignmentFile("cell_bam_files/cell.bam", "wb", template = bam_file)

ghost commented 3 years ago

What is the software parameter setting of UMI-tools?

After using filterbam to filter bam files, remove reads with duplicated UMIs using UMI-tools dedup After setting the corresponding parameters, UMi displays the error content: WARNING The BAM did not contain any valid reads/read pairs for deduplication

image

YiPeng-Gao commented 3 years ago

What is the software parameter setting of UMI-tools?

After using filterbam to filter bam files, remove reads with duplicated UMIs using UMI-tools dedup After setting the corresponding parameters, UMi displays the error content: WARNING The BAM did not contain any valid reads/read pairs for deduplication

image

You can use the command "umi_tools dedup -I input.bam -S output.bam --method=unique --extract-umi-method=tag --umi-tag=UB --cell-tag=CB" to remove UMI duplcation. However, I am not the developer for UMI-tools, so I cannot help you debug your code. I will only answer questions related to the scDaPars algorithm.

YiPeng-Gao commented 3 years ago

Run 87M files, but the speed is relatively slow, run 4h, analyze files to 1/4

!/usr/bin/env python

-- coding:utf-8 --

import pysam import sys import os import re import threading

unsplit1 = sys.argv[1] unsplit = str(unsplit1)

out_dir1 = sys.argv[2] out_dir = str(out_dir1)

def run(unsplit_file,out_dir): itr = 0 CB_hold = 'unset' samfile = pysam.AlignmentFile(unsplit_file, "rb") for read in samfile.fetch(until_eof=True): if 'CB' in str(read): print("read:" + str(read)) CB_itr = read.get_tag('CB') print("CB_itr : "+str(CB_itr)) if( CB_itr!=CB_hold or itr==0): if(itr!=0): splitfile = pysam.AlignmentFile('%s/CB%s.bam'%(out_dir,itr), "wb", template=samfile) split_file.close() CB_hold = CB_itr itr = itr + 1 splitfile = pysam.AlignmentFile('%s/CB%s.bam'%(out_dir,itr), "wb", template=samfile) split_file.write(read) split_file.close() samfile.close()

if name == 'main': for i in range(5): t = threading.Thread(target=run(unsplit,out_dir)) t.start()

It is slow, but yours seems too slow. You can first open the number of bam files as the same number of cells without closing. Then write reads into each of those empty bam files. It will save you some time.

ghost commented 3 years ago

Do I need to split each bam file into each cell bam file? Can it be split into bam files of a single chromosome for analysis? In this step, splitting bam will consume more resources and more time If there are many samples, the operation will consume more time and the steps will be more complicated How long did you spend on this step?

YiPeng-Gao commented 3 years ago

Do I need to split each bam file into each cell bam file? Can it be split into bam files of a single chromosome for analysis? In this step, splitting bam will consume more resources and more time If there are many samples, the operation will consume more time and the steps will be more complicated How long did you spend on this step?

No you cannot just separate into chromosomes, you have to separate into different cells. It takes me couple of hours to perform this one in a file contain ~3000 single cells.

ghost commented 3 years ago

Can you submit the code of the operation?