bam-filter split bam file by groups of barcodes in single cell ATAC bam

crazyhottommy commented 5 years ago

Hi Brent,

I want to split a 10x single cell ATACseq bam files by group of cells. each cell has a barcode, and I have txt files containing the barcodes for each group. The barcode is in the tag CB:Z:AAACTGCAGAGCAGCT-1 see here what I have tried https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?view

what's the best way to do it with bam-filter?

Thanks! Tommy

brentp commented 5 years ago

I would just write an hts-nim program to do this. Here is some untested, hastily written code to get you started:

import hts
import os
import strutils
import tables
var ibam:Bam

# lookup from cb -> cluster
var clusterTbl = initTable[string,string]()
# lookup from cluster -> bam
var tbl = initTable[string, Bam]()

for x in paramStr(1).lines:
  var toks = x.strip().split(",")
  if toks[0] == "Barcode": continue
  clusterTbl[toks[0]] = toks[1]

if not open(ibam, paramStr(2)):
   quit "couldn't open bam"

for aln in ibam:
  var cbt = tag[string](aln, "CB")
  if cbt.isNone: continue
  var cb = cbt.get

  if cb notin clusterTbl: continue
  var cluster = clusterTbl[cb]
  if cluster notin tbl:
    var obam: Bam
    if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb"):
      quit "couldn't open bam for writing"
    obam.write_header(ibam.hdr)
    tbl[cluster] = obam
  tbl[cluster].write(aln)

for k, bam in tbl:
  bam.close()
ibam.close()

crazyhottommy commented 5 years ago

Thanks @brentp ! I will test this and compare the speed with pysam.

brentp commented 5 years ago

actually, I didn't see it needed to look at the cluster id. I just split by cb tag. you'll have to add the part where it adds a lookup from cb -> cluster then writes to the bam of that cluster.

crazyhottommy commented 5 years ago

okay! I am not familiar with the syntax of nim, so I did not notice you did not do that immediately. thanks for the heads-up

brentp commented 5 years ago

I updated the comment to include the lookup from cb -> cluster. you run as $prog cluster.txt $bam

this also elides some error checking for brevity.

crazyhottommy commented 5 years ago

awesome! saved me some time. thanks.

crazyhottommy commented 5 years ago

for

if cb is not present in the cluster file, this will raise an error

I actually want to pass instead of giving errors.

in python dict, I use dict.get() and return None if CB not present in the cluster table. How to do it in nim? I thought it would be faster for you to update the code than I googling around :) thanks!

brentp commented 5 years ago

I updated the code to handle that. Also, here is a binary, just unzip, chmod +x and run. If you have a relatively recent libc, it should work. Still depends on libhts.so. tommy.gz

crazyhottommy commented 5 years ago

great! I am curious to see how fast it is compared with pysam. will report back. thanks again!

brentp commented 5 years ago

yeah. interested to hear. you can add threads=2 (or 3) to the open calls to get a bit more speed on de/compressing the bam which will be the most CPU time.

crazyhottommy commented 5 years ago

good to know the tip. I have not figured out how to multi-thread for pysam. so I will compare single thread for both at the moment.

brentp commented 5 years ago

ok. and if compiling on your own. remember to build with -d:release

crazyhottommy commented 5 years ago

thanks! I was compiling myself and it said succeeded.

echo  $LD_LIBRARY_PATH
/n/helmod/apps/centos7/Core/bzip2/1.0.6-fasrc01/lib:/n/helmod/apps/centos7/Core/xz/5.2.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib:/n/helmod/apps/centos7/Core/mpc/1.0.3-fasrc06/lib64:/n/helmod/apps/centos7/Core/mpfr/3.1.5-fasrc01/lib64:/n/helmod/apps/centos7/Core/gmp/6.1.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/libpng/1.6.25-fasrc01/lib:/n/helmod/apps/centos7/Core/hdf5/1.10.1-fasrc03/lib::/n/home02/mtang/apps/elastix/lib:/n/home02/mtang/apps/htslib-1.6/libhts.so

./split_scATAC_bam
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

#same with the file you uploaded 
./tommy
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information

but I have libhts in LD_LIBRARY_PATH

brentp commented 5 years ago

LD_LIBRARY_PATH is like PATH it takes directories, not files.

crazyhottommy commented 5 years ago

sorry my brain was not working...after add the folder to LD_LIBRARY_PATH

time ./tommy clusters.csv possorted_bam.bam
src/hts/simpleoption.nim(30) get
Error: unhandled exception: Can't obtain a value from a `none` [KeyError]

real    0m0.349s
user    0m0.008s
sys     0m0.011s

it has to do with the dictionary I think, sometimes CB tag maybe not present?

brentp commented 5 years ago

updated the snippet to handle this. note the line with isNone.

crazyhottommy commented 5 years ago

Thanks! was googling around for how to test a string is none or not in nim. tried cb.isNil... thanks for saving me. need to learn basics of nim. it is running now will report back.

brentp commented 5 years ago

that's returning an option value from the tag method. see docs here: https://brentp.github.io/hts-nim/hts/bam.html#tag%2CRecord%2Cstring then you can follow the link to the option: https://brentp.github.io/hts-nim/hts/simpleoption.html#Option I was previously calling get, but as you can see from the description, it throws and error if the value (tag) is absent.

crazyhottommy commented 5 years ago

Thanks for the links!

just checked, it is writing to sam not bam. changed if not open(obam, "out-cluster-" & cluster & ".bam", mode="w") to if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb")

one more little thing, if I want to skip the header, how should I do it?

for x in paramStr(1).lines:
  var toks = x.strip().split(",")
  clusterTbl[toks[0]] = toks[1]

brentp commented 5 years ago

I updated the snippet again (also to write bam). Basically:

  if toks[0] == "Barcode": continue

or you can track the line number yourself and if line_no == 1: continue

crazyhottommy commented 5 years ago

thanks! that works, but one has to know the column names beforehand. so tracking the line number is better. in python3, I can do header = next(csv) to skip it. the pysam version seems to work, see https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?both#use-pysam

crazyhottommy commented 5 years ago

reporting back. It is 1.5x faster using hts-nim than pysam. impressive.

crazyhottommy commented 5 years ago

the bam file after splitting is a bit different:

# by pysam
samtools flagstat cluster10.bam
23691881 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 secondary
0 + 0 supplementary
13520847 + 0 duplicates
23499718 + 0 mapped (99.19% : N/A)
23691701 + 0 paired in sequencing
11846803 + 0 read1
11844898 + 0 read2
23310179 + 0 properly paired (98.39% : N/A)
23383493 + 0 with itself and mate mapped
116045 + 0 singletons (0.49% : N/A)
40060 + 0 with mate mapped to a different chr
20207 + 0 with mate mapped to a different chr (mapQ>=5)

# by hts-nim 
samtools flagstat atac_v1_pbmc_5k/outs/out-cluster-10.bam
23174666 + 0 in total (QC-passed reads + QC-failed reads)
176 + 0 secondary
0 + 0 supplementary
13207244 + 0 duplicates
23034081 + 0 mapped (99.39% : N/A)
23174490 + 0 paired in sequencing
11587245 + 0 read1
11587245 + 0 read2
22873454 + 0 properly paired (98.70% : N/A)
22942804 + 0 with itself and mate mapped
91101 + 0 singletons (0.39% : N/A)
38116 + 0 with mate mapped to a different chr
19077 + 0 with mate mapped to a different chr (mapQ>=5)

pysam version

import pysam
import csv

cluster_dict = {}
with open('clusters.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #skip header
    header = next(csv_reader)
    for row in csv_reader:
        cluster_dict[row[0]] = row[1]

clusters = set(x for x in cluster_dict.values())
fin = pysam.AlignmentFile("atac_v1_pbmc_5k_possorted_bam.bam", "rb")

# open the number of bam files as the same number of clusters, and map the out file handler to the cluster id, write to a bam with wb
fouts_dict = {}
for cluster in clusters:
    fout = pysam.AlignmentFile("cluster" + cluster + ".bam", "wb", template = fin)
    fouts_dict[cluster] = fout

for read in fin:
    tags = read.tags
    # the 8th item is the CB tag
    CB_list = [ x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]
    # the bam files may contain reads not in the final clustered barcodes
    # will be None if the barcode is not in the clusters.csv file
    cluster_id = cluster_dict.get(cell_barcode)
    if cluster_id:
        fouts_dict[cluster_id].write(read)

## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
    fout.close()

There are a little more reads pulled out by pysam.

download the bam file and the clusters.csv files from http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_analysis.tar.gz

http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_possorted_bam.bam

crazyhottommy commented 5 years ago

hold on...I just noticed I was using different bam files to begin with, one is the one downloaded and the other one is from fastq that I processed with cellranger. Let me test on the same bam...

brentp commented 5 years ago

I won't have time to debug this. The code looks fine to me. If you have some difference, I would start by finding a single read that's in pysam output but not hts-nim (or vice-versa) and figuring out why.

crazyhottommy commented 5 years ago

sure. totally understand.

brentp commented 5 years ago

were you able to resolve this?

crazyhottommy commented 5 years ago

was distracted...working on it now. will report back.

crazyhottommy commented 5 years ago

update, not sure why pysam pull out some read records without the CB tag.

## by pysam
samtools flagstat cluster12.bam
16771134 + 0 in total (QC-passed reads + QC-failed reads)
140 + 0 secondary
0 + 0 supplementary
9381485 + 0 duplicates
16637530 + 0 mapped (99.20% : N/A)
16770994 + 0 paired in sequencing
8385399 + 0 read1
8385595 + 0 read2
16508476 + 0 properly paired (98.43% : N/A)
16556331 + 0 with itself and mate mapped
81059 + 0 singletons (0.48% : N/A)
24024 + 0 with mate mapped to a different chr
12056 + 0 with mate mapped to a different chr (mapQ>=5)

# by hts-nim
samtools flagstat out-cluster-12.bam
16407030 + 0 in total (QC-passed reads + QC-failed reads)
136 + 0 secondary
0 + 0 supplementary
9162203 + 0 duplicates
16309886 + 0 mapped (99.41% : N/A)
16406894 + 0 paired in sequencing
8203447 + 0 read1
8203447 + 0 read2
16201216 + 0 properly paired (98.75% : N/A)
16246368 + 0 with itself and mate mapped
63382 + 0 singletons (0.39% : N/A)
22746 + 0 with mate mapped to a different chr
11304 + 0 with mate mapped to a different chr (mapQ>=5)

There are more reads pulled out by pysam than hts-nim

samtools view cluster12.bam | cut -f1 | sort -u > cluster12_readIDs.txt

samtools view out-cluster-12.bam | cut -f1 | sort -u > out-cluster-12_readIDs.txt

comm -23 cluster12_readIDs.txt out-cluster-12_readIDs.txt > cluster12_specific_readIDs.txt

I then checked one of the reads that pysam specific. There is no CB tag, how come it was pulled out by my pysam code, to begin with...?

samtools view cluster12.bam | grep -w A00519:57:HG3JCDMXX:1:1101:10375:20400
A00519:57:HG3JCDMXX:1:1101:10375:20400  1187    chr8    21769249        60      49M     =       21769275        76      ACCCGGAAGCGAGAAGGAGAGGCTGATAACGGGGCTGGCCCAGGCTGGG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F       NM:i:0  MD:Z:49 MC:Z:50M        AS:i:49 XS:i:20 CR:Z:GGCAATTTGCTGGAAT   CY:Z:,F,,,,,F,,F,F:,,      BC:Z:CTATACGC   QT:Z:FFFFFFFF   GP:i:1414564938 MP:i:1414565014 MQ:i:60 RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:1

brentp commented 5 years ago

that is odd. The pysam code looks sane to me.

crazyhottommy commented 5 years ago

I can not figure out... this bothers me! It has to do with the paired-end reads?

samtools view atac_v1_pbmc_5k_possorted_bam.bam | grep -w "A00519:57:HG3JCDMXX:2:1318:26684:24220"
A00519:57:HG3JCDMXX:2:1318:26684:24220  163     chr1    30687631        60      49M     =       30687748        167     TAGATGGGATACTTCATATCATCCCCCTGCCCTTCCATTGCATTAAGAT      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F       NM:i:0  MD:Z:49 MC:Z:50M        AS:i:49
        XS:i:19 CR:Z:ATGGCTAGGTGTCGAG   CY:Z::FFF::,FFFF,:F,,   BC:Z:AGGGCGTT   QT:Z::FFFFF:F   GP:i:30687630   MP:i:30687797   MQ:i:60
        RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
A00519:57:HG3JCDMXX:2:1318:26684:24220  83      chr1    30687748        60      50M     =       30687631        -167    TCTTCCAATCCATCATCCTACTTTCAGGCTGGGAAGAGCGTGTTCTGATT     FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF      NM:i:0  MD:Z:50 MC:Z:49M        AS:i:50
        XS:i:20 CR:Z:ATGGCTAGGTGTCGAG   CY:Z::FFF::,FFFF,:F,,   BC:Z:AGGGCGTT   QT:Z::FFFFF:F   GP:i:30687797   MP:i:30687630   MQ:i:60
        RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2

There are two reads with the same id from the original bam. only one of them appeared in the split bam by pysam.

brentp commented 5 years ago

it's a bug in your pysam code. you have:

    CB_list = [ x for x in tags if x[0] == "CB"]
    if CB_list:
        cell_barcode = CB_list[0][1]

so if CB_list is empty, then cell_barcode is left as the value of the last seen CB tag. you need an else that does continue

crazyhottommy commented 5 years ago

thanks a billion! This saves me. will update the code.

crazyhottommy commented 5 years ago

just confirm that now pysam and hts-nim produce identical results after updating the code. thanks!

vertesy commented 5 years ago

subset-bam from 10X seems to be fit for the job. Did not try yet.

brentp / hts-nim-tools

bam-filter split bam file by groups of barcodes in single cell ATAC bam #5

if cb is not present in the cluster file, this will raise an error