Closed crazyhottommy closed 5 years ago
I would just write an hts-nim program to do this. Here is some untested, hastily written code to get you started:
import hts
import os
import strutils
import tables
var ibam:Bam
# lookup from cb -> cluster
var clusterTbl = initTable[string,string]()
# lookup from cluster -> bam
var tbl = initTable[string, Bam]()
for x in paramStr(1).lines:
var toks = x.strip().split(",")
if toks[0] == "Barcode": continue
clusterTbl[toks[0]] = toks[1]
if not open(ibam, paramStr(2)):
quit "couldn't open bam"
for aln in ibam:
var cbt = tag[string](aln, "CB")
if cbt.isNone: continue
var cb = cbt.get
if cb notin clusterTbl: continue
var cluster = clusterTbl[cb]
if cluster notin tbl:
var obam: Bam
if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb"):
quit "couldn't open bam for writing"
obam.write_header(ibam.hdr)
tbl[cluster] = obam
tbl[cluster].write(aln)
for k, bam in tbl:
bam.close()
ibam.close()
Thanks @brentp ! I will test this and compare the speed with pysam.
actually, I didn't see it needed to look at the cluster id. I just split by cb tag. you'll have to add the part where it adds a lookup from cb -> cluster then writes to the bam of that cluster.
okay! I am not familiar with the syntax of nim, so I did not notice you did not do that immediately. thanks for the heads-up
I updated the comment to include the lookup from cb -> cluster.
you run as $prog cluster.txt $bam
this also elides some error checking for brevity.
awesome! saved me some time. thanks.
for
if cb is not present in the cluster file, this will raise an error
I actually want to pass instead of giving errors.
in python dict, I use dict.get() and return None if CB not present in the cluster table. How to do it in nim? I thought it would be faster for you to update the code than I googling around :) thanks!
I updated the code to handle that. Also, here is a binary, just unzip, chmod +x and run. If you have a relatively recent libc, it should work. Still depends on libhts.so. tommy.gz
great! I am curious to see how fast it is compared with pysam. will report back. thanks again!
yeah. interested to hear. you can add threads=2
(or 3) to the open
calls to get a bit more speed on de/compressing the bam which will be the most CPU time.
good to know the tip. I have not figured out how to multi-thread for pysam. so I will compare single thread for both at the moment.
ok. and if compiling on your own. remember to build with -d:release
thanks! I was compiling myself and it said succeeded.
echo $LD_LIBRARY_PATH
/n/helmod/apps/centos7/Core/bzip2/1.0.6-fasrc01/lib:/n/helmod/apps/centos7/Core/xz/5.2.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib64:/n/helmod/apps/centos7/Core/gcc/7.1.0-fasrc01/lib:/n/helmod/apps/centos7/Core/mpc/1.0.3-fasrc06/lib64:/n/helmod/apps/centos7/Core/mpfr/3.1.5-fasrc01/lib64:/n/helmod/apps/centos7/Core/gmp/6.1.2-fasrc01/lib64:/n/helmod/apps/centos7/Core/libpng/1.6.25-fasrc01/lib:/n/helmod/apps/centos7/Core/hdf5/1.10.1-fasrc03/lib::/n/home02/mtang/apps/elastix/lib:/n/home02/mtang/apps/htslib-1.6/libhts.so
./split_scATAC_bam
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information
#same with the file you uploaded
./tommy
could not load: libhts.so
compile with -d:nimDebugDlOpen for more information
but I have libhts in LD_LIBRARY_PATH
LD_LIBRARY_PATH
is like PATH
it takes directories, not files.
sorry my brain was not working...after add the folder to LD_LIBRARY_PATH
time ./tommy clusters.csv possorted_bam.bam
src/hts/simpleoption.nim(30) get
Error: unhandled exception: Can't obtain a value from a `none` [KeyError]
real 0m0.349s
user 0m0.008s
sys 0m0.011s
it has to do with the dictionary I think, sometimes CB tag maybe not present?
updated the snippet to handle this. note the line with isNone
.
Thanks! was googling around for how to test a string is none or not in nim. tried cb.isNil... thanks for saving me. need to learn basics of nim. it is running now will report back.
that's returning an option value from the tag method. see docs here: https://brentp.github.io/hts-nim/hts/bam.html#tag%2CRecord%2Cstring then you can follow the link to the option: https://brentp.github.io/hts-nim/hts/simpleoption.html#Option I was previously calling get, but as you can see from the description, it throws and error if the value (tag) is absent.
Thanks for the links!
just checked, it is writing to sam not bam.
changed if not open(obam, "out-cluster-" & cluster & ".bam", mode="w")
to if not open(obam, "out-cluster-" & cluster & ".bam", mode="wb")
one more little thing, if I want to skip the header, how should I do it?
for x in paramStr(1).lines:
var toks = x.strip().split(",")
clusterTbl[toks[0]] = toks[1]
I updated the snippet again (also to write bam). Basically:
if toks[0] == "Barcode": continue
or you can track the line number yourself and if line_no == 1: continue
thanks! that works, but one has to know the column names beforehand. so tracking the line number is better.
in python3, I can do header = next(csv) to skip it.
the pysam
version seems to work, see https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?both#use-pysam
reporting back. It is 1.5x faster using hts-nim
than pysam
. impressive.
the bam file after splitting is a bit different:
# by pysam
samtools flagstat cluster10.bam
23691881 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 secondary
0 + 0 supplementary
13520847 + 0 duplicates
23499718 + 0 mapped (99.19% : N/A)
23691701 + 0 paired in sequencing
11846803 + 0 read1
11844898 + 0 read2
23310179 + 0 properly paired (98.39% : N/A)
23383493 + 0 with itself and mate mapped
116045 + 0 singletons (0.49% : N/A)
40060 + 0 with mate mapped to a different chr
20207 + 0 with mate mapped to a different chr (mapQ>=5)
# by hts-nim
samtools flagstat atac_v1_pbmc_5k/outs/out-cluster-10.bam
23174666 + 0 in total (QC-passed reads + QC-failed reads)
176 + 0 secondary
0 + 0 supplementary
13207244 + 0 duplicates
23034081 + 0 mapped (99.39% : N/A)
23174490 + 0 paired in sequencing
11587245 + 0 read1
11587245 + 0 read2
22873454 + 0 properly paired (98.70% : N/A)
22942804 + 0 with itself and mate mapped
91101 + 0 singletons (0.39% : N/A)
38116 + 0 with mate mapped to a different chr
19077 + 0 with mate mapped to a different chr (mapQ>=5)
pysam version
import pysam
import csv
cluster_dict = {}
with open('clusters.csv') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
#skip header
header = next(csv_reader)
for row in csv_reader:
cluster_dict[row[0]] = row[1]
clusters = set(x for x in cluster_dict.values())
fin = pysam.AlignmentFile("atac_v1_pbmc_5k_possorted_bam.bam", "rb")
# open the number of bam files as the same number of clusters, and map the out file handler to the cluster id, write to a bam with wb
fouts_dict = {}
for cluster in clusters:
fout = pysam.AlignmentFile("cluster" + cluster + ".bam", "wb", template = fin)
fouts_dict[cluster] = fout
for read in fin:
tags = read.tags
# the 8th item is the CB tag
CB_list = [ x for x in tags if x[0] == "CB"]
if CB_list:
cell_barcode = CB_list[0][1]
# the bam files may contain reads not in the final clustered barcodes
# will be None if the barcode is not in the clusters.csv file
cluster_id = cluster_dict.get(cell_barcode)
if cluster_id:
fouts_dict[cluster_id].write(read)
## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
fout.close()
There are a little more reads pulled out by pysam
.
download the bam file and the clusters.csv files from http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_analysis.tar.gz
hold on...I just noticed I was using different bam files to begin with, one is the one downloaded and the other one is from fastq that I processed with cellranger. Let me test on the same bam...
I won't have time to debug this. The code looks fine to me. If you have some difference, I would start by finding a single read that's in pysam output but not hts-nim (or vice-versa) and figuring out why.
sure. totally understand.
were you able to resolve this?
was distracted...working on it now. will report back.
update, not sure why pysam
pull out some read records without the CB tag.
## by pysam
samtools flagstat cluster12.bam
16771134 + 0 in total (QC-passed reads + QC-failed reads)
140 + 0 secondary
0 + 0 supplementary
9381485 + 0 duplicates
16637530 + 0 mapped (99.20% : N/A)
16770994 + 0 paired in sequencing
8385399 + 0 read1
8385595 + 0 read2
16508476 + 0 properly paired (98.43% : N/A)
16556331 + 0 with itself and mate mapped
81059 + 0 singletons (0.48% : N/A)
24024 + 0 with mate mapped to a different chr
12056 + 0 with mate mapped to a different chr (mapQ>=5)
# by hts-nim
samtools flagstat out-cluster-12.bam
16407030 + 0 in total (QC-passed reads + QC-failed reads)
136 + 0 secondary
0 + 0 supplementary
9162203 + 0 duplicates
16309886 + 0 mapped (99.41% : N/A)
16406894 + 0 paired in sequencing
8203447 + 0 read1
8203447 + 0 read2
16201216 + 0 properly paired (98.75% : N/A)
16246368 + 0 with itself and mate mapped
63382 + 0 singletons (0.39% : N/A)
22746 + 0 with mate mapped to a different chr
11304 + 0 with mate mapped to a different chr (mapQ>=5)
There are more reads pulled out by pysam
than hts-nim
samtools view cluster12.bam | cut -f1 | sort -u > cluster12_readIDs.txt
samtools view out-cluster-12.bam | cut -f1 | sort -u > out-cluster-12_readIDs.txt
comm -23 cluster12_readIDs.txt out-cluster-12_readIDs.txt > cluster12_specific_readIDs.txt
I then checked one of the reads that pysam specific. There is no CB tag, how come it was pulled out by my pysam code, to begin with...?
samtools view cluster12.bam | grep -w A00519:57:HG3JCDMXX:1:1101:10375:20400
A00519:57:HG3JCDMXX:1:1101:10375:20400 1187 chr8 21769249 60 49M = 21769275 76 ACCCGGAAGCGAGAAGGAGAGGCTGATAACGGGGCTGGCCCAGGCTGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F NM:i:0 MD:Z:49 MC:Z:50M AS:i:49 XS:i:20 CR:Z:GGCAATTTGCTGGAAT CY:Z:,F,,,,,F,,F,F:,, BC:Z:CTATACGC QT:Z:FFFFFFFF GP:i:1414564938 MP:i:1414565014 MQ:i:60 RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:1
that is odd. The pysam code looks sane to me.
I can not figure out... this bothers me! It has to do with the paired-end reads?
samtools view atac_v1_pbmc_5k_possorted_bam.bam | grep -w "A00519:57:HG3JCDMXX:2:1318:26684:24220"
A00519:57:HG3JCDMXX:2:1318:26684:24220 163 chr1 30687631 60 49M = 30687748 167 TAGATGGGATACTTCATATCATCCCCCTGCCCTTCCATTGCATTAAGAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F NM:i:0 MD:Z:49 MC:Z:50M AS:i:49
XS:i:19 CR:Z:ATGGCTAGGTGTCGAG CY:Z::FFF::,FFFF,:F,, BC:Z:AGGGCGTT QT:Z::FFFFF:F GP:i:30687630 MP:i:30687797 MQ:i:60
RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
A00519:57:HG3JCDMXX:2:1318:26684:24220 83 chr1 30687748 60 50M = 30687631 -167 TCTTCCAATCCATCATCCTACTTTCAGGCTGGGAAGAGCGTGTTCTGATT FFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:50 MC:Z:49M AS:i:50
XS:i:20 CR:Z:ATGGCTAGGTGTCGAG CY:Z::FFF::,FFFF,:F,, BC:Z:AGGGCGTT QT:Z::FFFFF:F GP:i:30687797 MP:i:30687630 MQ:i:60
RG:Z:atac_v1_pbmc_5k:MissingLibrary:1:HG3JCDMXX:2
There are two reads with the same id from the original bam. only one of them appeared in the split bam by pysam.
it's a bug in your pysam code. you have:
CB_list = [ x for x in tags if x[0] == "CB"]
if CB_list:
cell_barcode = CB_list[0][1]
so if CB_list is empty, then cell_barcode is left as the value of the last seen CB tag. you need an else that does continue
thanks a billion! This saves me. will update the code.
just confirm that now pysam and hts-nim produce identical results after updating the code. thanks!
subset-bam from 10X seems to be fit for the job. Did not try yet.
Hi Brent,
I want to split a 10x single cell ATACseq bam files by group of cells. each cell has a barcode, and I have txt files containing the barcodes for each group. The barcode is in the
tag CB:Z:AAACTGCAGAGCAGCT-1
see here what I have tried https://hackmd.io/X2K8TpKeRJyb2ibK8PLzew?viewwhat's the best way to do it with bam-filter?
Thanks! Tommy