BSMAP 1.0
BSMAP is a short reads mapping program for bisulfite sequencing in DNA methylation study. Bisulfite treatment coupled with next generation sequencing could estimate the methylation ratio of every single Cytosine location in the genome by mapping high throughput bisulfite reads to the reference sequences.
Bisulfite mapping is different from usual sequence mapping in two aspects: 1) The additional C/T mapping is asymmetric, a T in the read could be aligned to C in the reference but not vice versa. 2) The Watson and Crick strand are not complimentary after bisulfite treatment. Each read need to be compared with 4 reference sequences, namely BSW(bisulfite Watson), BSWC(reverse complimentary of BSW), BSC(bisulfite Crick) and BSCC(reverse complimentary of BS).
BSMAP is designed to be a general-purpose mapping program to handle these special characteristics of bisulfite mapping. It is based on the open source program SOAP (Short Oligo Alignment Program).
Main features: read length up to 144nt, allow up to 15 mismatches support pair end mapping, support parallel mapping, support SAM format inout/output support both whole genome (WGBS) and reduced representation bisulfite sequencing (RRBS) support trimming adapters and low quality sequences from 3'end of reads allow different running modes with flexible speed/memory/sensitivity to run on different hardware configurations include script to extract methylation ratios
BSMAP is under GNU Public License (GPL).
BSMAP is designed for linux64 platform. First unpackage the source code: $ tar zxfv bamsp-2.4.tgz
Make executable binary: $ make
Install the binary into system default path: (optional) $ make install
The following parameters could be modified in the makefile to achieve better mapping_speed/memory_usage under different situations:
OLIGOLEN: max read length, options: -DREAD_48, -DREAD_80, -DREAD_144(default) smaller max readlen will be faster.
bsmap
option:
-a
4.1 BSP format, includes the following tab delimited fields: id, seq, qual map_flag, ref, ref_loc, strand, ins_size, refseq, #mismatches, mismatches_info
1) id: read ID
2) seq: mapped read sequence
3) map_flag:
UM: unique map (unique pair for paired mapping).
MA: multiple map (multiple pair for paired mapping)
OF: over map (#multiple map exceeds MAXHITS)
NM: no map
QC: low quality reads
4) ref: reference sequence name
5) ref_loc: mapping location(1 based, 5'-end coordinates of the mapping region on the Watson strand of reference)
6) strand:
++: forward strand of Watson of reference (BSW)
+-: reverse strand of Watson of reference (BSWC)
-+: forward strand of Crick of reference (BSC)
--: reverse strand of Crick of reference (BSCC)
7) ins_size: insertion size for pair-end mapping, measured by the total nucleotide of the pair-end segment.
(edge to edge size). positive or negative value means mate's coordinate is larger or smaller respectively.
0 means single-end or unpaired mapping.
8) refseq: Waston reference sequence at the mapping location.
9) #mismatches: number of mismatches of current hit
10) mismatch_info: #hits of 0 mismatch to #hits of max_mismatches, separated by ':'
4.2 SAM format
FLAG field:
UM: 0x0
MA: 0x100
OF: 0x100
NM: 0x4
QC: 0x204
for mapping on BSC or BSCC: FLAG=FLAG+0x10
for pair-end mapping:
FLAG=FLAG+0x1
if it's the first read in pair, FLAG=FLAG+0x40
if it's the second read in pair, FLAG=FLAG+0x80
if mappings are paired, FLAG=FLAG+0x2
if mate is unmapped, FLAG=FLAG+0x8
if mate is mapped on BSC or BSCC, FLAG=FLAG+0x20
aux field:
ZS:Z:
ZP:i:
for more details, please refer to SAM format specification:
http://samtools.sourceforge.net/SAM1.pdf
Note: all read sequences are recorded as the corresponding sequence following the reference Watson strand direction.
The longer seed size(option -s), the faster speed. With seed size increase every bp, mapping time reduces by ~1.5-fold. On the other hand, the max number of mismatches that could be detected with 100% sensitivity is bounded by the seed_size.
max_mismatches_with_100%_sensitivity = (read_len+1-index_interval) / seed_size - 1
If the -v option set max mismatches larger than this number, those mappings with larger max mismatches may not be guaranteed to be detected.
In case full sensitivity can not be achieved within feasible time, user will need to make a decision on the trade off between the speed and sensitivity by setting the optimal seed size.
6 Example
for shot gun whole genome BS:
single_end: (-A means trimming adapter from 3'end)
$ bsmap -a SE_read.bam -d ~/ref/hg19/hg19.fa -o out.bam -p 8 -w 100 -v 5 -A AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
single_end: (map to all 4 possible strands)
$bsmap -a SE_read.bam -d ~/ref/hg19/hg19.fa -o out.bam -n 1 -w 100 -v 5
pair_end: (set -b option)
$ bsmap -a read1.fq -b read2.fq -d ~/ref/hg19/hg19.fa -o out_pair.bsp -2 out_upair.bsp -p 8 -w 100 -v 5
$ bsmap -a PE_reads.bam -b PE_reads.bam -d ~/ref/hg19/hg19.fa -o out.bam -p 8 -w 100 -v 5 -A AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA
using less memory: (set -I option)
$ bsmap -a SE_read.bam -d ~/ref/hg19/hg19.fa -o out.bam -p 8 -v 5 -A AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGA -I 8
mapping from read pair #10001 to read pair #20000 in the input file: (set -B and -E option)
$ bsmap -a PE_reads.bam -b PE_reads.bam -d ~/ref/hg19/hg19.fa -o out.bam -w 100 -v 5 -B 10001 -E 20000
using Illumina quality: (set -z option)
$ bsmap -a PE_read1.fq -b PE_read2.fq -d ~/ref/hg19/hg19.fa -o out.bam -z 64
report only uniquely mapped reads: (set -r option)
$ bsmap -a reads.fastq -d hg19.fa -o out.bsp -r 0
trimming low quality 3'end: (set -q option)
$ bsmap -a PE_reads.bam -b PE_reads.bam -d ~/ref/hg19/hg19.fa -o out.bam -q 2
detect A=>G editing in RNA_seq instead of C=>T conversion in bisulfite sequencing
$ bsmap -a reads.bam -d RNA_ref.fa -M GA -o out.bam
for RRBS: (set -D option to specify digestion site information and activate RRBS mode.)
bsmap -a PE_reads.bam -b PE_reads.bam -d ~/ref/hg19/hg19.fa -o out.bam -p 8 -w 100 -s 12 -v 5 -D C-CGG
7.1 methratio.py
python script to extract methylation ratios from BSMAP mapping results. Require python 2.X.
For human genome, methratio.py needs ~26GB memory.
For systems with limited memory, user can set the -c/--chr option to process specified chromosomes only,
and combine results for all chromosomes afterwards.
Usage: python methratio.py [options] BSMAP_MAPPING_FILES
BSMAP_MAPPING_FILES could be one or more output files from BSMAP. The format will be determined by the filename suffix. (SAM format for .sam and .bam, BSP format for other filenames.)
Options: -h, --help show this help message and exit -o FILE, --out=FILE output file name. (required) -d FILE, --ref=FILE reference genome fasta file. (required) -c CHR, --chr=CHR process only specified chromosomes. [default: all] example: --chr=chr1,chr2 (this uses ~4.5GB compared with ~26GB for the whole genome) -s PATH, --sam-path=PATH path to samtools. [default: none] -u, --unique process only unique mappings/pairs. -p, --pair process only properly paired mappings. -z, --zero-meth report loci with zero methylation ratios. -q, --quiet don't print progress on stderr. -r, --remove-duplicate remove duplicated mappings to reduce PCR bias. (This option should not be used on RRBS data. For WGBS, sometimes it's hard to tell if duplicates are caused by PCR due to high seqeuncing depth.) -t N, --trim-fillin=N trim N fill-in nucleotides in DNA fragment end-repairing. [default:2] (This option is only for pair-end mapping. For RRBS, N could be detetmined by the distance between cuttings sites on forward and reverse strands. For WGBS, N is usually between 0~3.) -g, --combine-CpG combine CpG methylaion ratio from both strands. [default: False] -m FOLD, --min-depth=FOLD report loci with sequencing depth>=FOLD. [default: 1]
Output format: tab delimited txt file with the following columns: 1) chromorome 2) coordinate (1-based) 3) strand 4) sequence context (2nt upstream to 2nt downstream in Watson strand direction) 5) methylation ratio 6) number of converted and unconverted Cs covering this locus 7) number of unconverted Cs covering this locus 8) lower bound of 95% confidence interval of methylation ratio 9) upper bound of 95% confidence interval of methylation ratio
The confidence interval is based on Wilson score interval for binomial proportion.
Example: python methratio.py --chr=chr1,chr2 --ref=hg19.fa --out=methratio.txt rrbsmap_sample*.sam python methratio.py -d mm9.fa -o meth.txt -p bsmap_sample1.bsp bsmap_sample2.sam bsmap_sample3.bam python methratio.py -s /home/tools/samtools -t 1 -d arab.fa -o meth.txt bsmap_sample.sam
Note: For overlapping paired hits, nucleotides in the overlapped part should be counted only once instead of twice. methratio.py can correctly handle such cases for SAM format output, but for BSP format it will still be counted twice, because the BSP format does not contain mapping information of the mate.
7.2 sam2bam.sh
Shell script to convert SAM format to sorted and indexed BAM format.
The input SAM file will be deleted if the conversion is successful.
This script is automatically called by BSMAP if the output file has .bam suffix, which requires samtools and sam2bam.sh installed in system default path.
It can also be used manually.
Usage: ./sam2bam.sh INPUT_SAM_FILE
Example: ./sam2bam.sh sample1.sam This will generate sorted BAM file sample1.bam and index file sample1.bam.bai.
7.3 bsp2sam.py Python script to convert .BSP format output to SAM format output.
Usage: bsp2sam.py [options] BSP_FILE
Options: -h, --help show this help message and exit -o FILE, --out=FILE output file name. (required) -d FILE, --ref=FILE reference genome fasta file. (required) -q, --quiet don't print progress on stderr.
Example: python bsp2sam.py -d hg19.fa -o sample.sam sample.bsp
Note: For pair-end BSP output, this script will conver it into single-end SAM output, i.e. the mapping information remains, but the pairing information lost.
Citation Yuanxin Xi and Wei Li, "BSMAP: whole genome bisulfite sequence MAPping program" (2009) BMC Bioinformatics 2009, 10:232
Contact Yuanxin Xi Bioinformatics Division, Dan L. Duncan Cancer Center, Baylor College of Medicince, Houston, TX 77030, USA 713-798-6254 yxi@bcm.tmc.edu, xiyuanxin@gmail.com