dpryan79 / bison

Bisulfite alignment on nodes of a cluster
Other
9 stars 2 forks source link

Bison: bisulfite alignment on nodes of a cluster.

N.B.: There is now a tutorial available here. This tutorial largely replaces this README file and users are encouraged to read it.

If you use Bison in your work please site the following: Ryan D.P. and Ehninger D. Bison: bisulfite alignment on nodes of a cluster. BMC Bioinformatics 2014, Oct 18;15(1):337

Usage

One can index all fasta files (files with extension .fa or .fasta) in a directory as follows:

bison_index [OPTIONS] directory/

Options that are not specific to bison are simply passed to bowtie2, which must be in your PATH. The output is placed under directory/bisulfite_genome.

Alignment can be performed as follows (bison_herd is the same):

mpiexec bison [OPTIONS] -g directory/ {-1 fastq_1.gz -2 fastq_2.gz | -U fastq.fq}

"directory" is identical to that used for indexing. For further details type "bison -h". For non-directional libraries, "mpiexec -N 5" should be used, otherwise "mpiexec -N 3". Resource managers, such as slurm, should work in an equivalent manner. All options not explicitly mentioned by typing "bison -h" are passed to bowtie2. Consequently, using the --very-sensitive or --dovetail options will work as expected. Bison already passes the following flags to bowtie2:

-q --reorder --no-mixed --no-discordant

bison_herd is equivalent, except that you can specify more nodes. You may also input multiple files (comma-separated, no spaces) to align, in which case alignments will be printed to multiples files. Furthermore, you may use wild-cards in your file list. For example:

mpiexec -N 17 bison_herd -o Alignments -g directory/ -1 exp1/sample*_1.fq.gz,/some/other/path/foo*_1.fq.gz -2 exp1/sample*_2.fq.gz,/some/other/path/foo*_2.fq.gz

Make sure to not have multiple input files with the same name (e.g., sample*/read1.fastq), as they will all be written to the same file (overwriting any subsequent alignments)!

In non-targeted sequencing experiments, it is usually wise to mark likely PCR duplicates. These are then ignored by the methylation extractor so as to not increase the error rate of any particular position. bison_markduplicates and able to process a BAM file generated by bison/bison_herd and produce an identical BAM file with the 0x400 bit set in the FLAG field for reads that are likely duplicates. This step is not required and should be avoided if you are performing RRBS or other targeted sequencing, as the false-positive rate of will then be too high.

There is also a methylation extractor that produces a bedGraph file, called bison_methylation_extractor. Note, coordinate-sorted BAM files should not be used! The methylation extractor can be told to ignore certain parts of each read. This is particularly useful in cases where there is methylation bias across the length of reads (i.e., if one plots the average methylation percentage summed per position over all reads, the value goes up/down toward the 5' or 3' end). It is recommended to always run bison_mbias (with the -pdf option if you have R and ggplot2 installed) to generate the required information for constructing an M-bias plot. The bison_mbias2pdf script can convert this to a PDF file (or a series of PNG files) and will also suggest what, if any, regions should be ignored. These regions are strand and read number (in the case of paired-end reads) dependent. While the suggested regions are often good, the should not be blindly accepted (just look at the graph and use your best judgement).

See the "Auxiliary files" section, below, for additional files.

Auxiliary files

The following programs and scripts will be available if you type "make auxiliary":

bedGraph2BSseq.py

This python script can accept a filename prefix and the names of at least 2 bedGraph files and output 3 files for input into BSseq. A single chromosome can be processed at a time, if desired, by using the -chr option. The output files will be named $prefix.M, $prefix.Cov, and $prefix.gr. $prefix.M is a matrix with a header line that lists the number of reads supporting methylation at each site in the bedGraph files. If there is no coverage in a given sample, the value is set to 0. $prefix.Cov is the analogous file listing coverage in each sample (again, 0 denotes no coverage). $prefix.gr lists the coordinates for each line in the .Cov and .M files. Loading these files into R would be performed as follows (in this example "Chr17" was the prefix):

M <- as.matrix(read.delim("Chr17.M", header=T))
Cov <- as.matrix(read.delim("Chr17.Cov", header=T))
bed <- read.delim("Chr17.bed", header=F)
#Remember that BED and bedGraph files are 0-based!
gr <- GRanges(seqnames=Rle(bed$V1),ranges=IRanges(start=bed$V2+1, end=bed$V3), strand=Rle("*", nrow(bed)))
groups <- data.frame(row.names=colnames(M),
    var1 <- c(1,1,1,1,2,2,2,2)) #A very simple experiment with 2 groups of 4 samples
BS1 <- BSseq(M=M, Cov=Cov, gr=gr, pData=groups, sampleNames=colnames(M)) #You'll want to set some of the additional options!

bedGraph2methylKit

As above, but each bedGraph file is converted to a .methylKit file. The bedGraphs should be of CpGs and not have had the strands merged (i.e., don't run the merge_CpGs command below).

bedGraph2MOABS

Like bedGraph2methylKit, but each bedGraph file is converted to a .moabs file. The bedGraph files should ideally contain single-C metrics rather than having been merged to form CpG metrics, though both are supported. The resulting .moabs files can then be used by mcomp in the MOABS package.

bedGraph2MethylSeekR

As above, but each bedGraph file is converted into a .MethylSeekR file. The bedGraphs MUST be merged before-hand with bison_merge_CpGs to create per-CpG metrics, as this is what MethylSeekR is expecting. Input is performed with the readMethylome() function. Chromosome lengths can be computed with: samtools faidx genome.fa where genome.fa is a multifasta file containing all of the chromosomes. The resulting .fai file is simply a text file and can be loaded into R with:

fai <- read.delim("genome.fa.fai", header=F)
chromosome_lengths <- fai$V2
names(chromosome_lengths) <- fai$V1
d <- readMethylome("file.MethylSeekR", chromosome_lengths)

make_reduced_genome

Create a reduced representation genome appropriate for reads of a given size ($size, default is 36bp). MspI and TaqI libraries are supported. Nucleotides greater than $size+10% are converted to N.

merge_bedGraphs.py

This will merge bedGraphs from technical replicates of a single sample into a single bedGraph file, summing the methylation metrics as it goes. The output, like the input is coordinate sorted.

bison_merge_CpGs

Methylation is usually symmetric at CpG sites. While the output bedGraph files have a single-C resolution, this will convert that to single-CpG resolution by summing Cs in the same CpG from opposite strands. This saves space and will often speed up downstream statistics.

Importing into other analysis packages

While there are helper scripts, mentioned above, for a number of packages, other packages either do not require a helper script or can use one of the aforementioned scripts. Import instructions for such packages are mentioned below.

BiSeq

BiSeq requires input in an identical format as BSseq. Consequently, just use the bedGraph2BSseq.py helper script. The following example commands should then suffice to load everything into R:

exptData <- SimpleList(Sequencer="Some sequencer", Year="2014") #This is just descriptive information
M <- as.matrix(read.delim("chr17.M", header=T))
Cov <- as.matrix(read.delim("chr17.Cov", header=T))
bed <- read.delim("chr17.bed", header=F)
gr <- GRanges(seqnames=Rle(bed$V1),ranges=IRanges(start=bed$V2+1, end=bed$V3), strand=Rle("*", nrow(bed)))
groups <- DataFrame(row.names=colnames(M),
    group = c(1,1,1,1,2,2,2,2)) #A very simple experiment with 2 groups of 4 samples
d <- BSraw(exptData=exptData, rowData=gr, colData=groups, totalReads=Cov, methReads=M)

BEAT

The BEAT Bioconductor package conveniently expects per-sample position and methylation information in a format already present in bedGraph files. However, this information is in a slightly different format than bedGraph, so the following awk script can be used. Note that BEAT expects files named as sample_name.positions.csv.

awk '{if(NR>1){printf("%s,%i,%i,%i\n",$1,$2+1,$5,$6)}else{printf("chr,pos,meth,unmeth\n")}}' sample.bedGraph > sample.positions.csv

Advanced bison_herd usage

bison_herd has the ability to use a semi-arbitrary number of nodes. In practice, if bison is given N nodes, it will effectively use 2*((N-1)/2)+1 or 4*((N-1)/4)+1 nodes, for directional and non-directional libraries, respectively. As an example, if you allot 20 nodes for a directional library, bison_herd will only use 19 of them (17 for non-directional reads). The excess nodes will exit properly and, unless you specify --quiet, produce an error message.

The options -mp, -queue-size, and -@ are bison_herd-specific and deserve further description.

-mp sets the number of threads that the master node will use to process alignments produced by the worker nodes. Worker nodes are grouped into twos or fours, where each group has the a number of nodes equal to the number of possible bisulfite converted strands. As the number of allocated nodes increases, a point is eventually reached where a single thread on the master node is unable to keep up with the workers. In my experience, for directional libraries, one thread can handle approximately 130 bowtie2 threads (i.e., if using -p 11, -mp should be increased once ~12 worker nodes are allocated, since that would equate to 132 threads in use by bowtie2). One should keep in mind that there are already at least 3 other threads concurrently running on the master node (sending and storing fastq reads, receiving alignments, and writing alignments). Consequently, there is a practical limit to the number of nodes is determined by how many cores are available on each node.

-queue-size determines the maximum difference between reads sent for alignment and reads processed. This option is unavailable if bison_herd was compiled with -DNOTHROTTLE. By default, the thread that sends reads for alignment will pause if it has sent more than ~1 million reads than have been processed. The purpose of this is to prevent overwhelming of the MPI unexpected message buffer, since the thread on the master node that sends reads can generally process reads faster than all of the worker nodes combined can align them. Setting this value too high may result in bison_herd crashing with otherwise cryptic messages involving MPI_Send. In such cases, decreasing the value used by -queue-size should resolve the problem. On the other hand, setting this value too low can result in a deadlocks, due to buffering at various levels. The default value hasn't resulted in deadlocking or crashes on our cluster, but yours may be different! This difference is checked every 100000 reads, which can changed by editting the THROTTLE_CHECK_INTERVAL value in bison.h prior to compilation.

-@ specifies the number of compression threads used for writing the output BAM file. In practice, a single compression thread can write ~80 million paired-end reads per hour (depending on CPU speed). I routinely use -@ 4 when using more than ~9 nodes as this allows writing to occur as quickly as reads are processed. To determine if the number of compression threads should be increased, not the time difference (especially early on) between when each master processor thread has processed 100000 reads and when those reads have been written to a file. Even when --reorder is used, if there is >1 second between these, then you may benefit from increasing the number of compression threads. For those curious, this option is identical to that used in samtools.

Throttling

bison_herd generally uses blocking, but not synchronous sends. What this means in practice is that many reads will be queued by the master node for sending to the worker nodes. Likewise, many alignments can be queued by the worker nodes for sending back to the master node. The queue that many MPI implementations use for this is relatively small and immutable. While a full queue should cause MPI_Send to block until there is sufficient space, occasionally a constellation of events can occur that cause this queue to overflow and the master node to then crash. This can be alleviated by limiting the possible number of reads that could ever possibly be in the queue at any single time. As the queue is not directly pollable, the difference between the number of reads sent and written is used as a surrogate. The maximum number of reads in the wild is then either 2x or 4x this difference (since a read is queued per worker node). In reality, the queue should be emptier than this as there are normally reads buffered on the worker nodes (being fed to bowtie2, being aligned or being sent) and elsewhere on the master node (being received, waiting to be processed, being processed, waiting to be written, or being written).

Throttling is not always required, particularly as an increasing number of nodes are used. Throttling can be disabled altogether by compiling with -DNOTHROTTLE, which will remove all related components.

Debug mode

For debugging, a special debug mode is available for both bison and bison_herd by compiling with -DDEBUG. Instead of running of needing multiple nodes, both programs will then run as if they were just a single node. Compiling with this option adds the -taskid option to both programs. The taskid is equivalent to the node number in the bison (or bison_herd) hierarchy. Node 0 is the master node and performs the final file writing. For bison, nodes 1-4 are equivalent to the worker nodes that align reads to the original top, original bottom, complementary to original top and complementary to original bottom strands, respectively. For directional libraries, only the first 2 are used. These will write alignments to a file for final processing when run as taskid 0. This is useful when odd alignments are being output and the source of the error needs to be tracked down. The mode for bison_herd is similar, except there are always 8 theoretical worker nodes (i.e., taskid 1-8 need to be run prior to taskid 0). This allows testing multiple master processor threads with both directional and non-directional reads.

In general, this mode should not be used unless you are running into extremely odd bugs.

Compatibility with Bismark

Bison is generally similar to bismark, however the indexes are incompatible, due to bismark renaming contigs. Also, the two will not produce identical output, due to algorithmic differences. Running bison_methylation_extractor on the output of bismark will also produce different results, again due to algorithmic differences. In addition, bison always outputs BAM files directly.

Other details

Bison needn't be run on multiple computers. You can also use a single computer for all compute nodes (e.g. mpiexec -n 5 bison ...). The same holds true for bison_herd. Both bison and bison_herd seem to be faster than bismark, even when limited to the same resources.

Changes

0.4.0

0.3.3

0.3.2b

0.3.2

0.3.1

0.3.0

0.2.4

0.2.3

0.2.2

0.2.1

0.2.0

0.1.1

0.1.0

Initial release