csw / bioruby-maf

MAF parser for BioRuby
MIT License
11 stars 6 forks source link

bio-maf

Build Status

This is a plugin for BioRuby adding support for the Multiple Alignment Format (MAF), used in bioinformatics to store whole-genome sets of multiple sequence alignments.

This library provides indexed and sequential access to MAF data, as well as performing various manipulations on it and writing modified MAF files.

For more information, see the project wiki.

Developer documentation generated with YARD is available at rubydoc.info.

This is being developed by Clayton Wheeler as part of the Google Summer of Code 2012, under the auspices of the Open Bioinformatics Foundation. The development blog may be of interest.

Dependencies

Kyoto Cabinet is a database library, required for building MAF indexes. Install the core library in the appropriate way for your platform, as documented here.

If you're using MRI, the kyotocabinet-ruby gem will be used to interact with Kyoto Cabinet. For best performance, however, you should really consider using JRuby. On JRuby, the kyotocabinet-java gem will be used instead; this builds a Java library using JNI to call into Kyoto Cabinet. Please file a bug report if you encounter problems building or using this gem, which is still fairly new.

Installation

bio-maf is now published as a Ruby gem.

$ gem install bio-maf

Performance

This parser performs best under JRuby, particularly with Java

  1. See the Performance wiki page for more information. For best results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:

    $ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'

Many parsing modes are multithreaded. Under JRuby, it will default to using one parser thread per available core, but if desired this can be configured with the :threads parser option.

Ruby 1.9.3 is fully supported, but does not perform as well, especially since its concurrency features are not useful for this workload.

Usage

Create an index on a MAF file

Much of the functionality of this library relies on an index. You can create one with maf_index(1), like so:

$ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct

To index all sequences for searching, not just the reference sequence:

$ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct

To build an index programmatically:

require 'bio-maf'
parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf")
idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)

Compress and index a MAF file

This library fully supports BGZF-compressed MAF files, which combine gzip compression with blocking for efficient random access. These can be generated with blocking optimized for MAF access using the included maf_bgzip(1) tool. This writes BGZF-compressed MAF files and optionally indexes them as well:

$ maf_bgzip --dir /tmp --index --all test/data/mm8.chrM.maf

This is the easiest way to prepare MAF files for use with this library.

Extract blocks from an indexed MAF file, by genomic interval

Refer to mm8_chr7_tiny.maf.

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
access.find(q) do |block|
  ref_seq = block.sequences[0]
  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
end

# => Matched block at 80082592, 121 bases
# => Matched block at 80082713, 54 bases

Or, equivalently, one can work with a specific MAF file and index directly:

require 'bio-maf'
parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct')

q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
idx.find(q, parser).each do |block|
  ref_seq = block.sequences[0]
  puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases"
end

# => Matched block at 80082592, 121 bases
# => Matched block at 80082713, 54 bases

This can be done with maf_extract(1) as well:

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766

Extract alignment blocks truncated to a given interval

Given a genomic interval of interest, one can also extract only the subsets of blocks that intersect with that interval, using the #slice method like so:

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380)
blocks = access.slice(int).to_a
puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs."
# => Got 2 blocks, first 18 base pairs.

Or, with maf_extract(1):

$ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766

Filter species returned in alignment blocks

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')

access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) }
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
blocks = access.find(q)
block = blocks.first
puts "Block has #{block.sequences.size} sequences."

# => Block has 3 sequences.

With maf_extract(1):

$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2

Extract blocks matching certain conditions

See also the Cucumber feature and step definitions for this.

Match only blocks with all specified species

access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)]
access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) }
n_blocks = access.find(q).count
# => 1

With maf_extract(1):

$ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 --with-all-species panTro2,loxAfr1

Match only blocks with a certain number of sequences

access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)]
access.block_filter = { :at_least_n_sequences => 6 }
n_blocks = access.find(q).count
# => 1

With maf_extract(1):

$ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 --min-sequences 6

Match only blocks within a text size range

access = Bio::MAF::Access.maf_dir('test/data')
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)]
access.block_filter = { :min_size => 72, :max_size => 160 }
n_blocks = access.find(q).count
# => 3

With maf_extract(1):

$ maf_extract -d test/data --interval mm8.chr7:0-80100000 --min-text-size 72 --max-text-size 160

Process each block in a MAF file

require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf')
puts "MAF version: #{p.header.version}"
# => MAF version: 1

p.each_block do |block|
  block.sequences.each do |seq|
    do_something(seq)
  end
end

Parse empty ('e') lines

Refer to chr22_ieq.maf.

require 'bio-maf'
p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
                         :parse_empty => false)
block = p.parse_block
block.sequences.size
# => 3

p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf',
                         :parse_empty => true)
block = p.parse_block
block.sequences.size
# => 4
block.sequences.find { |s| s.empty? }
# => #<Bio::MAF::EmptySequence:0x007fe1f39882d0 
#      @source="turTru1.scaffold_109008", @start=25049,
#      @size=1601, @strand=:+, @src_size=50103, @text=nil,
#      @status="I"> 

Such options can also be set on a Bio::MAF::Access object:

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:parse_empty] = true

Remove gaps from parsed blocks

After filtering out species with Parser#sequence_filter, gaps may be left where there was an insertion present only in sequences that were filtered out. Such gaps can be removed by setting the :remove_gaps parser option:

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:remove_gaps] = true

Join blocks after filtering together

Similarly, filtering out species may remove a species which had caused two adjacent alignment blocks to be split. By enabling the :join_blocks parser option, such blocks can be joined together:

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:join_blocks] = true

See the Cucumber feature for more details.

Extract bio-alignment representations of blocks

When the :as_bio_alignment parser option is given, blocks will be returned as Bio::BioAlignment::Alignment objects as used in the bio-alignment Biogem. This offers a great deal of built-in functionality for column-wise operations, alignment manipulation, and more.

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
access.parse_options[:as_bio_alignment] = true
q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)]
access.find(q) do |aln|
  col = aln.columns[3]
  puts "bases in column 3: #{col}"
end

Tile blocks together over an interval

Extracts alignment blocks overlapping the given genomic interval and constructs a single alignment block covering the entire interval for the specified species. Optionally, any gaps in coverage of the MAF file's reference sequence can be filled in from a FASTA sequence file. See the Cucumber feature for examples of output, and also the maf_tile(1) man page.

require 'bio-maf'
access = Bio::MAF::Access.maf_dir('test/data')
interval = Bio::GenomicInterval.zero_based('mm8.chr7',
                                           80082334,
                                           80082468)
access.tile(interval) do |tiler|
  # reference is optional
  tiler.reference = 'reference.fa.gz'
  tiler.species = %w(mm8 rn4 hg18)
  # species_map is optional
  tiler.species_map = {
    'mm8' => 'mouse',
    'rn4' => 'rat',
    'hg18' => 'human'
  }
  tiler.write_fasta($stdout)
end

Compression

MAF files can optionally be compressed in the BGZF format defined in the SAM specification. This is best done with maf_bgzip(1), but files compressed with the bgzip(1) tool from samtools will also work, though less efficiently.

MAF files compressed with plain gzip will be decompressed on the fly, but random access to these files will not be possible. However, gzipped MAF files are suitable as input to maf_bgzip(1).

Command line tools

Man pages for command line tools:

With gem-man installed, these can be read with:

$ gem man bio-maf

Other documentation

Also see the API documentation. For more code examples see the RSpec and Cucumber test files in the source tree.

Also, the scripts in the bin directory provide good worked examples of how to use the existing parsing API.

Project home page

For information on the source tree, documentation, examples, issues and how to contribute, see

http://github.com/csw/bioruby-maf

The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.

Cite

If you use this software, please cite one of

Biogems.info

This Biogem is published at biogems.info.

Copyright

Copyright (c) 2012 Clayton Wheeler. See LICENSE.txt for further details.