BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

BioArrays - Containing multiple BioSequences #282

Closed TransGirlCodes closed 8 years ago

TransGirlCodes commented 8 years ago

Currently in Bio.jl, if you want to do something with multiple sequences you have to either iterate over a vector of sequences and do each op on each individual sequence. Or worse if you are like me and often want to do operations site by site, where the same site in every sequence is examined in turn, then you have to employ double loops or smarter iterator use.

I'd like to propose a new type or a modification of BioSequence, that allows storing of multiple biological sequences as a matrix. i.e Think of BioSequence is an alias for BioArray{A,1}, where BioArray{A,2} allows you to store multiple sequences in a contiguous block of memory in site-major, or sequence major order (maybe a type parameter can denote site-major and sequence major order).

@bicycle1885 @kdmurray91 What are your thoughts? Personally I think this would be a huge benefit to the kind of algorithms and operations that are common in my kind of field, and popular genotype file formats like PED and BED (Binary PED, not to be confused with the other BED format) would require some kind of type like this anyway in the future, it would be amazing if we could massage BioSequence to do this so as it just works for sequences, BED and anything else that ever wants a set of symbols in either site major or sequence major order.

On the face of it I can imagine it working, it would require the data field of BioSequence to be an Array{Uint64,N} instead of Vector{UInt64} and then some modifications to the getting and setting related code. Obviously there will be other things to resolve.

bicycle1885 commented 8 years ago

Using the usual matrix type would be sufficient in most cases. Are there any benefits defining such type?

TransGirlCodes commented 8 years ago

You wouldn't get the compression and fast bit operations with a normal matrix. The benefits lie in having multiple sequences stored in the same chunk of contiguous memory, you could take advantage of BioSequence bitwise operations, not only across one sequence, but also across sites.

One concrete example where this would assist people interested in population genetics and genetic variance, would be the genetic distance and mismatch counting algorithms. Currently the distance(::Vector{BioSequence}) method executes distance(::BioSequence, ::BioSequence) repeatedly for every pair of sequences, and this itself contains loops. So you're looping about quite a lot and different sequences are located in different areas of memory. If there were a 2d type that was site-major in its order, the counting of mutations and computation of distance can be done in one method a-la the dna.dist C code in ape (https://github.com/cran/ape/blob/master/src/dist_dna.c). Having multiple sequences in a site-major matrix format allows you to iterate site by site in the same major order as julia's matrices, which should improve efficiency and minimise cache misses IIRC.

bicycle1885 commented 8 years ago

BioSequence has inevitable overhead when accessing elements because it needs a few extra memory access, bit shift, and bit-wise AND operations. In terms of performance, BioSequence is not the best way to implement extremely performance-sensitive algorithms. For example, pairalign is 3-4 times slower when we use BioSequence instead of Vector. The main reason of using BioSequence is that it saves memory footprint and we can safely override behaviors for biological sequences (for example, creating a subsequence is copy-free, which is different from the Vector type). If you're willing to use sequence sets for internal computing for some algorithms, it would be the fastest approach to allocate a temporary matrix and do things on it. I want to see at least one example that benefits from the encoded sequence set rather the usual matrix before deciding to introduce a new type.

TransGirlCodes commented 8 years ago

Ok You've convinced me, I shall create a method which takes Vector::{BioSequence}, and creates a corresponding site major matrix, if you believe that is the best way to increase performance of my distance and mutation count algorithms.