[RFC] More nucleotides - Githubissues

bicycle1885 commented 8 years ago

I'd like to share and discuss ideas about DNA or RNA sequences, partially mentioned in #55.

Currently, only five nucleotides (A/C/G/T/N) are defined in this package and a DNA sequence cannot store other ambiguous nucleotides like M (A or C). This is because we use 2-bit encoding for A/C/G/T and 1-bit ambiguity mask for N. This limitation will definitely cause problems in real-world bioinformatics. Since we are creating a general purpose infrastructure for bioinformatics, this would be an intolerable limitation.

I think we should, at least, support all nucleotide base codes defined by IUPAC, which includes all possible combinations of four nucleotides (http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/iupac_nt_abbreviations.html). If we use 4-bit encoding, we can store any nucleotides listed here. These ambiguous nucleotides are, of course, allowed in the Biostrings package of Bioconductor. An apparent drawback of 4-bit encoding is that it requires more space than the current (2+1)-bit encoding: the data size of a DNA sequence will become 4/3 = 1.333 times larger. But I think this will not be a serious problem in most cases because it is very rare to handle such a large sequence that hits the RAM limit except eukaryotic reference sequences. For such long sequences, we can prepare a reference sequence type that adopts 2-bit encoding for A/C/G/T + compressed N mask, which is already implemented in my FMM.jl package (https://github.com/bicycle1885/FMM.jl/blob/master/src/genomicseq.jl).

One advantage of this 4-bit encoding is that it will improve the performance of random access on DNA sequences. This is because the current implementation stores the N mask in a separated bit vector object, and hence it has a branch (if seq.ns[i]) ... else ... end) to access an element in a sequence:

function getindex{T}(seq::NucleotideSequence{T}, i::Integer)
    if i > length(seq) || i < 1
        throw(BoundsError())
    end
    i += seq.part.start - 1
    if seq.ns[i]
        return nnucleotide(T)
    else
        return getnuc(T, seq.data, i)
    end

But using 4-bit encoding, we can store all information in a vector so that any branch is not need any more. That would lead to a significant performance improvement in intensive-sequence-access algorithms like pairwise alignment.

TransGirlCodes commented 8 years ago

I like the idea, providing we keep the 2 bit encoding too - Julia is all about method dispatch and proliferation of types, so I say I like the idea of being able to have a DNASequence that stores things as Two bits, and a DNASequence that can store things as 4 bits. We should offer choice - and sensible defaults.

Perhaps it would look something like this:

type NucleotideSequence{T<:Nucleotide, N} <: Sequence
    data::Vector{UInt64} # N-bit encoded sequence
    ns::BitVector        # 'N' mask
    part::UnitRange{Int} # interval within `data` and `ns` defining the (sub)sequence
    mutable::Bool        # true if the sequence can be safely mutated

    # true if this was constructed as a subsequence of another sequence or if
    # subsequences were constructed from this sequence. When this is true, we
    # need to copy the data to convert from immutable to mutable
    hasrelatives::Bool
end

Where we add an N to determine the number of bits used to store each nucleotide. So then you could have a getindex for DNASequence{2}, DNASequence{4} and, DNASequence{8}.

Or how about instead of using data::Vector{UInt64} - we define several lightweight immutable types that contain Vector{UInt64}. We could have something like:

immutable 2BitStore <: AbstractBitStore
    array::Vector{UInt64}
end

immutable 4BitStore <: AbstractBitStore
    array::Vector{UInt64}
end

immutable 8BitStore <: AbstractBitStore
    array::Vector{UInt64}
end

type NucleotideSequence{T<:Nucleotide, S <: AbstractBitStore} <: Sequence
    data::S # N-bit encoded sequence
    ns::BitVector        # 'N' mask
    part::UnitRange{Int} # interval within `data` and `ns` defining the (sub)sequence
    mutable::Bool        # true if the sequence can be safely mutated

    # true if this was constructed as a subsequence of another sequence or if
    # subsequences were constructed from this sequence. When this is true, we
    # need to copy the data to convert from immutable to mutable
    hasrelatives::Bool
end

Where each bit store has the methods appropriate for indexing, sub-setting, copying and so on. Or perhaps even that is too much, maybe it can be as simple as:

immutable BitStore{B, N}
    array::Vector{UInt64, N}
end

So a bit store could even be a multidimensional array that stores each element as B bits - this would make it a more general data-structure, useful for quite a lot of things, and for our DNA sequences we'd just need Array{B, 1}. We could then take advantage of method-dispatch and also generated functions where appropriate.

Well, however we do it - I'm for it!

bicycle1885 commented 8 years ago

Yes, setting the number of bits as a type parameter is technically feasible without any performance overhead. But I'm thinking that it will impose a small mental block on us; we always have to care about the bit encoding of the sequence. If a new user sees the weird type parameter they may think that it is difficult to use this package in their work. How often do we really need to keep DNA sequences as small as possible with 2-bit encoding? I think simplicity is the most important thing in this package.

TransGirlCodes commented 8 years ago

I'm just thinking in cases where say @dcjones has done operations that work on each byte doing bit operations, getting some functionality really fast, with 4 bit encoding I would naively expect the function to take say twice as long - all else being equal. If that is true then if I knew my data was unambiguous, and so on - I'd opt for the 2 bit one instead of the 4 bit. But if I knew it wasn't that nice - one genome actually springs to mind, I'd definately want the 4 bit.

I honestly don't think the extra parameterisation would be too much of a mental block - if it meant having three or more parameters then I would agree, but Arrays in julia have two parameters. If we added a parameter for the bit encoding then the nucleotide sequence would also have two parameters. In addition, type parameterisation is one of the big selling point of julia: efficiently allow for several possibilities for the type fed to a field or method, without as much code. If somebody was a Julia user, but complaining about type parameters - especially when there's only two... I'd raise an eyebrow (new and learning users excepted of course!).

timbitz commented 8 years ago

I agree that it is powerful to have nucleotide storage size as a parameter. To deal with new users, maybe just make a typealiased default the more lenient standard, so people without a specific high-performance project in mind don't get errors with unfriendly sequences.

On a separate note, I ran into some trouble serializing Kmer`s ,though I easily solved it by adding:

Base.write{T <: Bio.Seq.Kmer}(io::Base.IOStream, k::T) = Base.write(io, convert(UInt64, k))
Base.write{T <: Bio.Seq.Kmer}(io::GZip.GZipStream, k::T) = Base.write(io, convert(UInt64, k))
Base.read{T <: Bio.Seq.Kmer}(io::Base.IOStream, t::Type{T}) = convert(T, Base.read(io, UInt64))
Base.read{T <: Bio.Seq.Kmer}(io::GZip.GZipStream, t::Type{T}) = convert(T, Base.read(io, UInt64))

Any reason these shouldn't be set in the package itself or why the conversion to UInt64 is a bad solution?

btw, you guys have done some killer work here. I am just starting to make use of Bio.jl now... Cheers!

TransGirlCodes commented 8 years ago

Good idea! typaliasing could definately help people who don't program in Julia (and so don't really get the type parameter thing), but just want to get a job done.

bicycle1885 commented 8 years ago

Fair enough. Making lenient DNA sequences (i.e. 4-bit ambiguous encoding) as the default type would be reasonable, and people who knows that their sequences do not contain ambiguous nucleotides and are willing to exploit the performance gain from 2-bit encoding can use such sequence types. Anyway, we need some feedback from @dcjones.

@timbitz Defining read and write methods for Kmers would be quite useful; the reason why we haven't defined them in Bio.jl would be just because we never used them. Happy to review if you make a pull request.

blahah commented 8 years ago

I was also thinking about ambiguous nucleotides earlier when I was fiddling with the documentation. Ambiguous nucleotides except N are not very common in high-throughput data. I think the default should be 2-bit, but a warning or something alerts the user when a non-ACTGN character is seen, advising them of the 4-bit option.

TransGirlCodes commented 8 years ago

Ah, good point @Blahah I don't know how common ambiguous stuff is in high-throughput, but I can tell you from my experience in evolution, we sometimes do a bit of a hacky thing, instead of using the codes to reflect ambiguity - we might use them to represent heterozygosity - effectively encoding 2 sequences of a diploid in one, assuming the states a certain. I'm personally not a fan, you could represent it in other ways that wouldn't trip people up who don't realise the purpose of the code is het and not ambiguity.

bicycle1885 commented 8 years ago

I think the default should be 2-bit, but a warning or something alerts the user when a non-ACTGN character is seen, advising them of the 4-bit option.

2-bit is not enough to store ACGTN, 3-bit is needed. Though 3-bit is enough in most cases, package developers always have to care about exceptional cases if we accept this incomplete encoding. My point is that just yet another bit, we can store and handle any ambiguous nucleotide completely.

dcjones commented 8 years ago

Changing the representation certainly isn't off the table, but let me describe the motivation for the current encoding.

The intention was not to exclude ambiguity codes, but represent them in a separate sequence type. It's not uncommon for a algorithm to be inapplicable to sequences with ambiguity codes. Keeping separate two representations, one with ambiguity codes and one with ACTGN lets one write code towards the narrow definition without having to perform any runtime checks to determine if there is ambiguity. This could also be addressed by parameterizing sequence types on the alphabet, but I think the overall goal is important. In some other bio libraries there are either a lot functions prefixed with runtime alphabet checks, or otherwise written unsafely with no checks.

The other advantage of 2-bit encoding is efficiency of certain operations. Reduced memory memory is nice, but the primary advantage is that enumerating or indexing k-mers, reverse complement, nucleotide composition, hamming distance, etc, is all significantly faster with this encoding.

The two biggest disadvantages are increased complexity, and slower indexing of individual nucleotides. The latter issue is inherent in the encoding, but it could be improved. Because N is encoded as 0b100, we can avoid that branch by doing

    return convert(T, convert(UInt8, getnuc(T, seq.data, i)) |
                      (convert(UInt8, seq.ns[i]) << 2))

instead of branching on seq.ns[i], though I haven't benchmarked this, and there are still branches due to bounds checking.

bicycle1885 commented 8 years ago

Hm, so 4-bit encoding will break many cool algorithms implemented by @dcjones .

The intention was not to exclude ambiguity codes, but represent them in a separate sequence type.

One idea occurred to me is extending this separated sequence. If we have two 1-bit ambiguity sequences we can hold any ambiguous nucleotides:

type DNASequence
    # unambiguous 2-bit encoding for ACGT
    data::Vector{UInt64}
    # third bit and forth bit to encode ambiguous nucleotides
    third::BitVector
    fourth::BitVector

But much bolder idea is that using the SucVector type implemented in IndexableBitVectors.jl, which is partially mentioned in https://github.com/BioJulia/Bio.jl/issues/109#issuecomment-145683782.

# compressed mask of ambiguous nucleotides
immutable NMask
    blockmask::SucVector
    blocks::Vector{UInt64}
    len::Int
end

type DNASequence
    data::Vector{UInt64}
    third::NMask
    fourth::NMask
end

Here, NMask is a data structure that holds a sparse bit vector. The sparse bit vector is divided into 64-bit blocks and the blockmask field records whether there is at least one ambiguous nucleotide for each block. If there is no ambiguity, we can know that the block can be encoded as 0x0000000000000000, so we don't have to store the block itself; otherwise, we store the block in the blocks field. This data structure assumes that ambiguous nucleotides are very rare and often occur in succession. In such a case, this can efficiently compress ambiguous codes (in human genome, approximately down to 7% space). If there is not ambiguity, the size will be 1/64 * 5/4 ≈ 2%.

One important advantage is that we can quickly (O(1)) check if there is ambiguity within a specific range in a sequence. Many algorithms will run much faster if we can know that there is no ambiguity within a specific region in advance. Moreover, we can know the next location of ambiguous nucleotide in O(log n).

An apparent disadvantage is that NMask is immutable; making a sequence mutable is impossible without copy. I think we can separate the current sequence type into mutable and immutable sequence types since switching mutability of a sequence will not happen so often. In the mutable sequence type, simple 4-bit encoding may be better in terms of random access performance. Even if we can remove branching in separated ambiguity sequence, random access will incur more cache misses.

TransGirlCodes commented 8 years ago

Nice idea on the avoiding branching @dcjones. @bicycle1885 I think I like the NMask if I understand it correctly - blocks contains ONLY the blocks which have N's if I understand correctly, and the SucVector tells you which of the blocks of DNA have Ns, so IF SucVector says, this block of DNA has N's then the N's are looked up in 'blocks'. This could be cool in iteration - if the iterator knows that it has so many positions it doesen't need to ambiguity check then it could speed things up. In addition, it looks like NMask is generalisable i.e. it could be a mask to represent the presence or absence of any characteristic.

bicycle1885 commented 8 years ago

@Ward9250 You are right. In fact, I've found NMask can be thought as a indexable bit vector, so I may add this data structure to the IndexableBitVectors.jl package. I think someone already invented the same idea, and if so, I'd like to use the name.

TransGirlCodes commented 8 years ago

That's awesome, because I think this might be very useful for an upgrade for the data-structure I've been working on that stores SNPs as individual bits. Since a bit is only set to 1 if the minor allele is present (a 0 being the major allele), then it's reasonable to expect there may be many zeros in a given bitarray.

TransGirlCodes commented 8 years ago

Is this issue resolved?

bicycle1885 commented 8 years ago

Not yet. I still believe 4-bit encoding as default is the easiest, quickest, and least trouble solution to this problem. The use case of long sequences like the human genome is limited and adding a bit to the current 3-bit encoding is not significant on modern computers.

TransGirlCodes commented 8 years ago

I think if we're careful, the current functionality and encoding can be kept fully functional even with the 4 bit encoding, as if sequences are parameterised according to their encoding or the alphabet e.g. NucleotideSequence{2} or NucleotideSequence{4} - dispatch will mean what we already have will be used for the 2 bit encoded sequence. So given that implementing 4 bit sequences won't break what we have for 2 bit sequences, I think I'm really open to it, making a branch and giving it a go can't hurt.

bicycle1885 commented 8 years ago

I've started to work on this topic. I can't say when it will finish at the moment, but hope not to take so long time. My plan is to implement 2-bit and 4-bit (and maybe 8-bit) encodings that include all ambiguous nucleotides and gaps. Also, 2-bit + compressed ambiguous bit encoding representation would be useful to hold long reference sequences like human genome. But it may not be enough common to be inculded in Bio.jl, so I'm going to create a new package, say, ReferenceSequences.jl.

TransGirlCodes commented 8 years ago

I think the 2-bit + compressed ambig bit encoding could be very useful, and possibly common for uses other than just as a reference genome. Especially for some work and techniques we are looking at doing at work, which eliminate the need for a reference.

TransGirlCodes commented 8 years ago

@bicycle1885 I'm actually very interested in helping with some of the lower level stuff to do with the vectors of bits now, as I'd like to use them for creating some SNP and variant types, and just like DNASequences, they will benefit from being 2, 3, and 4 bit encoded.

timbitz commented 8 years ago

Just a few thoughts about uses for custom ambiguous character encodings in nucleotide sequences...

In the process of building some indexable splice graphs, I added a few more 'nucleotides', really just edge signals, where the goal was that these extra encodings would instead not overlap any nucleotides I was likely to encounter in real sequences (or at least those supported in the current dna sequences). This way I could map to a sequence containing these boundaries without allowing mapping over a boundary.

Basically I did:

bitstype 8 SGNucleotide <: Bio.Seq.Nucleotide

"SG Left Edge Signal"
const SG_L = convert(SGNucleotide, 0b101)
"SG Right Edge Signal"
const SG_R = convert(SGNucleotide, 0b110)
"SG Start/Stop TX Signal"
const SG_S = convert(SGNucleotide, 0b111)

type NucleotideSequence{T<:SGNucleotide} <: Bio.Seq.Sequence
    data::Vector{UInt64} # 2-bit encoded sequence
    ns::BitVector        # 'N' mask
    ls::BitVector        # 'L' donor splice site mask
    rs::BitVector        # 'R' acceptor splice site mask
    ss::BitVector        # 'S' tx start/stop mask
    part::UnitRange{Int} # interval within `data` and `ns` defining the (sub)sequence
    mutable::Bool        # true if the sequence can be safely mutated
    hasrelatives::Bool
end

While the design here as an extra set of masks is hacky and a fairly poor solution in general (though it does work), I wanted to use this as an example of why one might use additional characters to represent annotations at the sequence level other than the meanings of the ambiguous nucleotides found in IUPAC codes.

I understand that the goal being discussed is to support additional IUPAC ambiguous nucleotides, however do you guys think it is worth considering this in terms of a more general solution that can carry different, but similarly sized sets of ambiguous character encodings? Probably there should be 'const IUPAC' related encodings that are used by default, but it would be cool if these could be overrided by user defined encodings if desired. Perhaps @bicycle1885 is already thinking beyond this with higher bit-encodings to support gaps and things.

For another example, custom non-IUPAC characters could make formalization of the type of SNP work @Ward9250 is talking about less hacky and safer from mis-interpretation when used in sequences already containing ambiguous IUPAC characters. I could imagine another potential use for custom nucleotides could come in to play with representing modified (methylated or pseudouridylated) nucleotides for someone.

Thoughts? Maybe this is mostly a stretch beyond what anyone is actually likely to need or use though....

TransGirlCodes commented 8 years ago

@timbitz I like this, I like the thought of being able to do NuclotideSequence{T} where T is SGNucleotide, or Variant, or IUPAC and so on. Really plays to Julia's parameterisation and method dispatch.

TransGirlCodes commented 8 years ago

@bicycle1885 How far along are you with this? If you could make a PR with some discussion on your design so far, that would be cool. I'm interested in having a bit vector type underlying sequences, that also works for variants and population genetics. I've made some progress on this when I've worked with SNPs so it would be cool to open up a PR and bring these different concepts together - as they can all be done with bits, of various encodings and alphabets.

bicycle1885 commented 8 years ago

Not so much progress. Now I'm thinking about the design of a new sequence data structure, which will unify all sequence types except kmer. Here, I'd like to briefly introduce current my ideas and will make a pull request soon after I think it works fine. In my design, there are three fundamental components:

alphabets
encoders/decoders
sequence data structure.

Alphabets represent domain of biological characters. For example, DNAAlphabet will represent adenine, cytosine, guanine, and thymine. Note that this is conceptually similar to Bio.Seq.Alphabet but will be extensible outside of Bio.jl. One important thing is that the DNAAlphabet type comes with two flavors: unambiguous and ambiguous nucleotides corresponding 2-bit and 4-bit encoding, respectively. That means the DNA alphabet of unambiguous nucleotides can hold only A/C/G/Ts while ambiguous ones can hold all nucleotides defined by IUPAC and gap characters. Likewise, I may define amino acids with some derivatives (e.g. the standard amino acids, the standard + selenocystein and pyrrolysine, etc.).

Encoders and decoders are mapping between biological characters and binary representation. As I listed below, both encode and decode take an alphabet type as its first argument, and encode takes a biological character and returns its corresponding binary representation. decode does this mapping but in opposite direction:

# unambiguous (2-bit) DNA alphabet
encode(::Type{DNAAlphabet{2}}, x::DNANucleotide) -> UInt8
decode(::Type{DNAAlphabet{2}}, x::UInt8) -> DNANucleotide

The sequence data structure is something like the current NucleotideSequence and AminoAcidSequence, but merges these types in one representation. This type is parameterized by an alphabet type:

type BioSequence{A<:Alphabet}
    data::Vector{UInt64}
    part::UnitRange{Int}
    # and other metadata fields follow
end

In order to get/set a character in this type at a position, the biological character will be decoded after reading from/encoded before writing to the underlying binary data. If we carefully match the biological character and the binary representation, these encoding and decoding is just a re-interpretation of bits and result in no operation, which means no overhead in this process. When users want to extend this sequence representation to hold their own characters, just defining 1. alphabet type and 2. encoder/decoder is enough. This is trivial and I think this will meet @timbitz's (and @Ward9250's, maybe) demand.

TransGirlCodes commented 8 years ago

Absolutely loving this idea @bicycle1885, it's not far from what I wanted to do for genotype/variant data. I wanted an array which would store SNPs as bits. Rows would be individuals, and across the row, a haploid could have each snp as one bit, a diploid could have every snp as two bits, a triploid - every 3 bits per snp and so on. I think this might be possible with your idea, as you could have SNP or genotype alphabets. if we can extend it so as data in BioSequence is an Array rather than a Vector. Alternatively we could have BioSequence where data is strictly a vector, and BioSequenceSet, where the array can be 2D, so could be good for multiple sequences. If you get the PR up and running and get a branch going, if you'd like any assistance with anything, be it coding, tests or docs, let me know! :+1:

TransGirlCodes commented 8 years ago

Question: Would it be possible, and even advantageous, to represent amino acid sequences, which currently take a byte per aa (IIRC), as a wavelet tree? Reducing it to some bit vectors partitioning over the alphabet? https://en.wikipedia.org/wiki/Wavelet_Tree. I also came across this: http://arxiv.org/abs/1204.3581

blahah commented 8 years ago

Didn't @bicycle1885 already implement Wavelet Trees in his SoC project?

TransGirlCodes commented 8 years ago

I believe so, and I wondered then if applying them to represent AA's in a compressed manner is more efficient.

EDIT: Ah no, but there is WaveletMatrices. Which, if I understand correctly, is a better implementation, of one representation (level-wise) of Wavelet Trees.

bicycle1885 commented 8 years ago

Yes, it is possible. But random access would be a little bit slower and modifying elements is not possible. So, it wouldn't be a good data structure to represent this kind of general biological sequences.

bicycle1885 commented 8 years ago

As for compression, the gain would be little because amino acids need at least 5 bits (2^5 = 32 > 20 aa).

TransGirlCodes commented 8 years ago

@bicycle1885 Would you be able to push up a branch of the current progress on this tomorrow? I think I'm going to get a few hours spare, I'd like to start giving you a hand, perhaps start on translating the current sequence stuff over to this new system.

bicycle1885 commented 8 years ago

@Ward9250 Thanks! I made a pul request at #135, but it is now a very early stage of development. Maybe it's hard to work on it, but reviews and comments are welcome :).

BioJulia / Bio.jl

[RFC] More nucleotides #122