BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

[RFC] Proposal for Julia Summer of Code 2015 #54

Closed bicycle1885 closed 9 years ago

bicycle1885 commented 9 years ago

You may already know about Julia Summer of Code (JSoC) 2015. This seems to be something like the Google Summer of Code: students develop open source projects under support of their mentor funded by Google, Inc. Please beware that the application deadline is very soon (June 1)!

More detailed announcement is at the website:

Thanks to a generous grant from the Moore Foundation, we are happy to announce the 2015 Julia Summer of Code (JSoC) administered by NumFocus. We realize that this announcement comes quite late in the summer internship process, but we are hoping to fund six projects. The duration of JSoC 2015 will be June 15-September 15. Last date for submitting applications is June 1.

http://julialang.org/blog/2015/05/jsoc-cfp/

I'm thinking to apply to the JSoC as a subproject of BioJulia if I can find a good mentor and a project here. @dcjones is so kind that he said he would be able to be my mentor (thanks a lot!), but I haven't got a concrete project idea at the current moment. So if you have a great project in your head and thank this is a good chance to make it, please let me know! That is the reason why I opened this issue.

If not, I have several abstract project ideas for the candidates:

I have to confess that I'm not an expert developer of these fields, but I'm a heavy user of these tools and will continue to use for a long time.

I'm waiting for your involvement of the discussion.

Thanks.


Personal background:

blahah commented 9 years ago

I think this is an excellent idea and I think you won't find a better mentor than @dcjones.

Of your three ideas, I think alignment parsers and aligners would have the most widespread interest.

I'll throw in another idea:

Building efficient data structures for working with kmers. These would be the foundation for many different applications of BioJulia to NGS, including assembly, read processing like error correction, and various kinds of alignment. Example data structures:

blahah commented 9 years ago

Also I should add that anything I can do to support your application, I will happily do. I think this would be great for BioJulia.

bicycle1885 commented 9 years ago

@Blahah Thank you for your comment. You all are so kind!

Now I'm wondering whether implementing parsers may be less appealing to outsiders, even though we all know it is a extremely important job. Since it's hard to find data structures and algorithms for sequence analysis and searching in the Julia ecosystem, developing such tools would be a promising proposal.

To add some to @Blahah's list,

We have to pick one or a few topics to make a proposal. I personally like BWT/FM-index because fast in-memory searching would be one of the most universally applicable tools in bioinformatics. Wavelet tree/matrix is also interesting and you can find numerous applications in sequence and graph data processing.

dcjones commented 9 years ago

I do like the idea of implementing some useful algorithms and data structures. That could pave the way for using Julia to implement "heavy-duty" tools like aligners and assemblers.

A good FM-index or suffix array implementation would be great. If that were implemented in generic way it would be broadly useful for full-text searching beyond just Bio.jl. Hashing k-mers efficiently is also very useful to both assembly and alignment. Wavelet trees and local sensitive hashing are compelling, but more narrowly applicable than FM-index or k-mer indexing. (Though I may be wrong in that assessment.)

TransGirlCodes commented 9 years ago

Am I right in thinking a Wavelet tree is used for compressing suffix arrays?

bicycle1885 commented 9 years ago

@Ward9250 Wavelet trees will be used to implement a rank query: given a string S, rank(c, S, i) returns the number of times character c appears in the prefix S[1:i]. A Wavelet tree enables to query rank in O(log σ) time, where σ is the number of alphabets, according to Compressed Text Indexes: From Theory to Practice, § 2.3. But we might need to store a compressed suffix array (CSA) to restore positions in the original text and the CSA can be represented in a succinct bit vector, which is a building block of a wavelet matrix, though I'm not sure at the current moment.

Wavelet matrices seem to be an alternative representation of Wavelet trees, but are said to be simpler and faster than Wavelet trees (The Wavelet Matrix).

The following snippet would make clearer what we need. We have to speed up the rank operation in O(log σ) time and store a BWTed text and suffix array in a size as small as possible while keeping fast operations. I'm now surveying papers and writing the proposal. I'm going to finish it as soon as possible. When finished, I'll post it here and then could you please correct it if you have time (the deadline is too harsh!)?

# FM-index searching for ASCIIString
# Reference: PAOLO, et al. "Compressed Text Indexes: From Theory to Practice"

# https://github.com/quinnj/SuffixArrays.jl
using SuffixArrays

# Return the interval SA[sp,ep] of text suffixes prefixed by p; see "Algorithm FM-count"
# Arguments:
#   t: BWTed text
#   p: prefix
#   c: character counts
function countFM(t, p, cnt)
    sp, ep = 1, endof(t)
    i = endof(p)
    while sp ≤ ep && i ≥ 1
        c = p[i]
        sp = cnt[c] + rank(c, t, sp - 1) + 1
        ep = cnt[c] + rank(c, t, ep)
        i -= 1
    end
    return ifelse(sp > ep, 0:-1, sp:ep)
end

function bwt(t)
    sa = suffixsort(t)
    # note that the indices stored in a saffix array is 0 based
    ord = [sa.index[i] > 0 ? sa.index[i] : endof(t) for i in 1:endof(t)]
    return t[ord], sa
end

function count(t)
    cnt = zeros(Int, 128)
    for c in t
        cnt[convert(Int, c)+1] += 1
    end
    return cumsum(cnt)
end

# poor man's rank query (should be implemented in a Wavelet tree/matrix)
function rank(c, t, i)
    @assert 0 <= i <= endof(t)
    n = 0
    for i′ in 1:i
        if t[i′] == c
            n += 1
        end
    end
    return n
end

let
    s = "banana"
    t, sa = bwt(s)
    cnt = count(t)
    @show t sa
    @show countFM(t, "ban", cnt)
    @show countFM(t, "an", cnt)
    @show countFM(t, "ann", cnt)

    println("-"^60)

    s = "abracadabra"
    t, sa = bwt(s)
    cnt = count(t)
    @show t sa
    @show countFM(t, "a", cnt)
    @show countFM(t, "bra", cnt)
    @show countFM(t, "cd", cnt)
end

Output:

t => "nnbaaa"
sa => SuffixArray{ASCIIString,Int8}("banana",6,Int8[5,3,1,0,4,2])
countFM(t,"ban",cnt) => 4:4
countFM(t,"an",cnt) => 2:3
countFM(t,"ann",cnt) => 0:-1
------------------------------------------------------------
t => "rdarcaaaabb"
sa => SuffixArray{ASCIIString,Int8}("abracadabra",11,Int8[10,7,0,3,5,8,1,4,6,9,2])
countFM(t,"a",cnt) => 1:5
countFM(t,"bra",cnt) => 6:7
countFM(t,"cd",cnt) => 0:-1
bicycle1885 commented 9 years ago

I've wrote a draft of my proposal below and a demonstration of my idea at https://github.com/bicycle1885/WaveletMatrices.jl. The format of application is not specified, so I mimicked the format of the Google Summer of Code. I'm sorry to be late to post it.


Efficient data structures and algorithms for sequence analysis in BioJulia.

Applicant: Kenta Sato (GitHub: @bicycle1885) Mentor: Daniel C. Jones (GitHub: @dcjones)

Summary

Large-scale sequence analysis is a fundamental part of recent bioinformatics. High-throughput DNA sequencers generate massive amount of short DNA fragments from biological samples and these fragments should be located where they come from in other long sequences for downstream analyses. In order to search the matching locations of these short sequences in a long sequence like the human genome, efficient data structures and algorithms are indispensable. In this project, under the BioJulia project (https://github.com/BioJulia), we will develop a full-text search and alignment library based on the FM-index[4]. The result will enhance the use of the Julia language for sequence analysis in the bioinformatics community and hence encourage more developers to write programs in the related fields of biology.

What I will make

I will implement sequence searcher/aligner based on the FM-index to handle genome-scale sequence data from high-throuput sequencing. This would be the most commonly used technique to build fast and compact indices for full-text search, adopted by Burrows-Wheeler Aligner (BWA)[2] and bowtie2[1] software when aligning massive short sequences to a huge genome sequence. The FM-index search is run over a text of Burrow-Wheeler transform (BWT)[3] of the original text and the transformed text need to be stored in a compact data structure that supports fast occurrence count of characters[5]. To fulfill the demands, I will develop Julia packages about succinct data structures: bit vector and Wavelet tree/matrix. These data structures enable occurrence count of characters for a sequence of arbitrary length in constant time.

Since the BioJulia project already has basic data structures to represent sequences, I will implement these tools on the top of them for seamless integration. As for biological sequences, the alignment algorithm should allow small mismatches or gaps to be robust against noises and genetic variants. I will consult other implementations to handle short mismatches well.

Demonstration

I have made two new packages required to create the search engine in order to show the implementability of this project:

IndexedBitVectors.jl is for a data structure like a bit vector, which allows some operations like bit count in constant time with small memory overhead. WaveletMatrices.jl is built on the top of IndexedBitVectors.jl, which supports to store 8bit elements rather boolean elements.

Using these data structures and the FM-index, I implemented a DNA fragment searcher in the human genome. It can calculate the number of occurrences of the query fragment in a chromosome for 10-100μ seconds while linear search requires nearly x1000 time (10-100m seconds).

For more details, please refer to the following links: https://github.com/bicycle1885/WaveletMatrices.jl#demo https://github.com/bicycle1885/WaveletMatrices.jl/blob/master/fmindex.jl

Plan

Impact

This will be a meaningful step toward sequence analysis because sequence search and alignment are indispensable building blocks of bioinformatics. Almost all of the existing programs related to this kind of sequence analysis are implemented in C, C++, and Java, but simplicity and dynamic nature of Julia would be appealing if comparable performance can be achieved.

About me

I am a graduate school student at the University of Tokyo, Japan, studying bioinformatics and human genetics. I have about 1.5 years’ experience of intensive Julia programming and have developed several packages in Julia:

I am one of founders of the JuliaTokyo user group and we held meetups for three times in 2014 and 2015, the total turnout was about 150.

References

  1. Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biol 10.3 (2009): R25.
  2. Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25.14 (2009): 1754-1760.
  3. Burrows, Michael, and David J. Wheeler. "A block-sorting lossless data compression algorithm." (1994).
  4. Ferragina, Paolo, and Giovanni Manzini. "Opportunistic data structures with applications." Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on. IEEE, 2000.
  5. Ferragina, Paolo, et al. "Compressed text indexes: From theory to practice."Journal of Experimental Algorithmics (JEA) 13 (2009): 12.
blahah commented 9 years ago

Looks great @bicycle1885 - can you put it somewhere that I can make a PR or inline comments? I want to suggest some edits.

bicycle1885 commented 9 years ago

@Blahah Thank you a lot! I've make a new repository here: https://github.com/bicycle1885/JSoC2015 and Proposal.md is the text. Native check and other edits of my proposal are really valuable to me.

bicycle1885 commented 9 years ago

Hi, I've almost finished my writing (at least in my mind). I'm going to send an email for application about 6 PM in Eastern Daylight Time, 6 hours behind from the deadline (the end of June 1st, the timezone is not specified :sweat_smile:). It's about in 11 hours from now. So if you have comments/edits, please let me know until this time.

Thank you for your contributions :smile:

bicycle1885 commented 9 years ago

I've sent the proposal to the JSoC organization! Let's wait for the response.

I really appreciate your comments and contributions. Thank you!

bicycle1885 commented 9 years ago

Thanks to your commitment of my proposal, my project has been accepted by the JSoC 2015 program. Especially @dcjones and @Blahah were kind to correct my poor proposal soon, thank you again! I noticed this announcement today, but my supernatural power informed me of the acceptance in advance and I already prepared plane tickets for JuliaCon 2015 :grin:

I'm looking forward to seeing @dcjones and other developers at JuliaCon!

dcjones commented 9 years ago

Congratulations @bicycle1885! I'll see you in Boston in a couple weeks.