BioJulia / GenomeGraphs.jl

A modern genomics framework for julia
https://biojulia.dev
Other
67 stars 8 forks source link

Basic De-Bruijn Graph type #1

Closed TransGirlCodes closed 4 years ago

TransGirlCodes commented 5 years ago

We need to implement a basic de-bruijn graph type, along with constructors and basic manipulation methods.

ardakdemir commented 5 years ago

I have pushed some initial functions to the gsoc branch for finding overlaps and constructing a sequence graph. For the time being it is just a SequenceGraph that is constructed as a de-bruijn graph.

Next I will start implementing de-bruijn graphs as a type on its own

ardakdemir commented 5 years ago

We have implemented a DeBruijnGraph type which supports core functionalities such as contructing the graph from a given set of kmer along with some query functions such as indegree, outdegree counting. The final version is available in the gsoc branch.

TransGirlCodes commented 5 years ago

@ardakdemir, here is the pseudocode for constructing a De-Bruijn Graph from a set of DNAKmer{K}, I was going through last time we Skyped.

Make an empty graph

Make two empty Vector{Tuple{DNAKmer{K-1}, Int64}}. Call one `kmer_ovl_bw_nodes`, call the other one `kmer_ovl_fw_nodes`. 

For each kmer in kmerset...
    Make the canonical form of the kmer. (there should be a canonical method in BioSequences).
    Add the canonical kmer to the nodes of the graph, and note it's ID.
    Take the prefix (k-1) of the canonical kmer.
    If the prefix is canonical, push the tuple (canonical(prefix), +nodeid) to `kmer_ovl_fw_nodes`.
    Else push the tuple(canonical(prefix), +nodeid) to the `kmer_ovl_bw_nodes`.
    Take the suffix (k-1) of the canonical kmer.
    If the suffix is canonical, push the tuple (canonical(suffix), -nodeid) to `kmer_ovl_bw_nodes`.
    Else push the tuple (canonical(suffix), - nodeid) to `kmer_ovl_fw_nodes`.
end

Sort `kmer_ovl_fw_nodes`.
Sort `kmer_ovl_bw_nodes`.

for kbn in kmerovl_bw_nodes
        for kfn in kmerovl_fw_nodes
            if first(kbn) == first(kfn)
                add link to graph with source=last(kbn), destination=last(kfn) and distance=-k+1
            end
        end
end

Return the graph.
ardakdemir commented 5 years ago

Dear Ben,

Thanks a lot for the pseudocode. I will go over it and try to update the constructor.

I will let you know if I have any questions.

Arda

ardakdemir commented 5 years ago

Hey !

The pseudocode was really easy to understand! Now I think I got the reason behind using (+) and (-) ends and canonical forms. I think this is a good way to represent dbg in a compact form.

I have implemented the dbg constructor according to the pseudocode. I have pushed the final version to the gsoc branch.

You can check it out from here : https://github.com/BioJulia/BioSequenceGraphs.jl/tree/gsoc

I have done some initial tests and everything seems correct, but still there might be some errors/bugs.