Closed by ardakdemir 4 years ago
@BenJWard during error correction it would be nice to use the coverage of each node (the number of occurrences of each kmer). With this information we could simplify the graph, for example by removing the low-coverage path in a bubble. Do you think we should change the SequenceGraphNode type to also include coverage information?
@BenJWard I also forgot to ask your opinion about storing edge multiplicities. We can extract the number of occurrences of each edge during kmer extraction from the reads.
Would having this information about the reads be useful for the functionality we will implement later?
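To make the coverage and edge-multiplicity idea concrete, here is a minimal Python sketch (not the package's code). It assumes reads are plain strings, treats each (k+1)-mer as an edge between its k-prefix and k-suffix, and ignores reverse complements and canonical kmers for brevity. The function name `count_kmers_and_edges` is hypothetical.

```python
from collections import Counter

def count_kmers_and_edges(reads, k):
    """Return (node coverage, edge multiplicity) counted over all reads."""
    node_cov = Counter()   # kmer -> number of occurrences
    edge_mult = Counter()  # (source kmer, destination kmer) -> occurrences
    for read in reads:
        for i in range(len(read) - k + 1):
            node_cov[read[i:i + k]] += 1
        for i in range(len(read) - k):
            # A (k+1)-mer links its k-prefix to its k-suffix.
            edge_mult[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return node_cov, edge_mult

node_cov, edge_mult = count_kmers_and_edges(["ACGTA", "CGTAC"], 3)
print(node_cov["CGT"])              # 2
print(edge_mult[("CGT", "GTA")])    # 2
```

Both counters come out of the same pass over the reads, so storing edge multiplicities would cost little extra work during kmer extraction.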
Dead-end trimming is implemented in the function 'delete_tips'.
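For readers unfamiliar with tip trimming, the sketch below shows one way it could look in Python. It is not the 'delete_tips' implementation. It assumes the graph is a dict mapping each node to the set of its successors, only handles tips that are a single node long, and (as an extra assumption) leaves a branching node alone when all of its branches are dead ends. The name `trim_tips` is hypothetical.

```python
def trim_tips(graph):
    """Remove single-node dead-end tips that hang off branching nodes.

    graph: dict mapping node -> set of successor nodes.
    Returns a new graph; the input is not modified.
    """
    trimmed = {node: set(succs) for node, succs in graph.items()}
    for node, succs in graph.items():
        dead_ends = {d for d in succs if not graph.get(d)}
        # Trim only when the node branches and at least one branch continues.
        if len(succs) > 1 and dead_ends != succs:
            trimmed[node] -= dead_ends
    # Drop nodes that have no outgoing edges and are no longer pointed to.
    still_targeted = set().union(*trimmed.values())
    return {n: s for n, s in trimmed.items() if s or n in still_targeted}

graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}, "D": set(), "X": set()}
print(trim_tips(graph) == {"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": set()})  # True
```

Node D survives because it is the end of the main path (its predecessor C does not branch), while X is removed as a tip off the branching node B.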
The next step is to pop bubbles in the graph before forming the final contigs. Bubble popping can be implemented by removing the low-coverage branch (contig) of each bubble after building unitigs from the kmers. However, the resulting graph may contain new contigs, so contig building must be repeated after the error branches are removed.
To remove the bubbles, my plan is to use the 'build_unitigs_from_kmerlist' function to detect unitigs that start and end at the same nodes, then remove the kmers on the low-coverage contig and repeat unitig building on the resulting kmer list.
I tried to formulate bubble popping as finding and removing unitigs that start and end at the same nodes (kmers). I have therefore updated the 'build_unitigs_from_kmerlist' function so that it deletes unitigs that have low coverage and share their start and end nodes with another unitig. The updated function is available on the gsoc/error-correction2 branch under the name 'build_unitigs_from_kmerlist2'.
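The selection step of this formulation can be sketched in Python as follows. This is an illustration only, not the code in 'build_unitigs_from_kmerlist2'. It assumes each unitig already carries its start node, end node, interior kmers, and a mean coverage; the `Unitig` record and `kmers_to_remove` function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Unitig:
    start: str       # node where the unitig leaves the shared path
    end: str         # node where it rejoins the shared path
    kmers: list      # interior kmers of the unitig
    coverage: float  # mean coverage of the interior kmers

def kmers_to_remove(unitigs):
    """Collect the kmers of every non-best branch among unitigs sharing start and end."""
    groups = defaultdict(list)
    for u in unitigs:
        groups[(u.start, u.end)].append(u)
    doomed = set()
    for branches in groups.values():
        if len(branches) < 2:
            continue  # not a bubble
        best = max(branches, key=lambda u: u.coverage)
        for u in branches:
            if u is not best:
                doomed.update(u.kmers)
    return doomed

unitigs = [
    Unitig("ACG", "TTA", ["CGA", "GAT", "ATT"], 30.0),
    Unitig("ACG", "TTA", ["CGC", "GCT", "CTT"], 2.0),
    Unitig("TTA", "GGC", ["TAG", "AGG"], 28.0),
]
print(sorted(kmers_to_remove(unitigs)))  # ['CGC', 'CTT', 'GCT']
```

The returned kmers would then be dropped from the kmer list before unitig building is run again. This is a single pass, so it only sees bubbles whose two sides are already complete unitigs.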
However, constructing the unitigs and then removing them may not find all the contigs after bubble popping, especially in the case of nested bubbles. I am therefore planning to switch to another design in which we delete the bubbles using the Tour Bus algorithm (similar to the approach taken by Velvet).
Another approach (taken by Arapan) is to repeat the path-collapsing and bubble-removal steps until no more changes are made to the graph.
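The repeat-until-stable idea can be sketched as a small fixed-point loop. This is a generic Python illustration, not Arapan's code. It assumes each simplification pass takes a graph (dict of node to successor set) and returns a new one; `simplify_until_stable`, `drop_weak_dead_ends`, and the `max_rounds` safety cap are hypothetical.

```python
def simplify_until_stable(graph, passes, max_rounds=100):
    """Apply every pass in turn, repeating until a full round changes nothing."""
    for _ in range(max_rounds):
        before = graph
        for simplify in passes:
            graph = simplify(graph)
        if graph == before:
            break
    return graph

def drop_weak_dead_ends(graph, coverage, min_cov):
    """Toy pass: remove dead-end nodes whose coverage is below min_cov."""
    weak = {n for n, succs in graph.items() if not succs and coverage[n] < min_cov}
    return {n: succs - weak for n, succs in graph.items() if n not in weak}

graph = {"B": {"C", "X"}, "X": {"Y"}, "Y": set(), "C": {"D"}, "D": set()}
coverage = {"B": 20, "C": 20, "D": 20, "X": 1, "Y": 1}
result = simplify_until_stable(
    graph, [lambda g: drop_weak_dead_ends(g, coverage, min_cov=5)]
)
print(result)  # {'B': {'C'}, 'C': {'D'}, 'D': set()}
```

In the example a single pass only removes Y. X becomes a dead end as a result and is removed in the second round, which is the kind of cascading change that repetition is meant to catch.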
I think it would be nice to include some error correction functionality before generating contigs. This would let us work with real (error-containing) data and would also serve researchers who only want to do error correction. Below I list some of the error correction functions I am planning to implement to simplify the de Bruijn graph:
Trimming dead-end tips: We remove all dead-end tips. A tip is an edge leaving a node with multiple outgoing edges whose destination has no outgoing edges. These destination nodes are treated as errors that occurred at the end of a read.
Popping bubbles: A bubble is formed by two paths that diverge from a single node and then merge again at another node. In this case one of the paths is removed from the graph. The removed path usually has low coverage (depth) and is treated as an error that occurred in the middle of a read.
Removing chimeric edges: These are edges that cross between two simple paths. Such edges usually have low coverage and are removed from the graph.
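As an illustration of the third item, the Python sketch below flags an edge as chimeric when its source has more than one outgoing edge, its destination has more than one incoming edge, and its multiplicity is below a threshold. That rule is an assumption made for this example, not a settled definition, and the names `remove_chimeric_edges` and `min_mult` are hypothetical.

```python
from collections import Counter

def remove_chimeric_edges(edge_mult, min_mult):
    """Drop low-multiplicity edges that bridge two otherwise simple paths.

    edge_mult: dict mapping (source, destination) -> edge multiplicity.
    """
    out_deg = Counter(u for u, _ in edge_mult)
    in_deg = Counter(v for _, v in edge_mult)
    return {
        (u, v): m
        for (u, v), m in edge_mult.items()
        # Keep the edge unless it branches off, merges in, and is weak.
        if not (out_deg[u] > 1 and in_deg[v] > 1 and m < min_mult)
    }

edges = {
    ("A", "B"): 20, ("B", "C"): 18,   # first simple path
    ("X", "Y"): 22, ("Y", "Z"): 21,   # second simple path
    ("B", "Y"): 1,                    # weak edge crossing between them
}
cleaned = remove_chimeric_edges(edges, min_mult=3)
print(("B", "Y") in cleaned)  # False
print(len(cleaned))           # 4
```

This relies on edge multiplicities being available, which ties back to the question of whether the graph should store coverage information.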