BioJulia / GenomeGraphs.jl

A modern genomics framework for julia
https://biojulia.dev
Other
67 stars 8 forks source link

Error Correction #12

Closed ardakdemir closed 4 years ago

ardakdemir commented 5 years ago

I think it would be nice to include some error correction functionalities before generating contigs. This will both enable us to work with real (error containing) data and also allow researchers who would like to do only error correction. Below I list some of the error correction functions I am planning to implement to simplify the de bruijn graph:

ardakdemir commented 5 years ago

@BenJWard during error correction it would be nice to use the coverage of each node (number of occurrence of each kmer). Using this information we can simplify the graph by removing the low-coverage path in a bubble etc. Do you think we should change the SequenceGraphNode type to also include information related to coverage?

ardakdemir commented 5 years ago

@BenJWard also I forgot to ask your opinion about having edge multiplicities? We can extract the number of occurrences of each edge during kmer extraction from reads.

Would getting such information about the reads be useful for the subsequent functionalities that we will implement ?

ardakdemir commented 5 years ago

Dead-end trimming is implemented under the function : delete_tips.

Next step is to pop bubbles in the graph before forming the final contigs. Popping bubbles can be implementing as removing the low-covered branch (contig) in each bubble after building unitigs from kmers. Yet, the new graph may have new contigs so the contig building must be repeated after the removal of the error branches.

To remove the bubbles, may plan is to make use of the 'build_unitigs_from_kmerlist' function to detect unitigs that start and end at the same nodes. Then remove the kmers on the low-covered contig and repeat the unitig building from the resulting kmerlist.

ardakdemir commented 5 years ago

I tried to formulate the bubble popping problem as finding and removing unitigs that start and end at same nodes (kmers). Thus I have updated the 'build_unitigs_from_kmerlist' function which deletes the unitigs that have low coverage and start/end with another unitig. The updated function is available under gsoc/error-correction2 branch with the name 'build_unitigs_from_kmerlist2'.

However, constructing the unitigs and then removing them may not find all the contigs after bubble popping especially in the case of nested bubbles. Thus, I am planning to switch to another design where we delete the bubbles using the tour bus algorithm (similar to the approach taken by Velvet).

Another approach (taken by Arapan) is to repeat the path collapsing/bubble_removing steps multiple times until no changes are made in the graph.