bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0

Allow dynamic reduction of the coverage for highly covered references #34

Closed loic-couderc closed 6 years ago

loic-couderc commented 7 years ago

At the moment, for huge datasets, some species can be very highly covered (> 10000x). In that case, computing the overlap graph is impossible in practice because the algorithm used is O(n²), with n being the read coverage.
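To give a rough idea of the scale, here is a small sketch that counts read pairs at a given depth. It assumes the quadratic cost comes from comparing every read with every other read covering the same region; `pairwise_comparisons` is a hypothetical helper for illustration, not a function from matam.

```python
def pairwise_comparisons(coverage):
    """Number of unordered read pairs among `coverage` overlapping reads."""
    return coverage * (coverage - 1) // 2


# Compare a few depths to see how fast the pair count grows.
for depth in (100, 1000, 10000):
    print(f"{depth}x -> {pairwise_comparisons(depth)} read pairs")
```

Under that assumption, dividing the coverage by 10 divides the number of pairs by roughly 100, which is why capping the coverage pays off so quickly.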

To be able to assemble such datasets, we can implement different kinds of dynamic coverage reduction (from simplest to hardest):

loic-couderc commented 6 years ago

I have implemented one possible approach for the second sampling method ("randomly sample highly covered regions"). The sampling is done one position at a time, from left to right, until the threshold is reached. The major drawback of this method is that it introduces low-coverage regions, as the following figure shows: it plots the coverage before and after sampling for all positions of one reference. The red dashed line corresponds to the threshold applied. image See coverage_fb17af4.pdf for more examples.
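For readers who want the idea in code, here is a minimal sketch of one way such a left-to-right sampling could work. It is not the code from the commit: the function names, the `(start, end)` half-open read representation, and the rule "keep randomly chosen reads starting at a position while the coverage there is below the threshold" are all assumptions made for illustration.

```python
import random


def sample_reads_by_position(reads, ref_length, threshold, seed=0):
    """Return the indices of the reads kept after sampling.

    reads: list of (start, end) half-open intervals, with end <= ref_length.
    """
    rng = random.Random(seed)
    # Group read indices by their start position.
    by_start = {}
    for idx, (start, _end) in enumerate(reads):
        by_start.setdefault(start, []).append(idx)

    ending = [0] * (ref_length + 1)  # kept reads ending at each position
    kept = []
    coverage = 0  # coverage of the kept reads at the current position
    for pos in range(ref_length):
        coverage -= ending[pos]
        candidates = by_start.get(pos, [])
        rng.shuffle(candidates)  # random choice among reads starting here
        for idx in candidates:
            if coverage >= threshold:
                break
            kept.append(idx)
            coverage += 1
            ending[reads[idx][1]] += 1
    return sorted(kept)


def coverage_profile(reads, ref_length):
    """Per-position coverage of a list of (start, end) reads."""
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    profile, depth = [], 0
    for pos in range(ref_length):
        depth += diff[pos]
        profile.append(depth)
    return profile


# Toy data: two stacks of 50 identical reads on a 20 bp reference.
reads = [(0, 10)] * 50 + [(5, 15)] * 50
kept = sample_reads_by_position(reads, ref_length=20, threshold=10)
before = coverage_profile(reads, 20)
after = coverage_profile([reads[i] for i in kept], 20)
print(before[12], after[12])
```

On this toy input the second stack of reads is dropped entirely, because the threshold is already reached when they start, so positions 10 to 14 end up uncovered. That is the same kind of low-coverage artifact described above.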

loic-couderc commented 6 years ago

The last commit tries to fix the low-coverage regions:

image See coverage_1a5c7b6.pdf for more details.

loic-couderc commented 6 years ago

Evaluation of the method from commit 26c46f8: coverage_26c46f8.pdf
Evaluation of the method from commit e24001d: coverage_e24001d.pdf