ekg / edyeet

base-accurate DNA sequence alignments using edlib and mashmap2
MIT License

Merging segment alignments with edyeet and wfmash #5

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Hi Erik

I have a question about the -M parameter for edyeet and wfmash. Since it produces a better overall mapping by merging the alignments of each segment, wouldn't it be better to use -M every time? Or is it too costly for whole chromosomes? What's the typical increase in run time when using -M for whole genomes/chromosomes? Does leaving out the -M parameter leave artifacts in the alignments/variants? Is it something only worth doing for smaller gene regions? Would it be worthwhile to use the -M parameter as the last step for the final production-level pangenome graph, after all parameter testing is done? Thanks.

ekg commented 3 years ago

I have been experimenting with it. There are two main problems. First, it tends to use a lot more memory and time to align very long segments, which makes the runtime a bit hard to control. Second, -M doesn't provide the same kinds of numerical guarantees as -n, or at least it is incompatible with the formulation of -n. This makes the resulting alignments somewhat ragged in length and quality. Both of these issues could be mitigated in various ways. For instance, we could merge the alignments but still align small fragments to keep memory down.
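To make the "merge, but still align small fragments" idea concrete, here is a minimal Python sketch. It is not how edyeet or wfmash implement -M; the tuple layout, `merge_mappings`, `split_for_alignment`, `max_gap`, and `max_len` are all hypothetical names chosen for illustration. It merges adjacent colinear segment mappings, then cuts each merged mapping into pieces of bounded length so that each base-level alignment stays small.

```python
from typing import List, Tuple

# Hypothetical mapping record: (query_start, query_end, target_start, target_end)
Mapping = Tuple[int, int, int, int]


def merge_mappings(mappings: List[Mapping], max_gap: int) -> List[Mapping]:
    """Merge consecutive colinear mappings whose gap on both sequences is <= max_gap."""
    merged: List[Mapping] = []
    for m in sorted(mappings):
        if merged:
            qs, qe, ts, te = merged[-1]
            if 0 <= m[0] - qe <= max_gap and 0 <= m[2] - te <= max_gap:
                # Extend the previous mapping instead of starting a new one.
                merged[-1] = (qs, max(qe, m[1]), ts, max(te, m[3]))
                continue
        merged.append(m)
    return merged


def split_for_alignment(mapping: Mapping, max_len: int) -> List[Mapping]:
    """Cut one merged mapping into pieces no longer than max_len on either sequence.

    Breakpoints are placed proportionally along query and target, which is a
    simplification: a real aligner would want to break at trusted anchors.
    """
    qs, qe, ts, te = mapping
    q_len, t_len = qe - qs, te - ts
    n = max(1, -(-max(q_len, t_len) // max_len))  # ceiling division
    return [
        (qs + q_len * i // n, qs + q_len * (i + 1) // n,
         ts + t_len * i // n, ts + t_len * (i + 1) // n)
        for i in range(n)
    ]


segments = [(0, 5000, 100, 5100), (5000, 10000, 5100, 10100), (50000, 55000, 90000, 95000)]
merged = merge_mappings(segments, max_gap=1000)
pieces = [p for m in merged for p in split_for_alignment(m, max_len=4000)]
```

Memory for each alignment call then scales with `max_len` rather than with the full merged length, at the cost of possibly suboptimal alignments near the cut points.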

I wonder if improvements to the chaining algorithm could resolve this. A dynamic programming model that does a kind of pseudo alignment might help. That plus breaking up the alignment somehow would probably make this work.
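As a rough illustration of what a dynamic programming chaining step could look like, the sketch below uses a textbook colinear chaining recurrence over anchors. It is an assumption for illustration, not the chaining used in edyeet, wfmash, or mashmap2; the anchor layout `(query_pos, target_pos, length)`, the function `chain_anchors`, and the linear `gap_penalty` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical anchor record: (query_pos, target_pos, length)
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor], gap_penalty: float = 0.1) -> List[Anchor]:
    """Return the highest-scoring colinear chain using an O(n^2) DP.

    score[i] = len_i + max(0, max over valid j of score[j] - gap_penalty * gap(j, i)),
    where j must end before i starts on both query and target.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    score = [float(a[2]) for a in anchors]  # a chain may start at any anchor
    prev = [-1] * n
    for i in range(n):
        qi, ti, li = anchors[i]
        for j in range(i):
            qj, tj, lj = anchors[j]
            dq = qi - (qj + lj)
            dt = ti - (tj + lj)
            if dq < 0 or dt < 0:
                continue  # overlapping or not colinear
            cand = score[j] + li - gap_penalty * max(dq, dt)
            if cand > score[i]:
                score[i] = cand
                prev[i] = j
    # Backtrack from the best-scoring end anchor.
    best = max(range(n), key=score.__getitem__)
    chain: List[Anchor] = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1]


anchors = [(0, 0, 100), (150, 160, 100), (120, 5000, 100), (300, 310, 100)]
chain = chain_anchors(anchors)
```

Here the anchor at target position 5000 is left out of the chain because reaching it would require a large gap. The gaps between chained anchors are natural places to break the alignment into smaller problems.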

brettChapman commented 3 years ago

Thanks for the explanation. For now I'll leave -M off unless I'm looking at small regions, and include it later if improvements to the algorithm mitigate its costs.