PacificBiosciences / FALCON_unzip

Making diploid assembly becomes common practice for genomic study
BSD 3-Clause Clear License

BLASR alignment might take long time for contigs that are mostly simple tandem repeats #16

Closed pb-jchin closed 6 years ago

pb-jchin commented 8 years ago

Some contigs are mostly simple repeats. The seeding and filling algorithm used in BLASR has trouble aligning reads to those contigs efficiently. It might make sense to detect shorter contigs with low sequence entropy and skip phasing on those contigs.
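As a rough illustration of the idea above, here is a minimal Python sketch that flags short, low-entropy contigs using the Shannon entropy of their k-mer composition. This is not code from FALCON_unzip; the function names, the k-mer size, and the `min_entropy` and `max_len` cutoffs are all hypothetical and would need tuning on real data.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=3):
    """Shannon entropy (in bits) of the k-mer composition of seq."""
    seq = seq.upper()
    n = len(seq) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    # Sum of p * log2(1/p) over the observed k-mers, with p = c / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def should_skip_phasing(seq, k=3, min_entropy=3.0, max_len=100_000):
    """Hypothetical filter: True for short contigs with low k-mer entropy.

    Both thresholds are illustrative assumptions, not values taken
    from FALCON_unzip.
    """
    return len(seq) <= max_len and kmer_entropy(seq, k) < min_entropy
```

For 3-mers the entropy tops out at 6 bits (64 equally frequent triplets), while a pure dinucleotide repeat such as `ACACAC...` only ever produces two distinct triplets and scores about 1 bit, so a cutoff somewhere in between separates the two cases.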

pb-cdunn commented 8 years ago

DBdust could be used to mask low entropy sections, so at least they would not contribute to daligner overlaps.

TANmask could be used to mask tandem repeats. I've added DAMASKER to FALCON-integrate and made it available in our internal mobs build too.

pb-jchin commented 8 years ago

DBdust won't help in the short term, as we need the SAM/BAM infrastructure for the phasing work. Where is the TANmask code? I need to take a look before concluding whether it could help or not.

pb-cdunn commented 8 years ago

Code is in

pb-jchin commented 8 years ago

OK. The problem is that we can't use Daligner and DAZZ_DB for this yet. We do need something like raw dust-masking code for quick (on-the-fly) detection. I will need to go through it with you sometime to explain the problem better.
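A standalone, on-the-fly check of this kind could look roughly like the Python sketch below. It computes a DUST-style triplet score over sliding windows and reports the fraction of a contig that would be masked. This is only an illustration under stated assumptions: it is not the DBdust implementation, and the function names, window size, step, and threshold are hypothetical choices.

```python
from collections import Counter


def dust_score(window):
    """DUST-style score of one window.

    For each distinct triplet seen c times, add c * (c - 1) / 2,
    then divide the total by (number of triplets - 1).
    """
    n = len(window) - 2  # number of overlapping triplets
    if n <= 1:
        return 0.0
    counts = Counter(window[i:i + 3] for i in range(n))
    return sum(c * (c - 1) / 2 for c in counts.values()) / (n - 1)


def low_complexity_fraction(seq, window=64, step=32, threshold=2.0):
    """Fraction of bases covered by windows scoring above threshold.

    window, step, and threshold are illustrative assumptions.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    masked = [False] * len(seq)
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        if dust_score(seq[start:start + window]) > threshold:
            for i in range(start, min(start + window, len(seq))):
                masked[i] = True
    return sum(masked) / len(seq)
```

A contig whose masked fraction is close to 1 would be a candidate for skipping BLASR alignment and phasing. Note that with a fixed step the last few bases of a contig may fall outside any window, so the fraction can be slightly below 1 even for a pure repeat.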

mictadlo commented 6 years ago

Any updates on this? Or could ngmlr be used instead, since it might be faster?

pb-cdunn commented 6 years ago

We always run DBdust now. That might help. If you go to the dazzlerblog, you can learn how to analyze the "dust track" to see how much has been masked.

We are also replacing blasr, hopefully within a few weeks.