broadinstitute / viral-ngs

Viral genomics analysis pipelines
Other
190 stars 67 forks source link

let scaffolder use ambiguous alignments when no unambiguous ones exist #904

Closed notestaff closed 5 years ago

notestaff commented 5 years ago

When scaffolding contigs to a reference, if no unambiguous alignments cover a part of reference, let the scaffolder consider ambiguous ones, when all alignments mostly agree on the sequence. The ambiguity can be the result of a misassembly, where the same reference part appears twice in a contig. Also, in case of a repeat, dropping all ambiguous alignments could leave a hole in the assembly.

dpark01 commented 5 years ago

Oh wow... you delved into some of the more complicated code that actually does some of the science. In particular, I think this prior limitation was simply because I didn't have the time to think about how to implement the general case at the time. Let me stare at it a bit.

Is the only appropriate/needed test around this scenario the part that you modified in the expected output of the pre-existing unit test?

notestaff commented 5 years ago

I can add more tests, that involve actual repeats in the reference. (In the existing test, the ambiguity was due to a misassembly.)

notestaff commented 5 years ago

"is this less about multiple contigs aligning to the same region of the reference and more about multiple alignments from repetitive regions in a single contig hitting the same region of a reference" -- right. The former case is still handled by contig_chooser.

"it's still pretty conservative" -- that's by design; I assumed the original check was meant to prevent some undesirable scenarios, so did not want to relax it too much. The relaxed check seems to be enough for the misassemblies seen in @tomkinsc 's / Kayla's ebov data. But it's possible that more assembly holes could be closed by relaxing the check, and/or by considering ambiguous aligns together with unambiguous ones (since a misjoined contig is really two contigs).

dpark01 commented 5 years ago

Maybe just for record keeping, can you link or describe examples from the EBOV data you're seeing? If this is primarily about the scenario when multiple parts of the same contig align to the same region of a reference, then it's always about repetitive sequence (either because it's biologically repetitive, or because of a misassembly that produces that).

I appreciate the conservatism. That said, the real reason this scenario wasn't supported in the past was just implementation laziness. But if these parameters recover everything we need to recover in the data you're seeing, that's good enough. Or ... would it make any sense to expose them as knobs for the user? Eh... maybe later if the need arises.

So, if you're saying this scenario gets triggered because two contigs got misjoined by the assembler, this PR would help effectively pull them apart?

notestaff commented 5 years ago

"this scenario gets triggered because two contigs got misjoined by the assembler, this PR would help effectively pull them apart" -- two contigs from the same genome part. (And to really pull them apart, maybe we should add both variants to alt_seqs, together with variants from non-ambiguous aligns, and let contig_chooser choose.) For misjoined contigs that pull together two different genome parts (as opposed to repeating the same part twice), there's still the problem that show_tiling uses each contig at most once, so one of the genome parts may get lost. So it may make sense to also detect such contigs and either break them or duplicate them before show_tiling.