broadinstitute / viral-assemble

viral-ngs: genome assembly and scaffolding

Evaluate RagTag for gap filling scaffolded de novo assemblies #32

Open tomkinsc opened 1 year ago

tomkinsc commented 1 year ago

@golu099 brought up GAPPadder as a tool to potentially replace Gap2Seq for filling gaps in sequence coverage between scaffolded contigs of de novo assemblies. We should evaluate it, potentially using synthetic read sets generated by something like wgsim, likely combining reads from multiple similar genomes to approximate sequencing a sample from a mixed viral population.
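One way the mixed-population read sets could be set up is sketched below. It only builds the wgsim command lines (it does not run them), splitting a total read-pair budget across genomes by mixture fraction. The FASTA names, the fractions, the output directory, and the helper name `wgsim_commands` are all hypothetical; `-N`, `-1`, `-2`, and `-S` are wgsim's options for read-pair count, read lengths, and random seed.

```python
from pathlib import Path


def wgsim_commands(genome_fractions, total_pairs, read_len=150, seed=42, out_dir="sim"):
    """Build one wgsim argv per genome; pair counts follow the mixture fractions.

    genome_fractions maps a FASTA path to its relative abundance
    (hypothetical inputs, fractions do not need to sum to 1).
    """
    total = sum(genome_fractions.values())
    commands = []
    for fasta, fraction in genome_fractions.items():
        n_pairs = round(total_pairs * fraction / total)
        stem = Path(fasta).stem
        commands.append([
            "wgsim",
            "-N", str(n_pairs),    # number of read pairs for this genome
            "-1", str(read_len),   # length of the first read
            "-2", str(read_len),   # length of the second read
            "-S", str(seed),       # fixed seed so the simulation is repeatable
            fasta,
            str(Path(out_dir) / f"{stem}_1.fq"),
            str(Path(out_dir) / f"{stem}_2.fq"),
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical 90/10 mixture of two similar genomes.
    for cmd in wgsim_commands({"strainA.fasta": 0.9, "strainB.fasta": 0.1}, 10000):
        print(" ".join(cmd))
```

The per-genome FASTQ files would then be concatenated (all `_1.fq` files together, all `_2.fq` files together) to form the mixed sample that is assembled and gap-filled.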

We may also want to take a look at some of the other scaffolding/gap filling tools, like RagTag (NB: RagTag fills gaps in de novo assemblies using sequence data from assemblies, not from reads).

dpark01 commented 3 months ago

Some notes from @ammaraziz on slack about ragtag:

Use -r to infer the gap sizes, or it will add 100bp gaps. When gaps are inferred, set the minimum gap length to 2 with -g; there is an issue with a gap length of 1 (for some reason the absolute minimum is 2). Everything else was left on default, and I used minimap2 as the aligner. One very annoying issue is that if your contigs overlap, ragtag will add 100bp gap (irrespective of the above settings).
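For reference, the settings in these notes could be assembled into a command line roughly as follows. This is a sketch that assumes the notes refer to the `ragtag.py scaffold` subcommand; the file names, output directory, and the helper name `ragtag_scaffold_command` are hypothetical, and the minimum of 2 for `-g` is enforced only because the notes report it, not because it has been verified here.

```python
def ragtag_scaffold_command(reference, contigs, out_dir="ragtag_output",
                            min_gap=2, aligner="minimap2"):
    """Build a `ragtag.py scaffold` argv using the settings from the notes above."""
    if min_gap < 2:
        # Reported in the notes: a minimum gap length of 1 causes problems.
        raise ValueError("min_gap must be at least 2")
    return [
        "ragtag.py", "scaffold",
        "-r",                   # infer gap sizes instead of fixed 100bp gaps
        "-g", str(min_gap),     # minimum inferred gap size
        "--aligner", aligner,   # minimap2, as used in the notes
        "-o", out_dir,
        reference,              # reference assembly (hypothetical path)
        contigs,                # de novo contigs to scaffold (hypothetical path)
    ]


if __name__ == "__main__":
    print(" ".join(ragtag_scaffold_command("reference.fasta", "contigs.fasta")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where RagTag and minimap2 are installed.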

ammaraziz commented 3 months ago

I did some more testing yesterday and ran into this issue again:

One very annoying issue is that if your contigs overlap, ragtag will add 100bp gap (irrespective of the above settings)

I now think this behavior changes depending on whether the -r and -g options are set. If they're not set, ragtag adds a 100bp gap. If they are set, it will not scaffold the second overlapping fragment.

An aside: The overlaps occur with spades (skesa doesn't have this problem). It's related to the max kmer in the k range, where the overlap is kmer-2 in length. I keep increasing the kmer size (up to 103 now) but this issue keeps rearing its ugly head. My samples could contain mixtures (either quasi viruses or coinfection) but it's hard to tell.

Hope that helps!
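For anyone wanting to check whether adjacent contigs overlap before handing them to a scaffolder, here is a minimal sketch. It only detects exact suffix/prefix matches, so overlaps that differ by even one base (plausible in a mixed sample) would be missed. The function names and the toy sequences are made up for illustration, and this is not what ragtag does internally.

```python
def suffix_prefix_overlap(left, right, min_len=1):
    """Return the length of the longest suffix of `left` that is a prefix of `right`.

    Only exact matches of at least `min_len` bases count; returns 0 otherwise.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, max(min_len, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_overlapping(left, right, min_len=1):
    """Join two contigs, collapsing an exact end overlap if one is found."""
    size = suffix_prefix_overlap(left, right, min_len)
    return left + right[size:]


if __name__ == "__main__":
    a = "ACGTTGCA"  # toy contig ending in TGCA
    b = "TGCAGGTT"  # toy contig starting with TGCA
    print(suffix_prefix_overlap(a, b))  # 4
    print(merge_overlapping(a, b))      # ACGTTGCAGGTT
```

With real SPAdes output, `min_len` could be set near the expected overlap (max kmer minus 2, per the comment above) to avoid counting short chance matches.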