ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Bubbles and repeats #662

Closed: snayfach closed this issue 3 years ago

snayfach commented 3 years ago

I'm using MetaviralSPAdes to identify circular viral contigs that contain direct terminal repeats (DTRs). My understanding is that DTRs can occur as a result of a cycle in the graph as well as a bubble (due to a repetitive sequence). Is there an easy way to identify and exclude the latter from the (Meta)viralSPAdes output?

Thanks, Stephen

asl commented 3 years ago

Tagging @Dmitry-Antipov

Dmitry-Antipov commented 3 years ago

Hi. Not sure that I understand your question correctly - what exactly do you want to exclude from metaviralSPAdes output?

BTW, we are thinking about adding information about the location of potential TDRs in circular contigs, which may correspond to linear viruses with large TDRs; such cases can be determined from read coverage. But this is not implemented yet.

snayfach commented 3 years ago

Sorry, let me rephrase. In the MegaHIT output FASTA file, contigs are labeled with a flag of (3=cycle, 2=unconnected linear, and 0=connected linear). I was wondering if there was an easy way of extracting similar information from the meta(viral)SPAdes output. I'm looking for circular contigs and would like to exclude anything that was linear or connected to another contig in the assembly graph.

Regarding the latter point, labeling the start location on circular contigs by read mapping is a great idea. Even better, the assembler could use this information to set the cut point so that the ends of the contig correspond to the true genome start/stop. Right now an additional step is required to rotate the circular sequence. I've tried the read mapping analysis, and in my experience, you can often clearly see a single position in the genome where there is a massive enrichment of read starting points -- this likely corresponds to the end of a linear genome that has been circularized by the assembler.
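To make the read-start idea concrete, here is a minimal Python sketch, not anything SPAdes provides. It assumes you have already extracted 0-based read start coordinates for one circular contig from your own read mapping. The function names and the `min_fold` threshold of 10 are hypothetical choices for illustration, and strand handling is ignored.

```python
from collections import Counter
from typing import Iterable, Optional


def peak_start_position(read_starts: Iterable[int],
                        contig_length: int,
                        min_fold: float = 10.0) -> Optional[int]:
    """Return the position with the most read starts if it stands out.

    A position "stands out" when its count is at least `min_fold` times
    the mean number of read starts per base. `min_fold` is an arbitrary
    illustrative threshold, not a validated cutoff.
    """
    counts = Counter(read_starts)
    if not counts or contig_length <= 0:
        return None
    position, top_count = counts.most_common(1)[0]
    mean_per_base = sum(counts.values()) / contig_length
    if top_count >= min_fold * mean_per_base:
        return position
    return None


def rotate_circular(sequence: str, cut: int) -> str:
    """Rotate a circular sequence so that index `cut` becomes index 0."""
    cut %= len(sequence)
    return sequence[cut:] + sequence[:cut]


if __name__ == "__main__":
    # Toy data: one read start per base, plus a pile-up of 50 starts at 5.
    starts = [5] * 50 + list(range(20))
    cut = peak_start_position(starts, contig_length=20)
    if cut is not None:
        print(rotate_circular("AAAAACCCCCGGGGGTTTTT", cut))
```

If no position passes the threshold, the function returns `None`, which would be consistent with a genuinely circular genome or with insufficient coverage to tell.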

Thanks, Stephen

Dmitry-Antipov commented 3 years ago

> I'm looking for circular contigs and would like to exclude anything that was linear or connected to another contig in the assembly graph.

For the metaviral pipeline, we report whether a contig is circular in the .fasta headers: you can search for "type_circular". To exclude anything that was connected to other contigs in the assembly graph, you should also keep only contigs with "_cutoff_0" in their headers, but with this restriction metaviralSPAdes becomes nearly equivalent to regular SPAdes with some options tweaked.
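The header filter described above could be sketched in Python roughly as follows. Only the two substrings "type_circular" and "_cutoff_0" come from this thread. The rest of the header layout in the demo records is made up for illustration, and the check that "_cutoff_0" is not followed by another digit or a dot is a defensive assumption rather than documented behaviour.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: reject headers where "_cutoff_0" is only the prefix of a
# longer number (for example a hypothetical "_cutoff_05").
CUTOFF_ZERO = re.compile(r"_cutoff_0(?![\d.])")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_isolated_circular(header: str) -> bool:
    """True if the header has both markers mentioned in the thread."""
    return "type_circular" in header and bool(CUTOFF_ZERO.search(header))


def filter_circular(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose headers pass is_isolated_circular."""
    for header, sequence in read_fasta(lines):
        if is_isolated_circular(header):
            yield header, sequence


if __name__ == "__main__":
    # Hypothetical headers; real metaviralSPAdes headers may differ.
    demo = [
        ">contig_A_cutoff_0_type_circular", "ACGT", "ACGT",
        ">contig_B_cutoff_5_type_circular", "ACGT",
        ">contig_C_cutoff_0_type_linear", "ACGT",
    ]
    for header, sequence in filter_circular(demo):
        print(f">{header}\n{sequence}")
```

To run it on a real file, pass an open file handle instead of the `demo` list, since `read_fasta` accepts any iterable of lines.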

snayfach commented 3 years ago

Got it, thanks!