Algorithm for secondary de novo genome assembly guided by closely related references
Extended Output #31

Thanks for developing this. I have a question about the output files; after reading the README, I am still unsure about the expected output.

I am using PE150 MiSeq reads and performed a SPAdes assembly (its output is *.scaffolds.fasta). The resulting genomes are highly fragmented but have >99% ANI to a reference genome that I previously sequenced to completion with PacBio (complete genome; 3.77 Mb). However, when I use AlignGraph, the output files confuse me.

One example:

Sequence_ID           Total_Contigs  Genome_length  Largest_Contig  n50     GC_Percent
Desert-2-3.extended   16             3335782        1108581         343422  70
Desert-2-3.remain     19             707214         201094          182646  71
Desert-2-3.scaffolds  73             3740981        508547          119510  71

It appears that I would need to concatenate the extended.fasta file with the remaining.fasta file to get the desired genome. Is that correct? Any clarification would be great.

Here is the command I am using:

AlignGraph --read1 $OUTDIR/${mate}_R1_001.fasta --read2 $OUTDIR/${mate}_R2_001.fasta \
--fastMap --contig $genome.scaffolds.fasta --genome $REFGENOME \
--distanceLow 550 --distanceHigh 1550 \
--extendedContig $genome.extended.fa --remainingContig $genome.remain.fa 

Thank you!

Sorry for the late reply. Yes, the extended.fasta file contains the contigs extended by AlignGraph, and the remaining.fasta file contains the contigs that were not extended.
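
If the two files are meant to be combined into one set of contigs (this reply does not state that explicitly), a minimal sketch might look like the following. The function names and the combined output file name are hypothetical and are not part of AlignGraph; the only assumption about the inputs is that they are ordinary FASTA files whose header lines start with `>`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of AlignGraph.

# combine_contigs EXTENDED REMAINING OUTPUT
# Write the records of both FASTA files into OUTPUT.
# `awk 1` prints every input line followed by a newline, so a missing
# final newline in the first file cannot glue two records together.
combine_contigs() {
    awk 1 "$1" "$2" > "$3"
}

# count_contigs FASTA
# Print the number of records, i.e. the number of lines starting with '>'.
count_contigs() {
    grep -c '^>' "$1"
}
```

With the file names from the command above, this could be called as `combine_contigs "$genome.extended.fa" "$genome.remain.fa" "$genome.combined.fa"`, followed by `count_contigs "$genome.combined.fa"`. For the Desert-2-3 example, the count should equal 16 + 19 = 35 if every record from both files was carried over.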


So does this mean that, to get the final assembly, one has to combine the sequences in the remaining and extended contig files?