Closed by Arkhaan 6 years ago
The final.genome.scf.fasta is the final output of the assembly. The difference between this file and genome.scf.fasta in 9-terminator is that redundant scaffolds are detected and removed.
Thank you for the clarification!
Dear @alekseyzimin, I have checked these files for my example and something caught my attention.
I checked three files: final.genome.scf.fasta, genome.scf.fasta and genome.ctg.fasta. As I expected, genome.ctg.fasta contains no Ns. Also, genome.scf.fasta has fewer sequences than genome.ctg.fasta and contains Ns (2%). But, surprisingly, final.genome.scf.fasta contains absolutely no Ns, contains more sequences than genome.scf.fasta and has an N50 value about 10 kb smaller.
I wonder whether the clustering step that reduces redundancy is also filtering out or clustering the scaffolds that contain Ns. Or do you think there is another explanation?
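For anyone who wants to repeat this comparison, here is a minimal sketch of how the numbers above (sequence count, total length, fraction of Ns, N50) can be computed for each FASTA file. It is not part of MaSuRCA; the function names `read_fasta`, `n50` and `assembly_stats` are my own, and N50 is taken here as the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the total.

```python
import sys


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                name = fields[0] if fields else ""
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(path):
    """Return sequence count, total length, N count, N fraction and N50."""
    lengths, n_count = [], 0
    for _, seq in read_fasta(path):
        lengths.append(len(seq))
        n_count += seq.upper().count("N")  # counts both 'N' and 'n'
    total = sum(lengths)
    return {
        "sequences": len(lengths),
        "total_length": total,
        "n_count": n_count,
        "n_fraction": n_count / total if total else 0.0,
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Pass the FASTA files to compare as command-line arguments.
    for fasta_path in sys.argv[1:]:
        print(fasta_path, assembly_stats(fasta_path))
```

Running it on final.genome.scf.fasta, genome.scf.fasta and genome.ctg.fasta side by side shows whether the drop in N content and N50 matches the change in sequence count and total length.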
P.S.: Great job with MaSuRCA! It is amazing and it ran really fast for me: an assembly with Illumina paired-end, Nanopore and PacBio reads for an estimated 1.8 Gbp genome finished in a couple of weeks on a server with 16 CPUs and 256 GB of RAM.
Thank you very much. Jose F.
After running MaSuRCA, I found that the statistics in CA.mr.41.15.15.0.02.log are computed from the FASTA file in the 9-terminator folder. The MaSuRCA documentation says the output is final.genome.scf.fasta, but the total length of that file is lower than that of the file in the 9-terminator folder.
I'm a bit confused as to which assembly is the correct one, and why there is a difference in length (and associated stats) between these two assemblies.
Cheers, Martin Binet