alekseyzimin / masurca

GNU General Public License v3.0

Difference between "final.genome.scf.fasta" and "9-terminator/genome.scf.fasta" #47

Closed: Arkhaan closed this issue 6 years ago

Arkhaan commented 6 years ago

After running MaSuRCA, I found that the statistics in CA.mr.41.15.15.0.02.log are computed from the FASTA file in the 9-terminator folder. The MaSuRCA documentation says the output is final.genome.scf.fasta, but the total length of that file is lower than that of the file in the 9-terminator folder.

I'm a bit confused as to which assembly is the correct one, and why there is a difference in length (and associated stats) between these two assemblies.

Cheers, Martin Binet

alekseyzimin commented 6 years ago

The final.genome.scf.fasta is the final output of the assembly. The difference between this file and genome.scf.fasta in 9-terminator is that redundant scaffolds are detected and removed.

Arkhaan commented 6 years ago

Thank you for the clarification!

JFsanchezherrero commented 6 years ago

Dear @alekseyzimin, I have checked these files for my example and something caught my attention.

I have checked three files: final.genome.scf.fasta, genome.scf.fasta and genome.ctg.fasta. As I expected, genome.ctg.fasta contains no Ns. Also, genome.scf.fasta has fewer sequences than genome.ctg.fasta and contains Ns (2%). But, surprisingly, final.genome.scf.fasta contains absolutely no Ns, contains more sequences than genome.scf.fasta and has an N50 about 10 kb smaller.
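
For anyone who wants to repeat this comparison, here is a minimal sketch of how the numbers above (sequence count, total length, fraction of Ns, N50) could be computed. It is not part of MaSuRCA; the function names `sequence_stats`, `n50` and `summarize` are made up for this example, and it assumes plain uncompressed FASTA input. N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total length.

```python
def sequence_stats(lines):
    """Return a list of (length, n_count) tuples, one per FASTA record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    records = []
    length = n_count = 0
    in_record = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record:
                records.append((length, n_count))
            length = n_count = 0
            in_record = True
        elif in_record:
            length += len(line)
            n_count += line.upper().count("N")
    if in_record:
        records.append((length, n_count))
    return records


def n50(lengths):
    """Length L such that sequences of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lines):
    """Summary statistics for one FASTA file given as an iterable of lines."""
    records = sequence_stats(lines)
    lengths = [length for length, _ in records]
    total = sum(lengths)
    n_total = sum(n for _, n in records)
    return {
        "sequences": len(records),
        "total_length": total,
        "n_fraction": n_total / total if total else 0.0,
        "n50": n50(lengths),
    }
```

Calling `summarize(open(path))` on each of the three files and comparing the resulting dictionaries side by side should reproduce the kind of differences described here.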

I wonder whether the clustering step that reduces redundancy is filtering out or clustering the scaffolds that contain Ns. Or do you think there is another explanation?

P.S.: great job with the MaSuRCA assembler! It is amazing and it worked really fast for me! An assembly from Illumina paired-end, Nanopore and PacBio data, for an estimated genome size of 1.8 Gbp, was done in a couple of weeks on a server with 16 CPUs and 256 GB RAM.

Thank you very much. Jose F.