alekseyzimin / masurca

GNU General Public License v3.0

final.genome.scf.fasta vs genome.scf.fasta #72

Open JFsanchezherrero opened 6 years ago

JFsanchezherrero commented 6 years ago

Dear @alekseyzimin, I commented a few days ago on a closed issue, and I wonder whether you may have missed it: https://github.com/alekseyzimin/masurca/issues/47.

I also checked the genome.scf, genome.ctg and final.genome.scf files for my example, and something caught my attention.

As I expected, the genome.ctg.fasta file contains no Ns. Also as expected, genome.scf.fasta has fewer sequences than genome.ctg.fasta and contains Ns (~2%). Surprisingly, however, final.genome.scf.fasta contains no Ns at all, has more sequences than genome.scf.fasta, and has an N50 about 10 kb smaller. In fact, it looks like the genome.ctg.fasta file.

I have attached an image showing my genome statistics and BUSCO values for each file. Capture gaps are strings of more than 50 consecutive Ns. Keep in mind that this genome belongs to a subphylum that is underrepresented in the BUSCO data sets, which explains the low BUSCO values.

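[image: stats] https://user-images.githubusercontent.com/20244642/46750350-071a1a00-ccb8-11e8-8114-27ebea86adb8.png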
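For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch of the statistics mentioned above: number of sequences, N50, fraction of Ns, and number of gaps defined as runs of more than 50 consecutive Ns. The function names (`read_fasta`, `n50`, `assembly_stats`) are my own for illustration and are not part of MaSuRCA or BUSCO.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(records, min_gap=51):
    """Summarize an assembly given (name, sequence) pairs.

    A gap is a run of at least `min_gap` consecutive Ns, so the default
    of 51 matches "more than 50 consecutive Ns".
    """
    gap_re = re.compile(r"[Nn]{%d,}" % min_gap)
    lengths, n_bases, gaps = [], 0, 0
    for _, seq in records:
        lengths.append(len(seq))
        n_bases += seq.count("N") + seq.count("n")
        gaps += len(gap_re.findall(seq))
    total = sum(lengths)
    return {
        "sequences": len(lengths),
        "total_bp": total,
        "n50": n50(lengths),
        "n_fraction": n_bases / total if total else 0.0,
        "gaps": gaps,
    }
```

To compare the three files, open each one and pass the file handle through `read_fasta` into `assembly_stats`, for example `assembly_stats(read_fasta(open("genome.scf.fasta")))`.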

I wonder whether the clustering step that reduces redundancy between genome.scf and final.genome.scf is filtering out, or clustering away, the scaffolds that contain Ns. Or do you think there is another explanation?

P.S.: Great job with MaSuRCA! It is amazing, and it worked really fast for me: an assembly from Illumina paired-end, Nanopore and PacBio reads, for an estimated genome size of 1.8 Gbp, finished in a couple of weeks on a server with 16 CPUs and 256 GB of RAM.

Thank you very much. Jose F.

alekseyzimin commented 6 years ago

Hi,

Thank you for your comment and question. In the new MaSuRCA 3.2.8, final.genome.scf.fasta is indeed a contig file. In 3.2.8 I close all scaffold gaps that are spanned by long reads (PacBio/Nanopore), and any gap that cannot be closed breaks the scaffold, because such gaps may be misassemblies. By design of the method, all scaffold gaps must be spanned by long reads.

The final file to use is final.genome.scf.fasta.
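To illustrate the effect of breaking scaffolds at unclosed gaps, here is a minimal Python sketch. It is not the MaSuRCA code, only a toy version of the idea: every remaining run of Ns splits a scaffold into separate contigs, which is why the result has no Ns, more sequences, and a lower N50. The function name `split_at_gaps` and the `_1`, `_2` naming of the pieces are assumptions made for this example.

```python
import re


def split_at_gaps(name, seq, min_gap=1):
    """Split one scaffold into contigs at every run of >= min_gap Ns.

    Returns a list of (contig_name, contig_sequence) pairs. The N runs
    themselves are discarded. A scaffold without any qualifying gap is
    returned unchanged under its original name.
    """
    gap_re = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that arise from leading or trailing N runs.
    pieces = [piece for piece in gap_re.split(seq) if piece]
    if len(pieces) == 1 and pieces[0] == seq:
        return [(name, seq)]
    return [("%s_%d" % (name, i), piece) for i, piece in enumerate(pieces, 1)]
```

For example, `split_at_gaps("scf1", "ACGTNNNNNGGCC")` yields two contigs, `scf1_1` with `ACGT` and `scf1_2` with `GGCC`.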

Best, Aleksey

-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 http://www.genome.umd.edu http://masurca.blogspot.com

JFsanchezherrero commented 6 years ago

Thanks for such a quick response!

I am using MaSuRCA 3.2.8 by the way.

What intrigues me is that I obtain better BUSCO statistics for the genome.scf.fasta file. It has many more complete (single-copy and duplicated) and fragmented BUSCOs at that step, which is always a sign of completeness and of a long assembly. In fact, there is a 10 kb difference in N50.

I have a couple more questions. Is there no gap-filling step with Nanopore reads before that step?

Also, I wonder why you break the scaffolds at all remaining gaps. Although it is true that they could be misassemblies, when you combine jumping libraries, long reads and paired-end short reads, you expect the best and most contiguous assembly even if it contains some gaps. In fact, in this case I had only 16k gaps and 1.6% Ns.

Thanks Jose F

alekseyzimin commented 6 years ago

Hi,

The gap-filling step takes place in 10-gapclose. The gap-filled contigs, before filtering for redundancy, are in 10-gapclose/genome.scf.fasta. To arrive at final.genome.scf.fasta, I map the contigs against themselves and remove contigs that are contained in other contigs with high identity (>=97%). The reason for that is twofold: (i) to remove haplotype copies in heterozygous regions, and (ii) to fix assembler artefacts where the assembler for some reason decided to split a contig into two, with the smaller one contained in the larger one, producing spurious redundancy. This procedure is not perfect, but it does more good than damage. When you run BUSCO on 10-gapclose/genome.scf.fasta, you will likely see a larger number of duplicated BUSCOs than in final.genome.scf.fasta.
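As a rough illustration of the containment filter, here is a minimal Python sketch. It is not the actual MaSuRCA implementation: it assumes the self-alignments have already been computed by some aligner and reduced to (query, target, identity, query coverage) records. The `Hit` record, the function name `contained_contigs`, and the `min_query_cov` threshold are all assumptions for this example; only the >=97% identity figure comes from the description above.

```python
from collections import namedtuple

# One self-alignment hit: `query` aligned to `target`.
# identity and query_cov are fractions in [0, 1]; query_cov is the share
# of the query contig that the alignment covers.
Hit = namedtuple("Hit", "query target identity query_cov")


def contained_contigs(lengths, hits, min_identity=0.97, min_query_cov=0.95):
    """Return the set of contig names judged to be contained in another contig.

    lengths: dict mapping contig name -> contig length.
    min_identity follows the >=97% figure quoted above.
    min_query_cov is an illustrative assumption, not a MaSuRCA parameter.
    """

    def rank(name):
        # Longer contigs outrank shorter ones; names break ties so that
        # two identical contigs never remove each other.
        return (lengths[name], name)

    removed = set()
    for hit in hits:
        if hit.query == hit.target:
            continue  # ignore the trivial alignment of a contig to itself
        if hit.identity < min_identity or hit.query_cov < min_query_cov:
            continue  # not similar enough, or only partially covered
        if rank(hit.target) > rank(hit.query):
            removed.add(hit.query)
    return removed
```

The ranking by length means that a haplotype copy or a spurious split piece is dropped in favour of the larger contig that contains it, while the larger contig is never removed because of the smaller one.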

Best, Aleksey
