alekseyzimin / masurca


configuration issue leading to poor results? #79

Open waalkes opened 5 years ago

waalkes commented 5 years ago

I am running a paired-end assembly of a known genome: Salmonella enterica strain LT2 (CP014051.2), ~4.9 Mb in size.

I am assembling at 20x, 50x, 100x, 300x and 600x coverage.

The results I am getting are strange. I am using the basic configuration, changing only the read file locations and, since I have no jumping libraries, commenting out the JUMP line.

At 20x the assembler scaffolds like crazy, creating a 6.8 Mb assembly of which 1.8 Mb is scaffold gaps (runs of N). This is without any jumping libraries. At 50x to 600x the assembly size is much more reasonable (4.94-5.1 Mb). We often have only 20x coverage, so I need to understand whether changing parameters can avoid this massive scaffolding. Thanks
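For anyone comparing runs the same way: subtracting the gaps from the 20x result gives 6.8 Mb - 1.8 Mb = 5.0 Mb of actual sequence, which is close to the expected genome size, so the excess appears to be almost entirely N padding. A minimal Python sketch for getting these numbers from a FASTA file follows. The function name `assembly_gap_stats` and the file name in the usage note are hypothetical, not part of MaSuRCA.

```python
def assembly_gap_stats(fasta_text):
    """Return (total_bases, n_bases, non_n_bases) for FASTA-formatted text."""
    total = 0
    n_count = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip header lines and blank lines
        total += len(line)
        # Count gap characters regardless of case.
        n_count += line.upper().count("N")
    return total, n_count, total - n_count


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python gap_stats.py scaffolds.fasta
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            total, n_bases, non_n = assembly_gap_stats(handle.read())
        print(f"total={total} N={n_bases} non-N={non_n}")
```

Running it on the scaffold FASTA from each coverage level makes it easy to see whether the non-N size stays near 4.9 Mb while only the gap total changes.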

Lanilen commented 5 years ago

In general, assemblers that estimate coverage, k-mer distribution, etc. on their own tend to do this job poorly when coverage is either low or extremely high. Try setting the k-mer size manually to a small value for the 20x data (and I mean really small, 31 or so), then move it up a few steps at a time to study the assembler's behaviour.

The extra gappiness and excess N stretches could come from a failure to estimate insert sizes properly for the paired-end library when the low-coverage contigs are poorly constructed. If that's the case, a small k-mer is an easy fix. If not, then my advice won't help you at all...
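The k-mer sweep suggested above could be scripted roughly like this. It is a sketch under assumptions: it assumes the configuration file contains a `GRAPH_KMER_SIZE = auto` line, as in the sample config shipped with MaSuRCA (check your version), and the helper names `set_graph_kmer` and `kmer_sweep` and the step size of 10 are made up for illustration.

```python
import re

# Assumption: the config has a line of the form "GRAPH_KMER_SIZE = <value>".
_KMER_LINE = re.compile(r"^([ \t]*GRAPH_KMER_SIZE[ \t]*=[ \t]*)\S+", re.MULTILINE)


def set_graph_kmer(config_text, k):
    """Return config_text with the GRAPH_KMER_SIZE value replaced by k."""
    new_text, replaced = _KMER_LINE.subn(lambda m: m.group(1) + str(k), config_text)
    if replaced == 0:
        raise ValueError("no GRAPH_KMER_SIZE line found in config")
    return new_text


def kmer_sweep(config_text, start=31, stop=61, step=10):
    """Map each k in the sweep to a config text that uses that k."""
    return {k: set_graph_kmer(config_text, k) for k in range(start, stop + 1, step)}


if __name__ == "__main__":
    example = "PARAMETERS\nGRAPH_KMER_SIZE = auto\nEND\n"
    for k, text in kmer_sweep(example).items():
        # In practice, write each text to its own directory and run the
        # assembler there, then compare assembly size and N content.
        print(k, repr(text))
```

Each generated config would go in its own working directory so the runs do not overwrite each other, and the resulting assemblies can then be compared by total size and gap content.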