Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

chimeric mitochondrial-nuclear scaffolds #162

Open kanchond opened 1 year ago

kanchond commented 1 year ago

Question or Expected behavior

I have generated genome assemblies for two different species of butterfly. The assembly sizes are ~700-800 Mb after running purge_dups. In both assemblies there is a large chimeric scaffold, several Mbp in length, that contains the entire ~15 kb mitogenome embedded in it. The 15 kb mitogenome portion of each scaffold is 99.9-100% identical to the mitogenome assembled independently from Illumina data, so this is clearly a mis-assembly.

1) How can I avoid these chimeric scaffolds? Is the much higher expected coverage of the mitogenome not used to prevent this from happening?

2) The presence of this chimeric scaffold makes me worry that there may be other chimeric scaffolds involving only nuclear sequence that are not so easily detected.

Thanks, KD

Operating system
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core

GCC
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

Python
Python 3.7.4

NextDenovo
nextDenovo v2.5.0


moold commented 1 year ago
  1. You can filter out mitochondrial reads before assembly by mapping all reads to the mitogenome and removing those that align to it.
  2. In general, assembly errors cannot be completely avoided, but you can use Hi-C or Bionano data to split chimeric scaffolds.
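
The read-filtering suggestion in point 1 could be sketched roughly as follows. This is not part of NextDenovo. It assumes the long reads have already been aligned to the mitogenome with a mapper that writes PAF (minimap2 does so by default) and that the reads are in plain, uncompressed 4-line FASTQ. The function names and the `min_fraction` threshold are hypothetical choices for illustration. A read is dropped only when most of its length aligns to the mitogenome, so nuclear reads that merely contain a short mitochondrial-like insertion are kept; query intervals are merged across alignments so that reads spanning the origin of the circular mitogenome are still counted in full.

```python
from collections import defaultdict


def mito_read_ids(paf_lines, min_fraction=0.8):
    """Return IDs of reads whose merged aligned span covers >= min_fraction of the read.

    PAF columns used: 1 = query name, 2 = query length, 3 = query start, 4 = query end.
    min_fraction is an illustrative threshold, not a recommended value.
    """
    spans_by_read = defaultdict(list)
    read_length = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip malformed or empty lines
            continue
        name = fields[0]
        read_length[name] = int(fields[1])
        spans_by_read[name].append((int(fields[2]), int(fields[3])))

    ids = set()
    for name, spans in spans_by_read.items():
        # Merge overlapping or adjacent query intervals before summing them.
        covered = 0
        cur_start = cur_end = None
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start
        if read_length[name] > 0 and covered / read_length[name] >= min_fraction:
            ids.add(name)
    return ids


def filter_fastq(fastq_in, fastq_out, drop_ids):
    """Copy 4-line FASTQ records to fastq_out, skipping reads in drop_ids.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:  # end of file
            break
        if not header.strip():  # tolerate stray blank lines
            continue
        record = [header] + [fastq_in.readline() for _ in range(3)]
        read_id = header[1:].split()[0]  # ID is the text after '@' up to whitespace
        if read_id in drop_ids:
            dropped += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, dropped
```

With hypothetical file names, this would be used as `ids = mito_read_ids(open("reads_vs_mito.paf"))` followed by `filter_fastq(open("reads.fastq"), open("reads.nomito.fastq", "w"), ids)`, and the filtered FASTQ then given to the assembler. Comparing the dropped count with the expected mitochondrial read depth is a reasonable sanity check before reassembling.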