BGI-Qingdao / TGS-GapCloser

A gap-closing software tool that uses long reads to enhance genome assembly.
GNU General Public License v3.0

How to understand the unexpected gap closed output? #5

Closed: cai1991 closed this issue 4 years ago

cai1991 commented 4 years ago

Hello TGS-GapCloser team,

Thanks for developing this pipeline. I'm currently using this tool to close the gaps in my scaffolds, which were generated by Bionano hybrid scaffolding. I used my pre-assembled contigs (produced by MaSuRCA) to close the gaps, without error correction. I have some questions about the gap-filled results:

  1. There are some negative gaps in my scaffolds (13 bp in length in my case). I checked the alignments in Bionano Access and found overlaps between the flanking contigs. For this kind of gap, instead of merging the overlapping sequence, TGS-GapCloser inserted a fragment (for example, at a 23 Kb overlap in the scaffold, a 20 Kb fragment from a long read was inserted). In the example below, the 13 bp gap ran from 3520056 to 3520068 in the input scaffold. There are quite a few cases like this.

         >Super-Scaffold_100020
         1 3208497 S 1 3208497
         3208498 3518242 S 3210311 3520055
         3518243 3538539 F
         3538540 3884641 S 3520069 3866170
         3884642 4113877 F
         4113878 4199241 S 4091907 4177270

  2. I also found that Bionano estimated a 123 Kb gap, of which TGS-GapCloser filled 66 Kb. However, I didn't find any "N" left there (example below).

         >Super-Scaffold_267
         1 147545 S 1 147545
         147546 213699 F
         213700 423197 S 270648 480145

  3. The genome size increased by 22 Mb, while the total gap size estimated by Bionano is only ~6 Mb. TGS-GapCloser did close all of the gaps: no "N" is left.
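
For anyone checking similar output, the sizes quoted in points 1 and 2 can be recomputed from the rows themselves. The following is only a minimal sketch: it assumes that an `S` row lists the new start and end followed by the original start and end of a scaffold segment, and that an `F` row lists the new start and end of inserted sequence. That column meaning is inferred from the numbers in this post, not taken from the tool's documentation, and `gap_report` is a hypothetical helper name.

```python
from typing import List, Tuple


def gap_report(rows: str) -> List[Tuple[int, int]]:
    """Return (original_gap_bp, filled_bp) for each pair of consecutive S rows.

    Assumed layout (inferred from the rows in this thread, not from the docs):
      S row: new_start new_end S orig_start orig_end
      F row: new_start new_end F
    Rows of any other type are ignored.
    """
    report = []
    prev_orig_end = None
    filled = 0
    for line in rows.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith(">"):
            continue  # skip the header line and anything too short to parse
        new_start, new_end, kind = int(fields[0]), int(fields[1]), fields[2]
        if kind == "F":
            filled += new_end - new_start + 1
        elif kind == "S":
            orig_start, orig_end = int(fields[3]), int(fields[4])
            if prev_orig_end is not None:
                # gap in the input scaffold vs. sequence inserted in its place
                report.append((orig_start - prev_orig_end - 1, filled))
            prev_orig_end = orig_end
            filled = 0
    return report


scaffold_100020 = """
>Super-Scaffold_100020
1 3208497 S 1 3208497
3208498 3518242 S 3210311 3520055
3518243 3538539 F
3538540 3884641 S 3520069 3866170
3884642 4113877 F
4113878 4199241 S 4091907 4177270
"""

scaffold_267 = """
>Super-Scaffold_267
1 147545 S 1 147545
147546 213699 F
213700 423197 S 270648 480145
"""

print(gap_report(scaffold_100020))  # [(1813, 0), (13, 20297), (225736, 229236)]
print(gap_report(scaffold_267))     # [(123102, 66154)]
```

Under that assumed layout, the 13 bp gap received 20,297 bp of inserted sequence, and the 123,102 bp gap received 66,154 bp, which matches the figures described above.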

Are the observations above normal? How could they happen? I look forward to your reply, and thank you very much in advance.

Kind regards, Chengcheng

cchd0001 commented 4 years ago

Hi Chengcheng,  

  1. Explanation of the differences between "Bionano estimated gaps" and "TGS-GapCloser filled gaps": TGS-GapCloser uses the "Input Long Reads" to close gaps in the "Input Scaffolds". By default, it relies on the assembly information provided by the long reads. In your project, TGS-GapCloser applies assembly information from the "pre-assembled contigs (produced by MaSuRCA)" to close gaps in the "Bionano hybrid scaffolds", but uses no gap size information from the input scaffolds. So it depends on which assembly information you trust more: Bionano or the long reads?
  2. Our default application scenario is using error-prone TGS reads as "Input Reads". Thus, the default parameters might not be suitable for your high-quality assembled contigs. I would suggest that you try increasing thresholds such as the --min_match and --min_idy values.
  3. If a reference assembly is available, you can assess the final result against the reference and compare it with the input assembly. If not, try BUSCO.

Best wishes, Lidong

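
As a quick first check before a reference comparison or a BUSCO run, the input and output assemblies can be compared on total length and remaining "N" bases. This is only a rough sketch and no substitute for those assessments. `assembly_stats` is a hypothetical helper, and the two tiny sequences are made-up toy data rather than anything from this project.

```python
from typing import Dict, Iterable


def assembly_stats(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Count sequences, total bases and N bases in FASTA text given as lines."""
    n_seqs = total = n_bases = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_seqs += 1  # header line starts a new sequence
        else:
            total += len(line)
            n_bases += line.upper().count("N")
    return {"sequences": n_seqs, "total_bp": total, "n_bp": n_bases}


# Toy data: a 5 bp gap of N replaced by 7 bp of filled sequence.
before = assembly_stats([">s1", "ACGTNNNNNACGT"])
after = assembly_stats([">s1", "ACGTACGTTTGACGT"])

growth = after["total_bp"] - before["total_bp"]
n_removed = before["n_bp"] - after["n_bp"]
print(growth, n_removed)  # 2 5
```

An open file handle can be passed in place of the lists, since the function only iterates over lines. If the assembly grows by far more than the number of "N" bases removed, the filled sequences are longer than the gap estimates in the input scaffolds, which is the situation described in point 3 of the question.
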
cai1991 commented 4 years ago

Thanks for the suggestion. I will try more.

Kind regards, Chengcheng