BGI-Qingdao / TGS-GapCloser

A gap-closing software tool that uses long reads to enhance genome assembly.
GNU General Public License v3.0

Potential misassembly? #44

Closed MattHuff closed 2 years ago

MattHuff commented 2 years ago

Good Morning,

Our lab used TGS-GapCloser on a genome we are working on after a previous program, Dentist, failed to close the existing gaps effectively. Looking at the alignment of our input reads to the TGS-GapCloser output, we are mostly satisfied with the gap closing, but we noticed an oddity at one of the closed gaps. Only a single HiFi read aligned back to this section of the genome; all other reads show a deletion relative to the genome at this closed gap. We have 213x coverage with these HiFi reads, and the genome we ran gap closing on was all but complete. If you need any additional information, such as images from IGV, please let me know.
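For anyone who wants to quantify this kind of observation instead of reading it off IGV, here is a minimal sketch of the check. It assumes you have already extracted each read's 0-based reference start and CIGAR string from the alignment over the closed gap. The function names, the 0-based half-open coordinates, and the 50 bp `min_del` threshold are illustrative assumptions, not part of TGS-GapCloser.

```python
import re

# CIGAR operation codes as defined in the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def walk_cigar(ref_start, cigar):
    """Return (deletions, ref_end) for one alignment.

    deletions is a list of (start, end) reference intervals,
    0-based and half-open, one per D operation.
    """
    pos = ref_start
    deletions = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "D":
            deletions.append((pos, pos + length))
        if op in "MDN=X":  # operations that consume the reference
            pos += length
    return deletions, pos


def classify_read(ref_start, cigar, gap_start, gap_end, min_del=50):
    """Classify one read relative to the closed gap [gap_start, gap_end).

    min_del is an assumed threshold: shorter deletions are treated
    as noise and ignored.
    """
    deletions, ref_end = walk_cigar(ref_start, cigar)
    if ref_start > gap_start or ref_end < gap_end:
        return "not_spanning"
    for d_start, d_end in deletions:
        overlaps = d_start < gap_end and d_end > gap_start
        if overlaps and d_end - d_start >= min_del:
            return "deletion"
    return "supports"


def summarize(alignments, gap_start, gap_end, min_del=50):
    """Count reads per class for an iterable of (ref_start, cigar) pairs."""
    counts = {"supports": 0, "deletion": 0, "not_spanning": 0}
    for ref_start, cigar in alignments:
        counts[classify_read(ref_start, cigar, gap_start, gap_end, min_del)] += 1
    return counts
```

A result with one read under `supports` and nearly all others under `deletion` would match the pattern described above.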

adonis316 commented 2 years ago

Hi MattHuff, Thank you for letting us know about this problem. TGS-GapCloser was initially designed for gap closing with a low coverage depth of long reads, where sensitivity (how many gaps can be closed) was the first priority. However, we have noticed that more and more researchers are using high coverage as the price of long reads falls. Thus, more attention needs to be paid to precision (how many closed gaps are true sequences of the target genome).

A gap can be closed incorrectly because of a scaffolding misassembly or a false alignment relationship. How likely this is depends on the polyploidy, genome size, proportion of repeats, and heterozygosity of the target genome, as well as the base-calling accuracy and read length of the PacBio or Oxford Nanopore long reads. To avoid this situation, we added three parameters:

Too few or too many long reads supporting the same connection can cause type I errors (falsely closed gaps). In our tests, increasing --min_nread or decreasing --max_nread increases the positive predictive value to some extent, at the expense of sensitivity. You can choose the values based on your sequencing coverage.
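As a rough illustration of choosing these values from coverage, the sketch below derives a lower and an upper bound as fractions of the mean coverage. The helper name `nread_bounds` and the fractions `low_frac=0.1` and `high_frac=2.0` are assumptions for the example only, not recommended or default values; tune them against your own data.

```python
import math


def nread_bounds(coverage, low_frac=0.1, high_frac=2.0):
    """Suggest (min_nread, max_nread) from mean read coverage.

    low_frac and high_frac are assumed, illustrative fractions:
    connections supported by far fewer reads than the coverage are
    likely spurious, and those supported by far more are likely repeats.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    min_nread = max(1, math.floor(coverage * low_frac))
    max_nread = max(min_nread, math.ceil(coverage * high_frac))
    return min_nread, max_nread


if __name__ == "__main__":
    # 213x is the HiFi coverage reported earlier in this thread.
    lo, hi = nread_bounds(213)
    print(f"--min_nread {lo} --max_nread {hi}")
```

Tightening the window (a larger `low_frac` or a smaller `high_frac`) trades closed gaps for fewer false closures, as described above.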

Thanks, Mengyang