BGI-Qingdao / TGS-GapCloser

A gap-closing software tool that uses long reads to enhance genome assembly.
GNU General Public License v3.0

huge gaps not filled #77

Closed gforg34 closed 8 months ago

gforg34 commented 8 months ago

Hi,

I used TGS-GapCloser to improve an 800 Mbp chromosome from a 10 Gb genome assembly, using 84 Gb of PacBio long-read data (9x). I also included the following arguments in the tgs command: --minmap_arg ' -x map-pb -K 80M' --thread 32 --racon, because otherwise it kept running out of memory.

I observed some reduction in the number of 'N's in my final output, but it wasn't as significant as I had hoped. Although many 'N's were successfully removed, sizable stretches of 'N's (150-300 kbp) remained unchanged. Could you explain why large segments of 'N's are still present in the final output after gap filling with TGS-GapCloser? What adjustments can be made to further improve the chromosome assembly?

I am asking because I noticed you used tgsgapcloser in a similar situation, to improve the genome of G. biloba, which is more or less comparable to mine (a large genome of around 10 Gb at the same coverage). What exactly did you do, and how did you manage to improve the genome (replacing N-containing regions with 411,608,879 bp of sequence)?

These are the assembly stats of my chromosome before and after.

stats for extracted_chr_old_reference.fasta
sum = 825750385, n = 1
N_count = 27,245,420
Gaps = 27060
-------------------
stats for chromosome.scaff_seqs
sum = 822675884, n = 1
N_count = 22,151,823
Gaps = 18719
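
For reference, the figures above correspond to about 5.1 Mbp of 'N' removed (roughly 19%) and 8,341 gaps closed (roughly 31%). The sketch below shows one way such counts could be computed, assuming a gap is counted as a maximal run of N characters; the function name `n_stats` is hypothetical and the tool that produced the stats above may define gaps differently.

```python
import re


def n_stats(seq: str) -> tuple[int, int]:
    """Return (N_count, gap_count) for one sequence.

    Assumption: a "gap" is a maximal run of N/n characters.
    """
    runs = re.findall(r"[Nn]+", seq)
    return sum(len(r) for r in runs), len(runs)


# Differences between the two stat blocks reported above.
n_removed = 27_245_420 - 22_151_823
gaps_closed = 27_060 - 18_719
print(f"N removed: {n_removed} ({n_removed / 27_245_420:.1%})")
print(f"Gaps closed: {gaps_closed} ({gaps_closed / 27_060:.1%})")
```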

Moreover, my genome assembly also has many unlocalized scaffolds. Could I use these scaffolds as reads to improve the genome assembly (just a thought)? Let me know what you think; any help would be highly appreciated.

adonis316 commented 8 months ago

This is a known issue with the current version of TGS-GapCloser and has been reported frequently. The huge memory consumption comes from the large size of the input long-read data. Because the tool was originally designed for low depths, its long-read alignment step cannot handle high depths. You can try TGS-GapCloser2 (https://github.com/BGI-Qingdao/TGS-GapCloser2). Its usage is the same as that of TGS-GapCloser, and it can dramatically reduce memory use. But note that it has not been fully tested.

Long-read length, accuracy, and depth are the key factors for gap-closing efficiency. The low depth (9x) of your data cannot guarantee that every genomic region is covered by at least one long read. I have no detailed information on your read length and accuracy. In addition, "--racon" would not work well at such a low depth, because racon-based error correction depends on overlaps between long reads.
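
To illustrate how read length and depth interact, here is a minimal sketch under simplifying assumptions that are not from this thread: all reads have the same length, read start positions are uniformly random, and a read must overlap each flank by a hypothetical `anchor` length to close a gap. The read lengths in the usage lines are made-up values, since the actual read-length distribution was not reported.

```python
import math


def expected_spanning_reads(depth: float, read_len: int, gap_len: int,
                            anchor: int = 500) -> float:
    """Expected number of reads that span a gap plus `anchor` bp on each side.

    Model assumptions: fixed read length, uniformly random start positions.
    """
    window = read_len - gap_len - 2 * anchor  # valid start positions per gap
    if window <= 0:
        return 0.0  # a single read can never span this gap
    return depth * window / read_len


def prob_gap_spanned(depth: float, read_len: int, gap_len: int,
                     anchor: int = 500) -> float:
    """Poisson approximation of P(at least one spanning read)."""
    lam = expected_spanning_reads(depth, read_len, gap_len, anchor)
    return 1.0 - math.exp(-lam)


# Hypothetical read lengths at the 9x depth mentioned above.
print(prob_gap_spanned(9, 20_000, 1_000))    # short gap
print(prob_gap_spanned(9, 20_000, 150_000))  # gap longer than the reads
```

Under this model, a gap longer than the reads has no chance of being spanned by a single read regardless of depth, which is one plausible reason why 150-300 kbp stretches of 'N' can survive a gap-closing run.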

We used pre-corrected PacBio reads to close gaps in each chromosome of G. biloba with default parameters. In your case, I would suggest that you 1) correct the long reads using NGS short reads, and 2) iteratively close gaps using the same corrected dataset.
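
The iterative step could be sketched as follows. This only builds the command lines and does not run them. The round count, file names, and the helper `iterative_gapclose_commands` are hypothetical; the sketch assumes that each round writes its result to `<output>.scaff_seqs` (as in the stats posted above) and that `--ne` is used to skip the built-in error correction because the reads are already corrected. Please check the flags against the README of the version you have installed.

```python
def iterative_gapclose_commands(scaff: str, reads: str, prefix: str,
                                rounds: int = 3, threads: int = 32) -> list[list[str]]:
    """Build one tgsgapcloser command per round, chaining outputs to inputs."""
    cmds = []
    current = scaff
    for i in range(1, rounds + 1):
        out = f"{prefix}.round{i}"
        cmds.append([
            "tgsgapcloser",
            "--scaff", current,      # scaffolds from the previous round
            "--reads", reads,        # same pre-corrected long reads every round
            "--output", out,
            "--ne",                  # assumption: reads were corrected beforehand
            "--thread", str(threads),
        ])
        current = f"{out}.scaff_seqs"  # assumed output name of this round
    return cmds


for cmd in iterative_gapclose_commands("chromosome.fasta", "corrected_reads.fasta", "chr"):
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` once the paths are adjusted; stopping when the gap count no longer decreases between rounds is a reasonable termination rule.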

As for unlocalized small scaffolds/contigs, I would not recommend using them for gap closing. They are likely to come from other unassembled genomic regions or from mtDNA rather than from the chromosomes. If the chromosome construction is poor, in particular if it contains many duplicated sequences and is highly fragmented, then you can use unlocalized contigs to close gaps.

Thanks, Mengyang

gforg34 commented 8 months ago

Hi @adonis316,

Thanks for your quick reply. So you are suggesting that I use tgsgapcloser2 with default parameters to reduce memory use, and that I close the gaps with PacBio reads that have been pre-corrected using NGS short reads. However, Racon does not do that, and neither does Pilon, I guess? Would you recommend different software for correcting the low-depth reads? If so, are there any specific tools you could suggest? In my case the genome assembly is quite repetitive and also contains many uncharacterized regions, since it was assembled from short reads. I don't know whether the unlocalized scaffolds are assembled regions that did not fit anywhere in the genome, which is why I assume they could fill some gaps where the long reads cannot. Thanks again for your help. Let me know what you think!

adonis316 commented 8 months ago

As you have only 9x long reads, you should do error correction using accurate short reads. There are many tools with good performance. Pilon is an easy-to-use jar package for error correction (https://github.com/broadinstitute/pilon/releases/), but it is usually slow. Other options include NextPolish (https://github.com/Nextomics/NextPolish) and fmlrc2 (https://github.com/HudsonAlpha/fmlrc2).

You can try to use unplaced scaffolds/contigs to close gaps. But I recommend you verify the results based on short-read coverage or other information after gap filling.
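
One way to do this check is sketched below, assuming you already have a per-base short-read depth list for the chromosome (for example parsed from `samtools depth -a` output) and a list of filled intervals. The helper `low_coverage_fills`, the 0-based half-open coordinates, and the threshold of 5x are all hypothetical choices for illustration.

```python
def low_coverage_fills(depth: list[int], fills: list[tuple[int, int]],
                       min_mean_depth: float = 5.0) -> list[tuple[int, int, float]]:
    """Return filled intervals whose mean short-read depth is below a threshold.

    `fills` holds 0-based, half-open (start, end) intervals of filled sequence.
    """
    flagged = []
    for start, end in fills:
        window = depth[start:end]
        mean = sum(window) / len(window) if window else 0.0
        if mean < min_mean_depth:
            flagged.append((start, end, mean))
    return flagged


# Toy example: the second fill has no short-read support at all.
depth = [30] * 10 + [0] * 5 + [28] * 10
print(low_coverage_fills(depth, [(2, 8), (10, 15)]))
```

Fills flagged this way are candidates for manual inspection or for reverting to 'N', since a fill taken from an unplaced contig with no short-read support may not belong at that position.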

gforg34 commented 8 months ago

Thanks for your reply and suggestions, @adonis316. Unfortunately, I cannot give you an update on this, since the accession of the reference genome is different from the accession for which I generated the long-read data. So I don't have long reads that correspond to the genome assembly. This is probably also the reason why tgsgapcloser did not work satisfactorily. I have to close this issue for now, but I'll keep your suggestion as the most promising one.

Isoris commented 8 months ago

Can I ask whether Pilon is able to close long gaps or not? I have used it before in another project, and from my understanding it takes aligned reads as --frags or --bam input, including long reads, but it is not able to align long reads on its own?

So do you recommend first using Racon and then Pilon, or something else?

Thank you in advance