BGI-Qingdao / TGS-GapCloser

A gap-closing software tool that uses long reads to enhance genome assembly.
GNU General Public License v3.0

Unable to move beyond mapping the reads #43

Closed: amit4mchiba closed this issue 2 years ago

amit4mchiba commented 2 years ago

Hi,

I am writing to seek your help with an error I am getting while running TGS-GapCloser. I am working on a plant genome assembly of 1.75 Gb; contigs were generated from PacBio reads (75x genome coverage) and scaffolding was performed with Hi-C libraries. Once the assembly was generated, I wanted to perform gap closing. For that, I used the entire set of raw reads from my Canu-based assembly and ran tgsgapcloser in correction mode with the following command:

```bash
tgsgapcloser \
    --scaff assembly.FINAL.fasta \
    --reads raw_reads.fasta.gz \
    --output Sf_tgsgap_raconpolished \
    --racon /mnt/HD1/racon/build/bin/racon \
    --tgstype pb \
    --thread 48 \
    --r_round 3
```

Here is the output I received:

```
INFO : Run tgsgapcloser from /mnt/HD1/miniconda3/bin ; Version : 1.1.1 ; Release time : 2019-12-31 .
INFO : Parsing args starting ... --scaff 2-Sf_3rdAug2020_0.03_0.085_l10000_purge_gcpp.FINAL.fasta --reads Sf_all_pacbioraw.fasta.gz --output Sf_tgsgap_raconpolished --racon /mnt/HD1/racon/build/bin/racon --tgstype pb --thread 48 --r_round 3
INFO : Parsing args end .
INFO : Checking basic args & env ...
INFO : Checking basic args & env end.
INFO : Step 1 , run TGSSeqSplit to split scaffolds into contigs.
INFO : Step 1 , done .
INFO : Step 2 , run TGSCandidate ...
```

I then checked the log file; this is what I got:

```
/home/amit8chiba/miniconda2/bin/tgsgapcloserbin/tgsgapcandidate --contig2ont_paf Sf_tgsgap_raconpolished.sub.paf --ont_reads_a Sf_all_raw_pacbio.fasta.gz --candidate_merge true --candidate_shake_filter true --candidate_max 10
TGSGapCandidate INFO JST 2022/2/14 5:6:55 : TGSGapCandidata start now ...
TGSGapCandidate INFO JST 2022/2/14 5:6:55 : LoadONTReads start now ...
TGSGapCandidate INFO JST 2022/2/14 5:26:12 : >total load ONT reads : 14967046
TGSGapCandidate INFO JST 2022/2/14 5:26:12 : LoadONTReads finish. used wall clock : 1157 seconds, cpu time : 1017.411621 seconds
TGSGapCandidate INFO JST 2022/2/14 5:26:12 : LoadPAF start now ...
```

It seems that the run stopped at the TGSGapCandidate stage.

I wonder how I should proceed. I think the program keeps consuming too much of my machine's RAM and was eventually killed by my server, even though I limited the number of threads. Also, is it possible to restart the run without repeating step 1? That is, can the pipeline resume from a later stage once the earlier stages have completed successfully?

I would be grateful for your advice. Please let me know if you need any further details.

I installed tgsgapcloser and minimap2 using conda, and compiled racon from source.

thank you in advance,

with best regards Amit

adonis316 commented 2 years ago

Hi Amit, it seems the PAF file is too large and TGS-GapCloser was terminated by the system or administrator because it exceeded the available memory.

You could try subsampling the PacBio long reads (preferring longer reads, or sampling randomly) to reduce memory usage, then iteratively use one portion at a time to close the gaps, as sketched below.
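
As an illustration of the iterative strategy above, here is a minimal sketch, assuming the reads have already been split into coverage chunks and that each round's gap-filled scaffolds are written to `<output>.scaff_seqs` (chunk names and paths are illustrative):

```bash
#!/usr/bin/env bash
# Iterative gap closing: each round takes the previous round's gap-filled
# scaffolds as the new --scaff input. Chunk files and paths are illustrative.
set -euo pipefail

scaff=assembly.FINAL.fasta
for chunk in reads_chunk1.fasta.gz reads_chunk2.fasta.gz reads_chunk3.fasta.gz; do
    out="round_${chunk%%.*}"
    tgsgapcloser \
        --scaff "$scaff" \
        --reads "$chunk" \
        --output "$out" \
        --racon /mnt/HD1/racon/build/bin/racon \
        --tgstype pb \
        --thread 48
    # Assumption: the gap-filled scaffolds land in <output>.scaff_seqs;
    # adjust if your TGS-GapCloser version names the output differently.
    scaff="${out}.scaff_seqs"
done
```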

You could also use the Canu-corrected reads and skip the error-correction step when closing the gaps. That saves time, but it has no effect on memory usage.

Of course you can restart the program and skip the completed steps if the input datasets remain the same. Note that this only helps if you can increase the memory limit of your server; otherwise, if you switch to fewer input long reads to close the gaps, you should start from the beginning.

Thanks, Mengyang

amit4mchiba commented 2 years ago

Dear Mengyang,

Thank you so much for such a quick response.

I actually never set a memory limit, but the run reaches up to 1 TB of memory and is then terminated (this happens when tgsgapcloser starts processing the minimap2 results).

About subsampling, do you mean using the chunk option?

About restarting, I did try that, but the program started from stage one again and began with mapping. I wonder if I could run this program in a step-wise manner.

Also, using Canu-corrected reads for gap filling sounds like a good idea. What dataset size would you recommend? For instance, should I choose about 10x genome coverage and run gap filling in a loop to use all the coverage? (I have 70x coverage, so I would feed the gap-filled genome from each cycle back in as input until all the reads are used up.)

Thank you so much in advance for your advice,

with best regards Amit

adonis316 commented 2 years ago

Hi Amit,

  1. We have successfully tested a 10x long-read dataset (120 Gb of PacBio reads) to close gaps in a 12 Gb ginkgo genome assembly on a 3 TB node. The memory bottleneck in your case is probably loading the huge PAF file, whose size depends on genome features (repeats…) and the long-read profile (basecalling accuracy…). Please check the PAF alignment file size; at the very least, it should be smaller than 1 TB.
  2. As TGS-GapCloser was designed for low-coverage long reads, 20x can be sufficient to close most of the gaps with high quality. You could try using the 20x longest reads to close gaps; that would definitely reduce memory usage. It might still fail for a repetitive plant genome, in which case try even lower coverage.
  3. The chunk strategy is also an option. I would suggest using 20x~30x of subsampled data at a time.
  4. If you want to restart the program in the middle, you need to install TGS-GapCloser from GitHub (see the sketch below). The installation is pretty easy and only requires gcc 4.4+ and make 3.8+. You can also modify the main pipeline script, TGS-GapCloser.sh, from there.
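
For point 4, a minimal sketch of the from-source build (the repository URL matches this project; the plain `make` invocation is an assumption based on the stock Makefile):

```bash
# Clone and build TGS-GapCloser so the pipeline can be re-entered mid-run.
git clone https://github.com/BGI-Qingdao/TGS-GapCloser.git
cd TGS-GapCloser
make    # requires gcc 4.4+ and make 3.8+
# The main pipeline script, TGS-GapCloser.sh, now lives in this directory
# and can be edited to skip steps that already completed.
```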

Thanks, Mengyang

amit4mchiba commented 2 years ago

Dear Mengyang,

Loads of thanks. Indeed, the PAF file is 1.5 TB, which probably explains the issue. I will restart the analysis using 20-30x coverage with the longest possible reads.

Just one last question: could you advise how to filter PacBio reads by read length? Normally for assembly we provide a read-length cutoff, but I am not sure how to apply one to raw reads. Do you have any recommendations? I am aware of splitting FASTA files, but not of filtering them by read length.

Thanks again. I will re-run and let you know whether it succeeds or I run into some new trouble.

with best regards Amit

adonis316 commented 2 years ago

Hi Amit, you can use FASTA/Q tools such as seqkit (https://github.com/shenwei356/seqkit) to sort long reads by length (seqkit sort -lr), then take reads from the top of the sorted file until you reach the desired data size (35~50 Gb).
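
To make that concrete, here is a minimal sketch that keeps only the longest reads up to a target base count (the 40 Gb target and file names are illustrative; the cut may slightly overshoot on the final record):

```bash
# Sort reads longest-first, then keep whole records until the cumulative
# base count reaches ~40 Gb (roughly 20-25x for a 1.75 Gb genome).
seqkit sort -l -r raw_reads.fasta.gz \
  | awk 'BEGIN { limit = 40e9 }
         /^>/  { if (total >= limit) exit }
               { if ($0 !~ /^>/) total += length($0); print }' \
  > longest_subset.fasta
```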

Thanks, Mengyang

amit4mchiba commented 2 years ago

Thank you so much for your reply. I have begun the analysis following your suggestions and will let you know once it is done.

Meanwhile, I had one more question, and would be grateful if you could help me to confirm the steps that I have taken here.

I used the Canu-corrected and trimmed reads as input for gap filling (45x genome coverage) with the following command:

```bash
tgsgapcloser \
    --scaff assembly.FINAL.fasta \
    --reads assembly.trimmedReads.fasta.gz \
    --output assembly_tgsgap_ne \
    --racon /mnt/HD1/racon/build/bin/racon \
    --tgstype pb \
    --thread 48 \
    --ne
```

I checked the run log, and it shows the following:

```
INFO : Run tgsgapcloser from /mnt/HD1/miniconda3/bin ; Version : 1.1.1 ; Release time : 2019-12-31 .
INFO : Parsing args starting ... --scaff assembly.FINAL.fasta --reads assembly.trimmedReads.fasta.gz --output assembly_tgsgap_ne --racon /mnt/HD1/racon/build/bin/racon --tgstype pb --thread 48 --ne
INFO : Parsing args end .
INFO : Checking basic args & env ...
INFO : Checking basic args & env end.
INFO : Step 1 , run TGSSeqSplit to split scaffolds into contigs.
INFO : Step 1 , done .
INFO : Step 2 , skip TGSCandidate by --ne option.
INFO : Step 3 , skip error-correction by --ne option.
INFO : Step 4 , gap filling ...
```

As you can see, TGSCandidate appears to be skipped here. Is this OK, and is the command I used acceptable?

Will appreciate your advice here.

thank you so much,

with best regards Amit

adonis316 commented 2 years ago

Seems OK to me. Since you used pre-corrected reads, the --ne option skips step 3 (error correction) as well as step 2 (TGSCandidate), which selects the candidate long reads for the subsequent error-correction procedure.

Mengyang

amit4mchiba commented 2 years ago

Dear Mengyang,

I am writing to seek your advice, as I am still stuck at the same place. As you suggested, I tried the following, but the run still could not complete:

  1. I first cut the raw reads down to 20x genome coverage and ran the pipeline in Racon correction mode. The resulting .paf file was 576 GB (less than 1 TB), but the run still crashed once it reached 1 TB of memory.
  2. Next, I used corrected reads instead, took 15x genome coverage, and ran the pipeline without correction mode. Unfortunately, in this case the alignment file reached 1.4 TB and the run crashed.

In both cases, I was unable to complete the gap filling. I realized that after mapping, the program keeps requesting more memory, and the server then kills it (my server's maximum memory is 1 TB; usage reaches almost 0.99 TB and the program is killed automatically). Is it possible to cap the memory that tgsgapcloser can use so that, even if it takes longer, it at least will not be killed? For example, PBJelly takes much longer but keeps running and eventually fills the gaps. tgsgapcloser is much, much faster, but I am not able to run it in this case. I will appreciate your advice.

I also tried one more thing. My plant genome is highly heterozygous and has nine pseudomolecules. I thought that filling the gaps one chromosome at a time might work, so I split the genome assembly into nine separate assemblies (one per pseudomolecule) and ran tgsgapcloser using 30x raw PacBio reads. In this case, each pseudomolecule produced an alignment file of around 400 GB, and the run completed with a gap-filled assembly. However, it seems to have filled all the gaps, which looks very suspicious to me. I suspect that reads used to fill a gap in one pseudomolecule were used again for other pseudomolecules. Is this approach valid? Is there some way I could proceed?

I am so sorry for such a long message and will be grateful for your insights.

thank you so much,

with best regards Amit

adonis316 commented 2 years ago

Sorry to hear that. It seems the aligner, minimap2, generated too many alignments due to the high heterozygosity of the target genome. Thus, I would suggest:

  1. Try using even lower coverage of long reads as input to further reduce the number of possible alignments.
  2. Use stricter minimap2 parameters by adding --minmap_arg ' -x map-pb' to reduce the number of alignments minimap2 outputs.
  3. Use a stricter threshold when choosing candidate long reads by increasing --min_match, to filter out suspicious multi-alignments. This might also resolve the genome-size expansion you saw when filling gaps for each chromosome individually. (A combined example is sketched after this list.)
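
A minimal sketch combining points 2 and 3 (the --min_match and --min_idy values below are illustrative, not developer recommendations; check the TGS-GapCloser documentation for the defaults):

```bash
# Stricter minimap2 preset plus stricter candidate filtering; the cutoff
# values here are illustrative only.
tgsgapcloser \
    --scaff assembly.FINAL.fasta \
    --reads pacbio_20x_longest.fasta.gz \
    --output tgsgap_strict \
    --racon /mnt/HD1/racon/build/bin/racon \
    --tgstype pb \
    --thread 48 \
    --minmap_arg ' -x map-pb' \
    --min_match 1000 \
    --min_idy 0.4
```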

We have tested splitting the genome by chromosome and filling gaps individually. The results were OK, with reduced memory usage, because we used larger values for --min_match and --min_idy.

Long reads already used for genome assembly can be re-used for gap closure, because some reads with higher error rates are filtered out by the assembly algorithm yet still provide long-range connection information.

The actual memory usage is much larger than the PAF file size. We are working on an algorithmic improvement.

Efficiency is the first priority for TGS-GapCloser, so it is possible that some gaps are wrongly filled by low-quality alignments, especially with high coverage of long reads. Please check the genome size or other criteria: after parameter tuning, the resulting genome size should be close to the size expected from other techniques (k-mer analysis, flow cytometry…).
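
As one quick sanity check along those lines, seqkit (mentioned earlier in this thread) can compare total assembly length before and after gap filling (file names are illustrative, including the assumed `.scaff_seqs` output name):

```bash
# A large jump in total length after gap filling suggests gaps filled by
# spurious alignments.
seqkit stats -a assembly.FINAL.fasta tgsgap_strict.scaff_seqs
```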

Thanks, Mengyang

qdu-beep commented 8 months ago


Hello Mengyang, I would appreciate your assistance. Because the server I'm using doesn't have a memory limit set, I'm worried about overwhelming the server when I run tgsgapcloser. I have checked some issues about large genomes, but I am still not sure how to run it on my genome. Thank you for your patience!

I have a 3.7 Gb plant genome with a repetitive-sequence content of 67%. I plan to use 30x HiFi reads and the output of hifiasm (the p_utg or p_ctg sequences) to fill the gaps in the genome. The genome contig N50 is 96,520,916 and the scaffold N50 is 303,828,950.

The command I plan to run is as follows; using --ne seems to skip the two most memory-consuming steps. Please tell me whether I should reduce the HiFi reads (seqkit sort), or use larger values for --min_match and --min_idy.

Command (--reads will be hifi.fasta, p_utg.fasta, or p_ctg.fasta):

```bash
tgsgapcloser \
    --scaff final.fa \
    --reads hifi.fasta \
    --output hifi \
    --ne \
    --minmap_arg '-x asm20 -K 80M' \
    --tgstype pb \
    --thread 28 \
    --g_check \
    --min_match 2000 \
    > pipe.log 2> pipe.err
```