BGI-Qingdao / TGS-GapCloser

A gap-closing software tool that uses long reads to enhance genome assembly.
GNU General Public License v3.0

TGS-GapCloser dies #10

Closed dcopetti closed 4 years ago

dcopetti commented 4 years ago

Hello, I was running TGS-GapCloser on a >4 Gb plant assembly as:

```
tgsgapcloser --scaff OF1p3cSS2_17285_200609.fa --reads Rabiosa_G303_all_20kb_q8.fa --racon /home/copettid/miniconda2/envs/tgs_gapcloser_env/bin/racon --output OF1p3cSS2_TGSgc --min_idy 0.3 --thread 90 >pipe_ass.log 2>OF1p3cSS2_TGSgc.err
```

and it died unexpectedly. The files I have are:

-rwx------. 1 copettid mpb 4.4G Jun 11 14:26 OF1p3cSS2_17285_200609.fa
-rw-r--r--. 1 copettid mpb  146 Jun 11 23:13 OF1p3cSS2_TGSgc.seq_split.log
-rw-r--r--. 1 copettid mpb 3.8G Jun 11 23:15 OF1p3cSS2_TGSgc.contig
-rw-r--r--. 1 copettid mpb 950K Jun 11 23:15 OF1p3cSS2_TGSgc.orignial_scaff_infos
-rw-r--r--. 1 copettid mpb 350K Jun 11 23:15 OF1p3cSS2_TGSgc.name_map
-rw-r--r--. 1 copettid mpb 3.0T Jun 13 17:02 OF1p3cSS2_TGSgc.sub.paf
-rw-r--r--. 1 copettid mpb  20K Jun 13 17:02 OF1p3cSS2_TGSgc.minimap2.01.log
-rw-r--r--. 1 copettid mpb    0 Jun 13 17:02 OF1p3cSS2_TGSgc.ont.fasta
-rw-r--r--. 1 copettid mpb 1016 Jun 13 17:02 pipe_ass.log
-rw-r--r--. 1 copettid mpb  662 Jun 13 17:05 OF1p3cSS2_TGSgc.cand.log
-rw-r--r--. 1 copettid mpb  314 Jun 14 07:20 OF1p3cSS2_TGSgc.err

and this is what the log and .err file say:

(base) bash-4.2$ less OF1p3cSS2_TGSgc.err
/home/copettid/miniconda2/envs/tgs_gapcloser_env/bin/tgsgapcloser: line 374: 94936 Killed                  $Candidate --ont_reads_a $TGS_READS --contig2ont_paf $OUT_PREFIX.sub.paf --candidate_max 10 --candidate_shake_filter --candidate_merge < $TMP_INPUT_SCAFF_INFO > $OUT_PREFIX.ont.fasta 2> $OUT_PREFIX.cand.log
(base) bash-4.2$ less OF1p3cSS2_TGSgc.cand.log
/home/copettid/miniconda2/envs/tgs_gapcloser_env/bin/tgsgapcloserbin/tgsgapcandidate --contig2ont_paf OF1p3cSS2_TGSgc.sub.paf --ont_reads_a Rabiosa_G303_all_20kb_q8.fa --candidate_merge true --candidate_shake_filter true --candidate_max 10
TGSGapCandidate INFO    CET     2020/6/13       17:2:27 :       TGSGapCandidata start now ...
TGSGapCandidate INFO    CET     2020/6/13       17:2:27 :       LoadONTReads start now ...
TGSGapCandidate INFO    CET     2020/6/13       17:4:52 :       >total load ONT reads :
1975530
TGSGapCandidate INFO    CET     2020/6/13       17:4:52 :       LoadONTReads finish. used wall clock : 145 seconds, cpu time : 144.869995 seconds
TGSGapCandidate INFO    CET     2020/6/13       17:4:52 :       LoadPAF start now ...
(tgs_gapcloser_env) bash-4.2$ tail pipe_ass.log

INFO  :   Checking basic args & env end.

INFO  :   Step 1 , run TGSSeqSplit to split scaffolds into contigs.

INFO  :   Step 1 , done .

INFO  :   Step 2 , run TGSCandidate ...
              -   2.1 , run minmap2 to map contig into tgs-reads.
              -   2.2 , run TGSGapCandidate to choose candidate region seqs.

The job was running on a 1 TB RAM, 96-core machine with two other jobs going - one of them is another TGS-GapCloser job (on a 20 Mb assembly) run in the same folder but with a different prefix. I have about 15 TB of free disk space, and I don't think all that space could have been used when it crashed. I don't know if it ran out of memory (though I doubt it), or if the two jobs were writing data into a common file (one that did not have the prefix I specified, though). The other job is still running normally.

What could be the cause of this dead job? Thanks, Dario

adonis316 commented 4 years ago

Hi Dario, The .err file shows that your job was killed, most likely due to a large memory requirement (>1 TB). That is plausible, since your alignment PAF file is as large as 3 TB. I guess that the sequencing depth of the input PacBio or ONT long reads is not low, or that your plant genome has relatively many repeats, which can introduce multi-alignments of long reads. To solve it:

  1. You could split the input scaffolds into several parts by chromosome (or scaffold), and run TGS-GapCloser on each part in turn.
  2. Or you could decrease the number of input long reads and iteratively close gaps with different long-read datasets. Typically, 15X coverage is sufficient.
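
Suggestion 1 can be scripted. A minimal sketch in Python (assuming plain, uncompressed FASTA input; the function and output file names are illustrative, not part of TGS-GapCloser):

```python
def read_fasta(path):
    """Yield (header, sequence) tuples from a plain FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def pack_records(records, max_bases):
    """Greedily group records into chunks whose total length stays under
    max_bases, keeping each scaffold whole (a single oversized scaffold
    gets its own chunk)."""
    chunks, current, size = [], [], 0
    for header, seq in records:
        if current and size + len(seq) > max_bases:
            chunks.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += len(seq)
    if current:
        chunks.append(current)
    return chunks

def split_fasta(path, max_bases=1_000_000_000, prefix="chunk"):
    """Write each ~1 Gb chunk to its own FASTA, e.g. chunk_01.fa, chunk_02.fa."""
    for i, chunk in enumerate(pack_records(read_fasta(path), max_bases), 1):
        with open(f"{prefix}_{i:02d}.fa", "w") as out:
            for header, seq in chunk:
                out.write(f"{header}\n{seq}\n")
```

Each resulting chunk can then be passed to `tgsgapcloser --scaff chunk_01.fa ...` with its own `--output` prefix.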

Thanks, Mengyang

dcopetti commented 4 years ago

Hi Mengyang, Thank you for the feedback. I split the assembly into ~1 Gb files. My genome indeed has lots of repeats (>80% TEs; about 25% of the sequence belongs to a single family), and this could be the reason for the multi-alignments. My long reads are all 20 kb or longer, above QV8, and amount to about 16x coverage. Do you think I should be a bit more stringent, e.g. by adjusting --min_idy? How about adding back the map-ont option? (see https://github.com/BGI-Qingdao/TGS-GapCloser/issues/8)

Also, which stage requires the most memory? I would like to, e.g., start the second chunk while the first is still running, but without running out of memory of course. Thanks

adonis316 commented 4 years ago

I would suggest increasing --min_match and increasing --min_idy to reduce the RAM requirement. But it could affect the gap-closing efficiency. It also depends on your input contig length.

Plus, you could also try map-ont by manually modifying the script TGS-GapCloser.sh. Again, it might decrease the efficiency.

QV8 is not a good quality value, so I would suggest polishing the long reads with Racon.

TGSCandidate requires a lot of memory because it stores all the candidate long reads in cache. The other stage that can consume a lot of memory is Racon, which stores all overlaps while polishing the sequences. You can estimate the memory requirement from the size of the PAF alignment file.
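
To check whether repeat-driven multi-alignments are what inflated a PAF file (and hence TGSGapCandidate's memory), one rough diagnostic is to tally alignments per read. A sketch, assuming the standard tab-separated PAF layout with the query name in column 1 (the function names are made up; the commented file name is the one from this thread):

```python
from collections import Counter

def alignment_counts(paf_lines):
    """Tally PAF records per query (long-read) name from an iterable of
    PAF lines (e.g. an open file). Repeat-driven multi-mapping shows up
    as very high counts for some reads."""
    counts = Counter()
    for line in paf_lines:
        if line.strip():
            counts[line.split("\t", 1)[0]] += 1
    return counts

def summarize(counts, top=5):
    """Overall alignment statistics plus the most heavily mapped reads."""
    total = sum(counts.values())
    reads = len(counts)
    return {
        "alignments": total,
        "reads": reads,
        "mean_per_read": total / reads if reads else 0.0,
        "most_mapped": counts.most_common(top),
    }

# Usage (streams the file, so a multi-TB PAF is fine memory-wise):
# with open("OF1p3cSS2_TGSgc.sub.paf") as fh:
#     print(summarize(alignment_counts(fh)))
```

A mean well above the read depth, or a handful of reads with thousands of records each, would point at the repeat families as the source of the blow-up.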

Thanks, Mengyang

dcopetti commented 4 years ago

Thanks! I will first try error-correcting the reads with Racon (the documentation is not clear on which type of input data is preferred or required): do you think the memory needed will be lower with error-corrected reads? Do you have any recommended minimap2/Racon settings?

My contig N50 is about 670 kb (N90 130 kb), and scaffold N50 is around 20 Mb. The PAF file is 3 TB; do you think that was the issue in this case?

To simplify the minimap2 output (without losing too much information, of course), could I add -N (3?) to the minimap2 command? Also, can I change THREAD=16 in tgsgapcloser to a higher value, since I have 96 CPUs available and often saw only a fraction of them being used?

I am running a test on a 20 Mb scaffold (0.5% of the total assembly, same input read file) and it is taking 5 days: is this the timeline I should expect?

adonis316 commented 4 years ago

You could do the error correction either prior to, or during, the gap closure by TGS-GapCloser. The difference is that pre-correction consumes much more CPU and memory up front, but the reduced number of input long reads decreases the consumption during the gap closure itself.

Based on my experience, I would suggest using ava-pb/ava-ont for minimap2 and default parameters for Racon, which are the settings in TGS-GapCloser.

Your contigs are pretty long, and that could be part of the reason for the high number of multi-mappings. I would recommend increasing --min_match to >1 kb.
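
To get a feel for what a stricter --min_match would do before rerunning the whole pipeline, one can filter the existing PAF directly. This is not TGS-GapCloser's own code (the tool applies --min_match internally); it is just a sketch assuming the standard 12-column PAF layout, where column 10 is the number of residue matches:

```python
def passes_min_match(paf_line, min_match=1000):
    """True if a PAF record has at least min_match matching bases
    (PAF column 10, 'number of residue matches')."""
    fields = paf_line.rstrip("\n").split("\t")
    return len(fields) >= 10 and int(fields[9]) >= min_match

def filter_paf(in_path, out_path, min_match=1000):
    """Stream a PAF file, keeping only records that pass the cutoff.
    Returns (kept, total) so you can see how much the file would shrink."""
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            total += 1
            if passes_min_match(line, min_match):
                fout.write(line)
                kept += 1
    return kept, total
```

Comparing `kept` against `total` at min_match=1000 gives a quick estimate of how much of the 3 TB file consists of short, repeat-driven alignments.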

We have already limited -N to 10, but you could reduce it further. Note that it will decrease the mapping efficiency.

The thread number has an important effect on the mapping and error-correction stages; please use as many threads as you can. But the other stages currently do not support multi-threading.

We have tested a 60 Mb input scaffold with 30X long reads using 8 threads, and it finished in one hour. Please check the log and the PAF size.