Hello, I was running TGS-GapCloser on a >4 Gb plant assembly as

tgsgapcloser --scaff OF1p3cSS2_17285_200609.fa --reads Rabiosa_G303_all_20kb_q8.fa --racon /home/copettid/miniconda2/envs/tgs_gapcloser_env/bin/racon --output OF1p3cSS2_TGSgc --min_idy 0.3 --thread 90 >pipe_ass.log 2>OF1p3cSS2_TGSgc.err

and it died unexpectedly. These are the files I have, and this is what the log and .err files say:

The job was running on a 1 TB RAM, 96-core machine with two other jobs going; one of them is another TGS-GapCloser job (on a 20 Mb assembly) run in the same folder but with a different output prefix. I have about 15 TB of free disk space, and I don't think all of it could have been used up when the job crashed. I don't know whether it ran out of memory (though I doubt it), or whether the two jobs were writing data into a common file (one that did not carry the prefix I specified). The other job is still running normally.

What could be the cause of this dead job? Thanks, Dario
Hi Dario, the .err output shows that your job was killed, most likely because of its large memory requirement (>1 TB). That is plausible, since your alignment PAF file is as large as 3 TB. I guess the sequencing depth of your input PacBio or ONT long reads is not low, or your plant genome is relatively repeat-rich, which can introduce multi-alignments of the long reads. To solve it, you could split the assembly into smaller chunks and close the gaps chunk by chunk.
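For example (a sketch only; seqkit and its split2 -l option are an assumption on my side, and the chunk size is illustrative):

```
# split the assembly into chunks of roughly 1 Gb of sequence each (adjust -l to taste)
seqkit split2 -l 1G OF1p3cSS2_17285_200609.fa -O chunks/
```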
Thanks, Mengyang
Hi Mengyang,
Thank you for the feedback; I split the assembly into ~1 Gb files. My genome does indeed have lots of repeats (>80% TEs, and about 25% of the sequence belongs to a single family), which could be the reason for the multi-alignments. My long reads are all 20 kb or longer, above QV8, and amount to about 16x coverage.
Do you think I should be a bit more stringent, e.g. by reducing --min_idy? How about adding back the map-ont option? (See https://github.com/BGI-Qingdao/TGS-GapCloser/issues/8.)
Also, which stage requires the most memory? I would like to, for example, start the second chunk while the first is still running, without running out of memory of course. Thanks
I would suggest increasing --min_match and increasing (not reducing) --min_idy to lower the RAM requirement, though this could reduce the gap-closing efficiency. It also depends on your input contig length.
You could also try map-ont by manually modifying the script TGS-GapCloser.sh; again, this might decrease the efficiency.
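For example (a sketch only; the exact minimap2 line inside the script differs between versions, so please check it before editing, and the map-pb string below is my assumption about its contents):

```
# back up the pipeline script, then switch the mapping preset from PacBio to ONT
cp TGS-GapCloser.sh TGS-GapCloser.sh.bak
sed -i 's/map-pb/map-ont/' TGS-GapCloser.sh
```

If your release already exposes a --tgstype ont option, using that would be preferable to editing the script.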
QV8 is not a good quality value, so I would suggest polishing the long reads with Racon first.
The TGSCandidate stage requires a lot of memory, because it stores all the candidate long reads in a cache. The other stage that can consume a lot of memory is Racon, which holds all the overlaps in order to polish the sequences. You can estimate the memory requirement from the size of the PAF alignment file.
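As a rough pre-flight check (the file name pattern is a guess; adjust it to whatever your run actually produced):

```
# the overlap/alignment file size is a first-order proxy for the RAM footprint
du -h OF1p3cSS2_TGSgc*.paf
```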
Thanks, Mengyang
Thanks! I will first try error-correcting the reads with Racon (the documentation is not clear on which type of input data is preferred or required): do you think the memory needed will be lower with error-corrected reads? Do you have any recommended minimap2/Racon settings?
My contig N50 is about 670 kb (N90 130 kb); scaffold N50 is 20 Mb or so. The PAF file is 3 TB; do you think that was the issue in this case?
To simplify the minimap2 output (without losing too much information, of course), could I add -N (3?) to the minimap2 command? Also, can I change THREAD=16 in the tgsgapcloser script to a higher value (I have 96 CPUs available), since I often saw that only a fraction of the CPUs were used?
I am running a test on a 20 Mb scaffold (0.5% of the total assembly, same input read file) and it is taking 5 days: is this the timeline I should expect?
You can do the error correction either before or during the gap closing with TGS-GapCloser. The difference is that pre-correction consumes much more CPU and memory up front, but the reduced number of input long reads then lowers the consumption during gap closing.
Based on my experience, I would suggest using ava-pb/ava-ont for minimap2 and default parameters for Racon, which are the settings used inside TGS-GapCloser.
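A minimal sketch of one pre-correction round (file names are placeholders; use ava-pb instead of ava-ont for PacBio data):

```
# all-vs-all overlap of the raw reads
minimap2 -x ava-ont -t 96 reads.fa reads.fa > reads.ava.paf
# one round of Racon self-correction with default parameters
racon -t 96 reads.fa reads.ava.paf reads.fa > reads.racon1.fa
```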
Your contigs are pretty long, and that could be part of the reason for the large number of alignments. I would recommend increasing --min_match to >1 kb.
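Applied to your command line, that would look something like this (the values are illustrative, and --min_match is in bp):

```
tgsgapcloser --scaff chunk01.fa --reads reads.racon1.fa \
    --output chunk01_tgsgc --min_match 1000 --min_idy 0.4 \
    --racon /home/copettid/miniconda2/envs/tgs_gapcloser_env/bin/racon --thread 96
```

If your build has a switch to skip the in-pipeline error correction (e.g. --ne), that would also avoid polishing already-corrected reads twice.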
We have already limited -N to 10, but you could reduce it further. Note that it will decrease the mapping efficiency.
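If you want to experiment, the standalone equivalent of the candidate-mapping step would look roughly like this (a sketch only; the real command lives inside TGS-GapCloser.sh, and its preset and options vary between versions):

```
# cap the retained secondary alignments per read at 3 instead of the shipped 10
minimap2 -x map-pb -N 3 -t 96 chunk01.fa reads.racon1.fa > chunk01.aln.paf
```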
The thread count has a big effect on the mapping and error-correction stages, so use as many threads as you can. The other stages currently do not support multiprocessing.
We have tested a 60 Mb input scaffold with 30x long reads using 8 threads, and it finished in one hour. Please check the log and the PAF size.