JiaoLaboratory / CRAQ

Identification of errors in draft genome assemblies with single-base pair resolution for quality assessment and improvement
https://doi.org/10.1038/s41467-023-42336-w
MIT License

Out of Memory #14

Closed yangfangyuan0102 closed 6 months ago

yangfangyuan0102 commented 6 months ago

Hi, dear author. The run failed on my machine with 128 GB of RAM. My genome is 350 Mb; the PacBio CLR long-read BAM is 21 GB and the short-read BAM is 11 GB. Memory usage increased slowly and continuously until the process was finally killed. Is this normal?

Best wishes

craq -g genome.fa -sms lgs.sort.bam -ngs sgs.sort.bam --sms_coverage 150 --ngs_coverage 65 -t 32
Running CRAQ analysis .........
PARAMETERS:
Genome sequence(-g): genome.nextpolish.fa
SMS input(-sms): lgs.sort.bam
NGS input(-ngs): sgs.sort.bam
Minimum NGS clipped-reads (-sn): 2
Minimum SMS clipped-reads (-ln): 2
Clipping coverRate(-sf): 0.75
Lower clipping rate for heterozygous allele(-hmin): 0.4
Upper clipping rate for heterozygous allele(-hmax): 0.6
Block score benchmarking (-rw): 1000000
Gap[N] is treated with (-gm): 1:CRE
Minimum gapsize(-mgs): 10
Break chimera fragments (-b): F
Search error region (-ser): T
Mapping SMS reads use (-x): map-hifi
Mapping quality (-q): 20
Window size for error normalizing (-nw): 35560
Plot CRAQ metrics (-pl): F
Alignment thread(-t): 32
Current working at : /mnt/dataset/moth/species/cs/0assembly/9polish/4polish
CRAQ output dir(-D): /mnt/dataset/moth/species/cs/0assembly/9polish/4polish/output
-------------------------Start Running-------------------------

Running SMS long-reads CRAQ analysis ......
CMD: /mnt/data/miniconda3/envs/craq/bin/../share/CRAQ/src/runLR.sh -g genome.nextpolish.fa -x map-hifi -z seq.size -1 lgs.sort.bam -q 20 -m 2 -f 0.4 -h 0.6 -r 0.75 -a 150 -d 50000 -v F -t 32
worker_pipeline:: Skipping alignment::
[M::worker_pipeline:: Filtering bamfiles]
[M::worker_pipeline:: Compute effective SMS coverage]
[M::worker_pipeline:: Extract SMS clipping signal]
[M::worker_pipeline:: Collect potential CSE|H]
[M::worker_pipeline:: Collect potential CRE|H]

The run went out of memory and was killed at this point.
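For anyone wanting to put a number on the memory growth described above, here is a minimal sketch that runs a command and reports the peak resident memory of its child processes. It is not part of CRAQ; the helper name `run_and_report_peak` is made up for illustration, and it relies on Python's Unix-only `resource` module.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd to completion and return (exit_code, peak_rss_mib).

    peak_rss_mib is the largest resident set size among the child
    processes that have been waited for so far (hypothetical helper,
    not a CRAQ feature).
    """
    proc = subprocess.run(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    # On POSIX a negative exit code means the child was killed by a
    # signal, e.g. -9 for SIGKILL, which is what the OOM killer sends.
    return proc.returncode, peak / divisor
```

Called with the full `craq` command as a list of arguments, an exit code of -9 together with a peak close to the machine's RAM would be consistent with an out-of-memory kill.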

JiaoLaboratory commented 6 months ago

This situation is possible, but based on our tests it occurs almost exclusively with exceptionally long single chromosomes (e.g., when a single chromosome exceeds 300 Mb). Generally, CRAQ consumes more memory when processing high-noise ONT data than when processing HiFi data. Even so, memory usage usually remains within manageable limits. By the way, could you please tell us how your 'lgs.sort.bam' was generated? Additionally, if the memory issue cannot be resolved, in our experience reducing the amount of long-read data (e.g., using only 50X coverage) has almost no impact on the final results.
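One way to follow the coverage suggestion is to subsample the existing BAM rather than remap. The sketch below only builds a `samtools view -s` command line; the helper name, the output file name, the seed, and the thread count are assumptions for illustration, and the 150X figure is taken from the `--sms_coverage` value in the report above.

```python
def subsample_command(in_bam, out_bam, current_cov, target_cov, seed=42, threads=4):
    """Build a samtools command that keeps roughly target_cov/current_cov of the reads.

    Hypothetical helper: it returns the argument list and does not run samtools.
    """
    fraction = round(target_cov / current_cov, 4)
    if not 0 < fraction < 1:
        raise ValueError("target coverage must be positive and below current coverage")
    # samtools view -s takes SEED.FRACTION: the integer part is the random
    # seed and the digits after the decimal point are the fraction to keep.
    seed_and_fraction = f"{seed}.{f'{fraction:.4f}'[2:]}"
    return [
        "samtools", "view", "-b",
        "-@", str(threads),
        "-s", seed_and_fraction,
        "-o", out_bam,
        in_bam,
    ]


# Going from about 150X to about 50X keeps one third of the reads.
cmd = subsample_command("lgs.sort.bam", "lgs.50x.bam", current_cov=150, target_cov=50)
print(" ".join(cmd))
```

The resulting BAM would still need to be indexed, and `--sms_coverage` should be lowered to match before rerunning CRAQ.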

yangfangyuan0102 commented 6 months ago

Hi, @JiaoLaboratory Thank you for your reply. I mapped the long reads using minimap2 with typical options: "-ax map-pb | samtools view -F1796 -q 20 | samtools sort ....". I tried cutting the data in half, to 75X, but memory is still insufficient. One thing I realized is that my genome was polished with long and short reads multiple times, so it could be very different from the raw long reads. Could this make the raw long-read alignments too noisy?
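For readers unfamiliar with the `-F1796` filter used in that pipeline, this small sketch decodes a SAM flag value into its component bits. The bit values come from the SAM specification; the short names and the `decode_flag` helper are chosen here for illustration.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """Return the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


print(decode_flag(1796))
# ['unmapped', 'secondary', 'qc_fail', 'duplicate']
```

So `samtools view -F1796` drops unmapped, secondary, QC-failed, and duplicate records. Supplementary alignments (0x800) are not part of 1796, so they pass this filter.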

JiaoLaboratory commented 6 months ago

It's highly likely that this is the cause. Currently, CRAQ's handling of high-noise alignments consumes significant resources. Would you mind trying the previous version, CRAQ-v1.0.8 (https://github.com/JiaoLaboratory/CRAQ/releases/download/v1.0.8/CRAQ-v1.0.8.zip)? That version lacks the optimizations for high-accuracy data processing (the step that consumes the resources), and the handling of high-noise ONT/CLR data has not changed significantly in CRAQ-v1.0.9.

yangfangyuan0102 commented 6 months ago

Great, version 1.0.8 works. Thank you!