HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

It took 10 hrs to call variants against ~40MB reads via Docker Clair3 #312

Closed RunpengLuo closed 5 months ago

RunpengLuo commented 5 months ago

Hello,

Thanks a lot for developing this great variant caller, Clair3. I'm facing an issue when calling variants against a small (~2 Mb) bacterial genome.

In particular, I have about 40 MB of ONT reads, basecalled with Dorado in sup mode. I pulled the Docker image with docker pull hkubal/clair3 and used the model r1041_e82_400bps_sup_v410 from https://github.com/nanoporetech/rerio/tree/master/clair3_models. I ran Clair3 with the following command.

        docker run -it --rm \
            -v ${ALN_DIR}:${ALN_DIR} \
            -v ${REF_DIR}:${REF_DIR} \
            -v ${VC_DIR}:${VC_DIR} \
            -v ${MODEL_NAME}:${MODEL_NAME} \
            hkubal/clair3:latest \
            /opt/bin/run_clair3.sh \
            --include_all_ctgs \
            --no_phasing_for_fa \
            --haploid_precise \
            --sample_name=${sample_id} \
            --bam_fn=${ALN_DIR}/${sample_id}.bam \
            --ref_fn=${REF_FILE} \
            --threads=${THREADS} \
            --platform="ont" \
            --model_path="${MODEL_NAME}" \
            --output=${vcf_dir} \
            1>$LOG_DIR/${sample_id}.clair3.out \
            2>$LOG_DIR/${sample_id}.clair3.err

I've got the following logs.

[INFO] CLAIR3 VERSION: v1.0.9
[INFO] BAM FILE PATH: /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/Alignment/RB49.bam
[INFO] REFERENCE FILE PATH: /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/reference_RB59.fasta
[INFO] MODEL PATH: /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/clair_models/r1041_e82_400bps_sup_v410
[INFO] OUTPUT FOLDER: /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49
[INFO] PLATFORM: ont
[INFO] THREADS: 5
[INFO] BED FILE PATH: EMPTY
[INFO] VCF FILE PATH: EMPTY
[INFO] CONTIGS: EMPTY
[INFO] CONDA PREFIX: 
[INFO] SAMTOOLS PATH: samtools
[INFO] PYTHON PATH: python3
[INFO] PYPY PATH: pypy3
[INFO] PARALLEL PATH: parallel
[INFO] WHATSHAP PATH: whatshap
[INFO] LONGPHASE PATH: EMPTY
[INFO] CHUNK SIZE: 5000000
[INFO] FULL ALIGN PROPORTION: 0.7
[INFO] FULL ALIGN REFERENCE PROPORTION: 0.1
[INFO] PHASING PROPORTION: 0.7
[INFO] MINIMUM MQ: 5
[INFO] MINIMUM COVERAGE: 2
[INFO] SNP AF THRESHOLD: 0.08
[INFO] INDEL AF THRESHOLD: 0.15
[INFO] BASE ERROR IN GVCF: 0.001
[INFO] GQ BIN SIZE IN GVCF: 5
[INFO] ENABLE FILEUP ONLY CALLING: False
[INFO] ENABLE FAST MODE CALLING: False
[INFO] ENABLE CALLING SNP CANDIDATES ONLY: False
[INFO] ENABLE PRINTING REFERENCE CALLS: False
[INFO] ENABLE OUTPUT GVCF: False
[INFO] ENABLE HAPLOID PRECISE MODE: True
[INFO] ENABLE HAPLOID SENSITIVE MODE: False
[INFO] ENABLE INCLUDE ALL CTGS CALLING: True
[INFO] ENABLE NO PHASING FOR FULL ALIGNMENT: True
[INFO] ENABLE REMOVING INTERMEDIATE FILES: False
[INFO] ENABLE LONGPHASE FOR INTERMEDIATE VCF PHASING: False
[INFO] ENABLE PHASING FINAL VCF OUTPUT USING WHATSHAP: False
[INFO] ENABLE PHASING FINAL VCF OUTPUT USING LONGPHASE: False
[INFO] ENABLE HAPLOTAGGING FINAL BAM: False
[INFO] ENABLE LONG INDEL CALLING: False
[INFO] ENABLE C_IMPLEMENT: True

+ /opt/bin/scripts/clair3_c_impl.sh --bam_fn /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/Alignment/RB49.bam --ref_fn /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/reference_RB59.fasta --threads 5 --model_path /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/clair_models/r1041_e82_400bps_sup_v410 --platform ont --output /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49 --bed_fn=EMPTY --vcf_fn=EMPTY --ctg_name=EMPTY --sample_name=RB49 --chunk_num=0 --chunk_size=5000000 --samtools=samtools --python=python3 --pypy=pypy3 --parallel=parallel --whatshap=whatshap --qual=2 --var_pct_full=0.7 --ref_pct_full=0.1 --var_pct_phasing=0.7 --snp_min_af=0.08 --indel_min_af=0.15 --min_mq=5 --min_coverage=2 --min_contig_size=0 --pileup_only=False --gvcf=False --base_err=0.001 --gq_bin_size=5 --fast_mode=False --call_snp_only=False --print_ref_calls=False --haploid_precise=True --haploid_sensitive=False --include_all_ctgs=True --no_phasing_for_fa=True --pileup_model_prefix=pileup --fa_model_prefix=full_alignment --remove_intermediate_dir=False --enable_phasing=False --enable_long_indel=False --keep_iupac_bases=False --use_gpu=False --longphase_for_phasing=False --longphase=EMPTY --use_whatshap_for_intermediate_phasing=True --use_longphase_for_intermediate_phasing=False --use_whatshap_for_final_output_phasing=False --use_longphase_for_final_output_phasing=False --use_whatshap_for_final_output_haplotagging=False

[INFO] Check environment variables
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/log
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/pileup_output
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/merge_output
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/phase_output
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/gvcf_tmp_output
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/full_alignment_output
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/phase_output/phase_vcf
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/phase_output/phase_bam
[INFO] Create folder /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/tmp/full_alignment_output/candidate_bed
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo
[INFO] --include_all_ctgs enabled
[INFO] Call variant in contigs: sample_RB59_2179196
[INFO] Chunk number for each contig: 1
[INFO] 1/7 Call variants using pileup model
Calling variants ...
Total processed positions in sample_RB59_2179196 (chunk 1/1) : 71447
Total time elapsed: 13958.19 s

real    232m46.485s
user    192m1.469s
sys 0m2.607s
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo
[INFO] 2/7 No phasing for full alignment calling

[INFO] 5/7 Select candidates for full-alignment calling
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo
[INFO] Set variants quality cutoff 12.0
[INFO] Set reference calls quality cutoff 14.0
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo
[INFO] Low quality reference calls to be processed in sample_RB59_2179196: 6633
[INFO] Low quality variants to be processed in sample_RB59_2179196: 3535

real    0m1.715s
user    0m1.693s
sys 0m0.099s

[INFO] 6/7 Call low-quality variants using full-alignment model
Calling variants ...
Total processed positions in sample_RB59_2179196 (chunk 2/2) : 168
Total time elapsed: 386.20 s
Calling variants ...
Total processed positions in sample_RB59_2179196 (chunk 1/2) : 10000
Total time elapsed: 22662.64 s

real    377m49.507s
user    382m54.863s
sys 0m4.867s
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo

[INFO] 7/7 Merge pileup VCF and full-alignment VCF
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo
[INFO] Pileup variants processed in sample_RB59_2179196: 1518
[INFO] Full-alignment variants processed in sample_RB59_2179196: 2362

real    0m2.378s
user    0m2.393s
sys 0m0.136s
Warning: cannot find your CPU L2 cache size in /proc/cpuinfo

[INFO] Finish calling, output file: /Users/luorunpeng/Downloads/all-e/Research/project-benjamin_lab/gonorrhoeae/20240418_ACTHealth_gonorrhea/output_clair3/VarCall/RB49/merge_output.vcf.gz

real    610m49.312s
user    575m9.000s
sys 0m8.469s

The whole run took about 10 hrs on macOS with an M2 chip. Docker is busy throughout, but the work does not seem to be parallelized (CPU utilisation stays at ~100%). Even when I downsample the reads to 1x, the pileup step still takes ~1.5 hrs and the run then gets stuck at the calling step. Is this normal, and are there any parameters I can use to speed it up? Thanks a lot for your help!

John
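
To see where the time goes, the "Total time elapsed" lines in the log above can be converted to minutes. This is a minimal sketch: `elapsed_minutes` is a hypothetical helper name, not part of Clair3, and it assumes the log lines have exactly the form shown above.

```shell
# elapsed_minutes: hypothetical helper, not part of Clair3.
# Prints every "Total time elapsed: <seconds> s" entry in minutes.
# Reads the files given as arguments, or stdin when none are given.
elapsed_minutes() {
    awk '/^Total time elapsed:/ { printf "%.1f min\n", $4 / 60 }' "$@"
}

# Example (path taken from the command above):
#   elapsed_minutes "$LOG_DIR/${sample_id}.clair3.out"
```

For the log above, this reports roughly 233 minutes for the pileup step and roughly 378 minutes for the slower full-alignment chunk. The two full-alignment chunks appear to run side by side (their combined "real" time is about 378 minutes), so the entries should be read per step rather than summed.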

aquaskyline commented 5 months ago

Hi Runpeng,

In fact, I did not know that Clair3 could run on a Mac via Docker. I just tried it and found that Docker's virtualization framework is what makes deploying Clair3 on a Mac so easy.

That is to say, Docker runs on top of a virtual machine, which explains why it runs slower. To run Clair3 natively on a Mac, here is a document. The process is complicated, and I believe the steps need some tuning on each system. I just tried the steps in the document on my MacBook M2 Max with Sonoma 14.5 and updated the document so that all the steps run again. Running natively on a Mac means that Clair3 will use the GPU and thus run as fast as, or faster than, running on CPU on a Linux server. If you use Clair3 frequently on a Mac, it is worth some effort to set it up.
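
A quick way to check whether CPU emulation might also be involved is to compare the host architecture with the architecture of the pulled image. This is only a sketch under assumptions: nothing here has verified which architecture the hkubal/clair3 image targets, and `normalize_arch` and `report_arch_mismatch` are hypothetical helper names. The second function needs a running Docker daemon and a locally pulled image.

```shell
# normalize_arch: hypothetical helper that maps common aliases
# (as printed by uname -m or by Docker) onto a single name.
normalize_arch() {
    case "$1" in
        x86_64|amd64) echo "amd64" ;;
        arm64|aarch64) echo "arm64" ;;
        *) echo "$1" ;;
    esac
}

# report_arch_mismatch: hypothetical helper. Compares the host CPU
# architecture with the architecture recorded in the image metadata.
report_arch_mismatch() {
    image="${1:-hkubal/clair3:latest}"
    host=$(normalize_arch "$(uname -m)")
    img=$(normalize_arch "$(docker image inspect --format '{{.Architecture}}' "$image")")
    if [ "$host" = "$img" ]; then
        echo "host and image are both $host"
    else
        echo "host is $host but image is $img: the container is likely emulated"
    fi
}
```

Running `report_arch_mismatch hkubal/clair3:latest` on the Mac would show whether the two differ. A mismatch would suggest instruction-set emulation on top of the virtual machine, but that alone does not prove it is the cause of the slowdown.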

RunpengLuo commented 5 months ago

Hi Dr. Ruibang,

I just tried it on a Linux server, and it worked smoothly and finished in several minutes. I really appreciate the time you spent exploring this. For now I will stick with the Linux server, but I'm happy to try the Mac method via the linked document as well. Thanks again for your effort!