HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Difference in number of calls: CRAM vs BAM #344

Open nicolechai opened 2 weeks ago

nicolechai commented 2 weeks ago

Hi there,

I’m currently comparing the results of Clair3 v1.0.5 when alignments are stored in a BAM file versus a CRAM file. I am using HG002 replicates, and each CRAM file was converted from the corresponding BAM file.

When comparing the total number of variants called from the BAM file versus the converted CRAM, I see a difference of 1 to 5 variants for 6 of the 8 HG002 replicates (out of roughly 6 million variants called per replicate on average). Another thing I found interesting is that once the CRAM is converted back to BAM and processed through Clair3, the total number of variant calls from the converted BAM matches the number from the original BAM.

Here is an example of what I am seeing:

Sample             File type                        Clair3 total number of calls
HG002_replicate_A  BAM                              6095272
HG002_replicate_A  CRAM                             6095277
HG002_replicate_A  BAM (converted back from CRAM)   6095272
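
For anyone wanting to reproduce the BAM -> CRAM -> BAM round trip described above, a minimal sketch with samtools might look like the following. The exact conversion commands used in this issue are not stated, so the options here are an assumption, and the function names (`bam_to_cram`, `cram_to_bam`) and all file names are hypothetical placeholders.

```shell
# Round-trip sketch: BAM -> CRAM -> BAM with samtools.
# Assumption: the same reference FASTA is used for alignment and conversion.

# Convert a BAM to CRAM. Arguments: input BAM, reference FASTA, output CRAM.
bam_to_cram() {
    samtools view -C -T "$2" -o "$3" "$1"
}

# Convert a CRAM back to BAM. Arguments: input CRAM, reference FASTA, output BAM.
cram_to_bam() {
    samtools view -b -T "$2" -o "$3" "$1"
}

# Example usage (commented out so that sourcing this file has no side effects):
# bam_to_cram sample.bam ref.fa sample.cram && samtools index sample.cram
# cram_to_bam sample.cram ref.fa sample.roundtrip.bam && samtools index sample.roundtrip.bam
```

Each resulting file could then be passed to `--bam_fn` in turn, with every other option held fixed, so that the alignment container is the only variable.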

Would you know why this might be happening?

Clair3 command used:

run_clair3.sh \
--bam_fn=$IN_ALN \
--ref_fn=$REF \
--threads=16 \
--platform="ont" \
--var_pct_full=0.7 \
--ref_pct_full=0.1 \
--snp_min_af=0.08 \
--indel_min_af=0.15 \
--model_path=$MODEL \
--output=$OUTPUT_DIR \
--remove_intermediate_dir

Kind regards, Nicole

aquaskyline commented 2 weeks ago

Hi Nicole,

Would you be able to show us the variants that differ in the replicates you have? At the very least, we would like to know where the differing variants are located.

nicolechai commented 2 weeks ago

Sure, these are the variant differences that were seen in the replicates:

Replicate 1 Only present in CRAM file:

chr1    46788044    .   TA  T   0.00    LowQual F   GT:GQ:DP:AD:AF  0/1:0:23:5,3:0.1304
chr7    215642  .   C   G   5.64    PASS    F   GT:GQ:DP:AD:AF  0/1:5:27:12,5:0.1852
chr7    215753  .   T   G   5.53    PASS    F   GT:GQ:DP:AD:AF  0/1:5:27:6,4:0.1481
chr7    215855  .   C   T   5.47    PASS    F   GT:GQ:DP:AD:AF  0/1:5:27:12,5:0.1852
chr7    216365  .   TGG T   2.29    PASS    F   GT:GQ:DP:AD:AF  0/1:2:26:13,6:0.2308

Replicate 2 Only present in BAM file:

chr12   31648360        .       GT      G       0.00    LowQual F       GT:GQ:DP:AD:AF  0/1:0:30:6,7:0.2333

Replicate 3 Only present in BAM file:

chr2    175914347       .       G       GA      0.00    LowQual F       GT:GQ:DP:AD:AF  0/1:0:12:5,3:0.2500

Replicate 4 Only present in BAM file:

chr1    57957991        .       C       CT      0.00    LowQual F       GT:GQ:DP:AD:AF  0/1:0:22:5,4:0.1818
chr5    53203136        .       CG      C       9.82    PASS    F       GT:GQ:DP:AD:AF  0/1:9:18:10,7:0.3889

Replicate 5 Only present in CRAM file:

chr17   32741382        .       CA      C       0.00    LowQual F       GT:GQ:DP:AD:AF  0/1:0:28:12,5:0.1786

Replicate 6 Only present in CRAM file:

chr7    115261227       .       C       CT      0.83    LowQual F       GT:GQ:DP:AD:AF  0/1:0:30:24,5:0.1667

Looking in IGV, many of these calls appear to be at the start of, or within, repetitive regions.
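
A per-replicate list like the one above can be produced by comparing the two call sets on CHROM, POS, REF and ALT. The sketch below is one way to do this with plain awk and is not necessarily how the lists in this thread were generated. The function name `only_in_first` and the file names are hypothetical, and the inputs are assumed to be uncompressed VCFs (a `.vcf.gz` would need to be decompressed first, for example with `gzip -dc`).

```shell
# Print CHROM:POS:REF:ALT keys for records found in the first VCF but not in the second.
# Arguments: first VCF, second VCF (both uncompressed, tab-delimited).
only_in_first() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        { key = $1 ":" $2 ":" $4 ":" $5 }               # CHROM:POS:REF:ALT
        FILENAME == ARGV[1] { seen[key] = 1; next }     # remember records of the second VCF
        !(key in seen) { print key }                    # report records unique to the first VCF
    ' "$2" "$1"
}

# Example usage (hypothetical file names):
# only_in_first cram_calls.vcf bam_calls.vcf   # calls only present with CRAM input
# only_in_first bam_calls.vcf cram_calls.vcf   # calls only present with BAM input
```

This compares sites only. Records that share a site but differ in genotype or quality would not be reported, so a fuller comparison would also need to look at the FORMAT fields.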

aquaskyline commented 2 weeks ago

Thank you for the details, we will start investigating.

nicolechai commented 2 weeks ago

Sounds good, thanks a lot!