HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
246 stars 27 forks source link

Ref calls not printed with `--print_ref_calls` when `--vcf_fn=FILE` is specified #261

Closed gtollefson closed 8 months ago

gtollefson commented 10 months ago

Hi,

When I run the following clair3 command with --vcf_fn and --print_ref_calls provided, reference records report the genotype as ./. and don't report the other format fields (DP, etc.)

singularity exec \
            -B {params.INPUT_DIR} \
            -B {params.OUTPUT_DIR} \
            -B {params.REF_DIR} \
            -B {params.RESOURCE_DIR} \
            {input.sif_path} \
            /opt/bin/run_clair3.sh \
            --bam_fn={input.samples_to_run} \
            --ref_fn={input.REF} \
            --model_path=/opt/models/{params.MODEL_NAME} \
            --output={params.OUTPUT_DIR} \
            --threads={params.THREADS} \
            --platform=ont \
            --include_all_ctgs \
            --no_phasing_for_fa \
            --sample_name={wildcards.samples} \
            --gvcf \
            --include_all_ctgs \
            --threads=1 \
            --vcf_fn={params.targets_vcf} \
            --print_ref_calls

VCF output example showing missing ref fields:

Screenshot 2024-01-10 at 11 48 37 PM

When I run a the same clair3 command but without --vcf_fn, the reference genotypes are reported as 0/0 and other format fields are reported.

singularity exec \
            -B {params.INPUT_DIR} \
            -B {params.OUTPUT_DIR} \
            -B {params.REF_DIR} \
            -B {params.RESOURCE_DIR} \
            {input.sif_path} \
            /opt/bin/run_clair3.sh \
            --bam_fn={input.samples_to_run} \
            --ref_fn={input.REF} \
            --model_path=/opt/models/{params.MODEL_NAME} \
            --output={params.OUTPUT_DIR} \
            --threads={params.THREADS} \
            --platform=ont \
            --include_all_ctgs \
            --no_phasing_for_fa \
            --sample_name={wildcards.samples} \
            --gvcf \
            --include_all_ctgs \
            --threads=1 \
            --print_ref_calls

VCF output example showing present ref fields:

Screenshot 2024-01-10 at 11 49 51 PM

Is this expected behavior? I'd like to be able to run clair3 to report full records for reference calls with GT, GQ, and DP values when --vcf_fn is specified.

aquaskyline commented 10 months ago

Yes it is expected. In the second example, you enabled --gvcf, which incurs more calculations at the reference calls, thus giving you more details.

gtollefson commented 10 months ago

@aquaskyline Thanks for your response, but I don't think that answers my question since I enabled --gvcf for both examples - not just the second example. --vcf_fn is the only difference between the two runs.

Also the output examples I pasted above are both VCF file output (not gvcf files) from runs with --gvcf enabled which are automatically written along with the gvcf files. The gvcf output from the same run with --gvcf and --vcf_fn enabled does show 0/0 calls. I've pasted a screenshot from this run showing that the gvcf output shows 0/0 calls on the same sites that are reported as ./. in the vcf output.

Ref genotype missing at: chr4 748134 in VCF Screenshot 2024-01-16 at 12 00 39 PM

Ref genotype present at: chr4 748134 in GVCF from same run: Screenshot 2024-01-16 at 12 00 49 PM

This suggests that 0/0 genotypes and associated DP, GQ, AD values aren't reported in the VCF output (but are reported in the gvcf output) when --vcf_fn is specified.

We want to use the VCF output in our pipelines, but also need to run clair3 with '--vcf_fn' while outputting 0/0 reference genotypes and associated fields (DP, GQ, AD etc). Is this possible?

gtollefson commented 10 months ago

@aquaskyline Just following up here. I'm hoping to see if Clair3 can support our use case before we move on to trying out other tools - Clair3 our favorite choice currently but we need to be able to calculate and report DP, AD, etc. for reference calls in non-gvcf output when --vcf_fn is specified.

zhengzhenxian commented 10 months ago

Hi, @gtollefson,

The './.' variants indicate low-AF sites that were not processed by Clair3 and thus not subsequently included in the output. To output these variants as RefCall, you can try setting --snp_min_af=0.0 and --indel_min_af=0.0 in addition to --vcf_fn.

In the next version, we will set --snp_min_af=0.0 and --indel_min_af=0.0 as the default behavior when '--vcf_fn' is applied.

aquaskyline commented 9 months ago

@gtollefson may we collect your feedback on whether the two settings --snp_min_af=0.0 and --indel_min_af=0.0 solve your problem?

gtollefson commented 9 months ago

Hi @aquaskyline @zhengzhenxian,

Thank you very much for your help. Setting --snp_min_af=0.0 and --indel_min_af=0.0 along with --vcf_fn solved my problem. I am able to generate vcfs which report DP, AD, etc. for reference calls in non-gvcf output when --vcf_fn is specified.

However I just want to point out something interesting. The example site that was missing DP, AD, etc before shows an AF of 0.9094.

The './.' variants indicate low-AF sites that were not processed by Clair3 and thus not subsequently included in the output.

This site had high AF in the AF field. Instead could the reason they were excluded be that the sites had no alternate count in the AD field since they are reference calls?

Screenshot of new vcf output showing this site has high AF but no alternate count in AD field (as is expected for a reference call): Screenshot 2024-01-30 at 1 08 49 PM

aquaskyline commented 9 months ago

--snp_min_af and --indel_min_af actually refer to variant AF, a site with reference AF 0.9094 is with variant AF <0.1

gtollefson commented 9 months ago

Ok I see! Thank you. We appreciate your fast response and clean solution. Best, George

aquaskyline commented 8 months ago

Fixed in v1.0.6.