HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Excessive GT ./. in GVCF output #220

Closed eiwai81 closed 10 months ago

eiwai81 commented 1 year ago

Hello, and firstly, thank you for this tool. I am working on amplicon sequencing data, and my workflow involves generating GVCFs with Clair3 (with --gvcf enabled), which I then use as inputs for the CombineGVCFs and GenotypeGVCFs commands in GATK. However, when I compare a sample's genotype in the final merge_output.vcf produced by Clair3 with the genotype of the same sample in the final combined.vcf from GATK, there are some differences which affect other results downstream. See below.

# 1. Results from individual calls in merge_output.vcf.gz
# sample 1
Pos     REF     ALT     GT
544     C       .       0/0:31:974:850:0.8727:990,990,990
1156    G       .       0/0:31:980:829:0.8459:990,990,990
1544    A       .       0/0:31:984:859:0.873:990,990,990
1665    A       .       0/0:33:981:879:0.896:990,990,990

# sample 2
1156    G       .       0/0:34:97:88:0.9072:990,990,990
1665    A       .       0/0:24:98:94:0.9592:990,990,990

# sample 3
544     C       .       0/0:30:579:493:0.8515:990,990,990
1156    G       .       0/0:32:581:518:0.8916:990,990,990

# 2. Results from joint calling
POS     REF     ALT     Sample 1                                Sample 2                        Sample3
544     C       A       0/1:892,0:.:892:99:1038,0,20356         0/0:11,0:.:11:33:0,33,329       0/1:567,0:.:567:99:875,0,12724
1156    G       A       0/1:940,0:.:940:99:1044,0,21501         0/0:92,0:.:92:99:0,276,2759     0/0:510,0:.:510:99:0,812,13858
1544    A       G       0/1:1013,0:.:1013:99:494,0,23802        0/0:49,0:.:49:27:0,27,1229      0/0:11,0:.:11:3:0,3,269
1665    A       C       0/1:967,0:.:967:93:93,0,23100           0/0:46,0:.:46:18:0,18,1139      0/1:18,0:.:18:95:95,0,335

These are the parameters I used when I generated the GVCFs

run_clair3.sh \
--bam_fn=${sample}.bam \
--ref_fn=${ref} \
--threads=2 \
--platform=${platform} \
--model_path=${model_path} \
--output=${working_dir}/${sample} \
--include_all_ctgs \
--sample_name=${sample} \
--bed_fn=region.bed \
--gvcf \
--chunk_size=25000 \
--var_pct_full=1 \
--ref_pct_full=1 \
--print_ref_calls \
--no_phasing_for_fa \
--use_whatshap_for_final_output_phasing
# Note: --print_ref_calls is set so that a VCF record is generated even when there are no variants.
# (An inline comment after a "\" line continuation would break the command, so it is placed here.)

I have tried filtering the population VCF using DP, allele count, or frequency thresholds, but these have had little or no positive effect. So I was wondering if there are recommended parameters I could modify when generating the GVCFs with Clair3 that would improve the genotyping accuracy of GATK.

Thank you.

nreid commented 1 year ago

I am also very interested in this. I see similar discrepancies with CombineGVCFs -> GenotypeGVCFs: heterozygous genotypes are called in the joint VCF with zero read support for the alternate allele. Other people in the issues mention using GATK, but I can't find any examples showing how to do this correctly. The GenomicsDBImport tool does not work with Clair3 GVCFs either.

The readme also mentions using GLNexus to merge GVCFs, but it won't take Clair3 output (at least for my genome), as it appears to dislike soft-masked sequence: I get "Invalid: allele is not a DNA sequence" errors for alleles like "Ga".

aquaskyline commented 1 year ago

@nreid does GLNexus work if you simply unmask the masked sequences, like changing "Ga" to "GA"?

eiwai81 commented 1 year ago

@nreid @aquaskyline Thank you for mentioning GLNexus. I have given it a try using the clair3.yml config file, as shown below.

glnexus_cli --config clair3.yml bc*merge_output.gvcf.gz > combined_glnexus_test.bcf
bcftools view combined_glnexus_test.bcf | bgzip -@ 4 -c > combined_glnexus_test.vcf.gz
tabix -p vcf combined_glnexus_test.vcf.gz

The results from GLNexus are less erroneous and more believable, at least in my test. I will close this as completed.

nreid commented 1 year ago

@aquaskyline I just tried to modify the GVCF (not the reference genome) as follows and GLNexus appears to accept the file and produces a joint genotype VCF.

zcat merge_output.gvcf.gz | \
    awk -F'\t' -v OFS='\t' '/^[^#]/{$4=toupper($4); $5=toupper($5)} {print $0}' | \
    bgzip > merge_output_converted.gvcf.gz

The genotypes do seem to match the calls in the individual clair3 VCF files, but one thing I'm not clear on is why REF/REF samples with plenty of read coverage are given missing genotypes.

They seem to be passed on from the clair3 GVCF files, where the genotypes are also uncalled, but it's still not clear why. Looking at the alignments in IGV, I don't see any problems.

Here is an example:

From GLNexus, the "II" in the RNC field indicates no call in GVCF:

NC_000020.11    34283744    NC_000020.11_34283744_A_G   A   G   34  .   .   GT:DP:AD:GQ:PL:RNC  0/1:104:40,50:22:32,0,79:.. 0/1:94:45,34:21:34,0,62:..  ./.:60:60,0:0:89,0,1349:II

For the third sample, with the no-call genotype, here is the GVCF for the surrounding sites:

NC_000020.11    34283742    .   T   <NON_REF>   0   .   END=34283742    GT:GQ:MIN_DP:PL 0/0:12:54:0,12,1319
NC_000020.11    34283743    .   G   <NON_REF>   0   .   END=34283743    GT:GQ:MIN_DP:PL 0/0:50:50:0,60,1319
NC_000020.11    34283744    .   A   <NON_REF>   0   .   END=34283744    GT:GQ:MIN_DP:PL ./.:0:60:89,0,1349
NC_000020.11    34283745    .   G   <NON_REF>   0   .   END=34283745    GT:GQ:MIN_DP:PL ./.:0:55:14,0,1304
NC_000020.11    34283746    .   T   <NON_REF>   0   .   END=34283746    GT:GQ:MIN_DP:PL ./.:0:55:194,0,1124

Here is an IGV screenshot of the pileup for the three samples in the same order:

[IGV screenshot of the pileups]

I'd appreciate any insight. I'm adding this comment to this issue because it's generally still on topic. The data are ultralong ONT reads from the GIAB Ashkenazi trio, and I am exploring approaches for future work. Having a bunch of no-calls where the data seem to support a reference call is problematic for me. Even after filtering down to biallelic SNPs with GLNexus variant quality > 30, I still have 5-10% missing genotypes per sample in this test region.

eiwai81 commented 1 year ago

@aquaskyline May I add that I have similar issues with missing genotypes, as mentioned by @nreid. It is hard to understand why, given that the reads in my dataset are generally 1.5 kb to 10 kb amplicons and read depths are as high as 1000x. Since my analysis is sensitive to missing data, I let Beagle impute the missing genotypes when phasing the SNPs, which may not really be the best approach. It would go a long way if this could be dealt with during the GVCF and VCF generation steps in Clair3.

zhengzhenxian commented 1 year ago

@nreid @eiwai81

Regarding the issue of missing genotypes (represented as "./.") in Clair3's GVCF output: three PL (Phred-scaled likelihood) values are calculated, corresponding to the likelihoods of being homozygous reference (hom_ref), homozygous alternative (hom_alt), and heterozygous (het). The calculation of the hom_ref value relies on the reference coverage over the total coverage. If the hom_ref likelihood is not the highest of the three, the genotype changes from "0/0" to "./.".

This logic follows the guidelines set by GATK. The presence of "./." does not indicate a failure to make a genotype call; it only signifies that the candidate site possesses obvious alternative read support, including mismatches and indels.
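The rule described above can be sketched roughly as follows (a hypothetical helper for illustration, not Clair3's actual code):

```python
# Sketch of the GVCF genotype rule described above (hypothetical, not
# Clair3's implementation): "0/0" is only reported when the hom-ref
# likelihood is the largest of the three; otherwise the GVCF records "./.".
def gvcf_genotype(lik_hom_ref, lik_het, lik_hom_alt):
    # Keep the reference call only if hom-ref is the most likely genotype.
    if lik_hom_ref >= max(lik_het, lik_hom_alt):
        return "0/0"
    return "./."

print(gvcf_genotype(0.90, 0.05, 0.05))  # -> 0/0
print(gvcf_genotype(0.20, 0.70, 0.10))  # -> ./.
```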

aquaskyline commented 1 year ago

The logic is correct, but we are checking our implementation at extremely high coverage (DP>1000). We will get back soon.

nreid commented 1 year ago

@zhengzhenxian thanks for the explanation. There's something I'm not quite grasping though. At position 34283744 in the example above, there are 53 reads in the IGV pileup. 51 of them are A (ref), 2 are C and there are 2 insertions. I'm not clear why the highest likelihood would not be "0/0", as it is at site 34283743, only one base upstream (which has 47 G (ref), 1 A, 5 insertions and 1 deletion), and why, if the highest likelihood is "0/1", is that not the assignment? In the end, the GLNexus output remains "missing", even though there seems to be plenty of read support for a "0/0" genotype call. This seems inconsistent with the behavior of other variant callers and I'm not sure why.

aquaskyline commented 1 year ago

After some investigation, there seem to be problems that lead to an excessive number of './.' calls at very high depth (DP>1000). There are multiple problems, including possible floating-point overflow, so @zhengzhenxian and I need a few more days to figure out a robust solution.

PS - We verified that 'floating point number overflow' was not happening in Clair3's GVCF output module.

ShuminBAL commented 12 months ago

@nreid @eiwai81

The process for generating a GVCF is as follows: first, we calculate the likelihoods of three different genotypes: homozygous reference (0/0), heterozygous (0/1), and homozygous alternate (1/1). These likelihoods are computed from factors such as the total coverage at the site, the number of reference bases, and the number of non-reference bases. Additionally, we determine the AF of each potential variant. If the AF of any variant exceeds a predefined threshold (default: 0.08), the site is considered a candidate for variant calling and is subsequently processed by either the pileup or the full-alignment model.
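The candidate-selection step above can be sketched as follows (a hypothetical helper, not Clair3's actual code):

```python
# Sketch of the candidate-selection rule described above (hypothetical,
# not Clair3's code): a site is only sent to the pileup/full-alignment
# model if some alternate allele's frequency exceeds the threshold
# (default 0.08).
def is_variant_candidate(alt_counts, coverage, af_threshold=0.08):
    return any(n / coverage > af_threshold for n in alt_counts.values())

# Example: the largest alternate allele count is 4 at coverage 151,
# so the maximum AF is 4/151 ~= 0.026 and the site is not a candidate,
# even though the total non-reference count (18) is sizeable.
alt_counts = {'-GC': 4, '-GCC': 3, 'G': 3, '-g': 2, '*': 2,
              '+TT': 1, '-gc': 1, '-G': 1, '-gccg': 1}
print(is_variant_candidate(alt_counts, 151))  # -> False
```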

Therefore, in cases where a site may appear as if it should be homozygous reference (0/0) but is instead denoted as "./.", the most common reason is that, in the alignment result, there are multiple potential variants at the same position. These variants, combined with a relatively high count of non-reference bases, can influence the likelihood calculations. This can result in the likelihood of 0/1 being the highest, even though the site may not meet the AF threshold for a variant to be called.

For example, given a site with coverage of 151 and alternate allele counts {'-GC': 4, '-GCC': 3, 'G': 3, '-g': 2, '*': 2, '+TT': 1, '-gc': 1, '-G': 1, '-gccg': 1}, we have #ref_base=133 and #alt_base=18. With the default base_err of 0.001, the likelihoods are:

likelihood(0/0) = (1 - 0.001)^133 * 0.001^18 ≈ 8.75e-55
likelihood(0/1) = 0.5^151 ≈ 3.5e-46

Since likelihood(0/1) > likelihood(0/0), the site is emitted as "./.".
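The arithmetic above can be reproduced with a short sketch (hypothetical helper names, not Clair3's code); it also shows that a larger base-error prior flips the comparison:

```python
# Reproducing the likelihood arithmetic above (hypothetical helpers,
# not Clair3's implementation). 133 reference bases, 18 non-reference
# bases, coverage 151.
def hom_ref_likelihood(n_ref, n_alt, base_err):
    return (1 - base_err) ** n_ref * base_err ** n_alt

def het_likelihood(coverage):
    return 0.5 ** coverage

lik_het = het_likelihood(151)  # ~3.5e-46

# Default base_err = 0.001: het dominates, so the site becomes "./.".
print(hom_ref_likelihood(133, 18, 0.001) < lik_het)  # -> True

# A larger base-error prior (e.g. 0.03) makes hom-ref dominate ("0/0").
print(hom_ref_likelihood(133, 18, 0.03) > lik_het)   # -> True
```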

To address this issue, you can adjust the base error rate parameter according to the specific platform and alignment results you are working with. Based on our experiments, for example, when processing chr20 of a 144X ONT sample, setting base_err to 0.03 significantly reduced the occurrence of missing genotypes.

[attached image]

We are planning to provide base_err presets for various platforms in an upcoming release. Until then, you can manually modify the base_err parameter in shared/param_p.py, line 26.

nreid commented 11 months ago

Thank you, this explanation makes a lot of sense. So just to clarify one thing, when calculating the likelihoods, you are using this fixed base error probability instead of the actual base qualities in the input bam file?

aquaskyline commented 11 months ago

As Clair3 outputs GVCF using the same logic as GATK, base_err is a fixed prior. Setting it to an average base error rate will be good enough.

eiwai81 commented 11 months ago

@ShuminBAL @aquaskyline Many thanks for the clarification and recommendation. If I understand the solution correctly, one would need to pass the base-calling error rate of the sequencing platform and chemistry (e.g. R9 or R10) to the base_err parameter. So in a case where the average read accuracy is 93%, a base_err of 0.07 can be used. Is this correct?
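For reference, the standard Phred relationship between quality and error rate (a general definition, not specific to Clair3):

```python
# Standard Phred conversion: error_rate = 10^(-Q/10).
def phred_to_error(q):
    return 10 ** (-q / 10)

# A mean read accuracy of 93% corresponds to base_err = 1 - 0.93 = 0.07,
# i.e. roughly Phred Q11-Q12.
print(round(1 - 0.93, 2))              # -> 0.07
print(round(phred_to_error(11.5), 3))  # -> 0.071
```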

aquaskyline commented 11 months ago

@eiwai81. Correct.

eiwai81 commented 11 months ago

@aquaskyline Thanks.

zhengzhenxian commented 10 months ago

@eiwai81 Added --base_err and --gq_bin_size options in v1.0.5.