HTGenomeAnalysisUnit / SCE-VCF

Sample Contamination Estimate from VCF
MIT License
18 stars 0 forks source link

-nan CHARR contamination values #2

Open raonyguimaraes opened 3 months ago

raonyguimaraes commented 3 months ago

When running sce-vcf for one sample I get:

[12:06:24-884] - INFO: SCE-VCF v0.1.2 [12:06:24-884] - ARG: Input VCF: @["sample.sortdup.bqsr.hc.vcf.gz"] [12:06:24-884] - ARG: Output: stdout [12:06:24-884] - ARG: AD field: AD [12:06:24-884] - ARG: AF field: AF [12:06:24-884] - ARG: Is refAF: false [12:06:24-884] - ARG: REF AF limits: 0.1,0.9 [12:06:24-884] - ARG: Het AB limit: 0.25,0.75 [12:06:24-884] - ARG: Min GQ: 20 [12:06:24-884] - ARG: DP limit: 20,100

SAMPLE HQ_HOM HQ_HOM_RATE HQ_HET HQ_HET_RATE CHARR MEAN_REF_AB_HOM_ALT HETEROZYGOSITY_RATE INCONSISTENT_AB_HET_RATE

[12:06:24-907] - INFO: Variant processing started ... [12:07:00-100] - INFO: 6160954 variants processed, 1960156 vars ignored: 87866 multiallelic, 989349 indels, 882941 outside ref AF limits, 0 missing AF tag [12:07:00-101] - INFO: Computing contamination values PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501 [12:07:00-101] - INFO: Computed contamination values for 1 sample(s) in 33 seconds and 974 milliseconds

I tried normalizing it with "bcftools view -f PASS,. -m2 -M2 -v snps" without much success:

[12:14:15-051] - INFO: 5033767 variants processed, 882941 vars ignored: 0 multiallelic, 0 indels, 882941 outside ref AF limits, 0 missing AF tag [12:14:15-052] - INFO: Computing contamination values PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501 [12:14:15-052] - INFO: Computed contamination values for 1 sample(s) in 22 seconds and 540 milliseconds

Anyway to debug this error? There should be a --debug function. :D

What else can I do to help get this fixed?

Best Regards,

edg1983 commented 1 month ago

Hi!

Apologies for being unresponsive. I was swamped with new stuff, but I have some time now trying to fix these bugs.

If I understand correctly, you get a result file like this at the end of the run.

#SAMPLE HQ_HOM HQ_HOM_RATE HQ_HET HQ_HET_RATE CHARR MEAN_REF_AB_HOM_ALT HETEROZYGOSITY_RATE INCONSISTENT_AB_HET_RATE
PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501

In this case, it seems that the tool cannot find any high-quality ALT homozygous site in your VCF. To determine HQ homozygous genotypes, the tool first skips indels and variants outside configured AF limits and then considers the configured GQ and DP threshold for variants within the. Here, you have HQ_HOM sites == 0. Thus, it may be that in your VCF file, there are no ALT homozygous genotypes satisfying all the criteria. Given that there are no HQ homozygous sites left, the CHARR value and AB value on HQ sites can not be computed, and thus, you have nan values in the results.

Indeed, the heterozygosity rate in your results is 1.0, which means that across HQ-selected sites, you have only heterozygous genotypes.

Can you check if there is any genotype passing the HQ criteria?

You can get examples of HQ variants using bcftools.

bcftools query -i 'GT="AA" & FMT/DP >= 20 & FMT/DP <= 100 & GQ >= 20 & TYPE="snp" & AF <= 0.9 & AF >= 0.1' -f '[%CHROM %POS %REF %ALT %SAMPLE %GT %GQ %DP\n]' input.vcf.gz
edg1983 commented 1 month ago

While I perform a final test, you can try the new executable I posted in #1.

If no HQ data is available after filtering, your sample will still show' nan' values, but you should see a better log message warning you about this.

Let me know how it works for you.

edg1983 commented 1 month ago

I've posted a new release v0.1.3. Let me know if this solves the issue, at least in terms of the better log.