Open raonyguimaraes opened 3 months ago
Hi!
Apologies for being unresponsive. I was swamped with new stuff, but I have some time now trying to fix these bugs.
If I understand correctly, you get a result file like this at the end of the run.
#SAMPLE HQ_HOM HQ_HOM_RATE HQ_HET HQ_HET_RATE CHARR MEAN_REF_AB_HOM_ALT HETEROZYGOSITY_RATE INCONSISTENT_AB_HET_RATE
PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501
In this case, it seems that the tool cannot find any high-quality ALT homozygous site in your VCF. To determine HQ homozygous genotypes, the tool first skips indels and variants outside configured AF limits and then considers the configured GQ and DP threshold for variants within the.
Here, you have HQ_HOM sites == 0
. Thus, it may be that in your VCF file, there are no ALT homozygous genotypes satisfying all the criteria.
Given that there are no HQ homozygous sites left, the CHARR value and AB value on HQ sites can not be computed, and thus, you have nan
values in the results.
Indeed, the heterozygosity rate in your results is 1.0, which means that across HQ-selected sites, you have only heterozygous genotypes.
Can you check if there is any genotype passing the HQ criteria?
You can get examples of HQ variants using bcftools.
bcftools query -i 'GT="AA" & FMT/DP >= 20 & FMT/DP <= 100 & GQ >= 20 & TYPE="snp" & AF <= 0.9 & AF >= 0.1' -f '[%CHROM %POS %REF %ALT %SAMPLE %GT %GQ %DP\n]' input.vcf.gz
While I perform a final test, you can try the new executable I posted in #1.
If no HQ data is available after filtering, your sample will still show' nan' values, but you should see a better log message warning you about this.
Let me know how it works for you.
I've posted a new release v0.1.3. Let me know if this solves the issue, at least in terms of the better log.
When running sce-vcf for one sample I get:
[12:06:24-884] - INFO: SCE-VCF v0.1.2 [12:06:24-884] - ARG: Input VCF: @["sample.sortdup.bqsr.hc.vcf.gz"] [12:06:24-884] - ARG: Output: stdout [12:06:24-884] - ARG: AD field: AD [12:06:24-884] - ARG: AF field: AF [12:06:24-884] - ARG: Is refAF: false [12:06:24-884] - ARG: REF AF limits: 0.1,0.9 [12:06:24-884] - ARG: Het AB limit: 0.25,0.75 [12:06:24-884] - ARG: Min GQ: 20 [12:06:24-884] - ARG: DP limit: 20,100
SAMPLE HQ_HOM HQ_HOM_RATE HQ_HET HQ_HET_RATE CHARR MEAN_REF_AB_HOM_ALT HETEROZYGOSITY_RATE INCONSISTENT_AB_HET_RATE
[12:06:24-907] - INFO: Variant processing started ... [12:07:00-100] - INFO: 6160954 variants processed, 1960156 vars ignored: 87866 multiallelic, 989349 indels, 882941 outside ref AF limits, 0 missing AF tag [12:07:00-101] - INFO: Computing contamination values PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501 [12:07:00-101] - INFO: Computed contamination values for 1 sample(s) in 33 seconds and 974 milliseconds
I tried normalizing it with "bcftools view -f PASS,. -m2 -M2 -v snps" without much success:
[12:14:15-051] - INFO: 5033767 variants processed, 882941 vars ignored: 0 multiallelic, 0 indels, 882941 outside ref AF limits, 0 missing AF tag [12:14:15-052] - INFO: Computing contamination values PGP0622076 0 -nan 3712593 0.99375 -nan -nan 1.00000 0.30501 [12:14:15-052] - INFO: Computed contamination values for 1 sample(s) in 22 seconds and 540 milliseconds
Anyway to debug this error? There should be a --debug function. :D
What else can I do to help get this fixed?
Best Regards,