Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 30 forks source link

Indel haploid calls printed as diploid #21

Closed brianhill11 closed 5 years ago

brianhill11 commented 6 years ago

Hi there,

We came across this issue while trying to run bcftools norm command to merge multi-line variants into a single line. Here is the command and error:

bcftools norm -m +any -Oz -o output.sample1234.vcf.gz sample1234.vcf.gz
Error at X:48386632: cannot combine diploid with haploid genotype

It was trying to combine these records (sample seems to be a male):

X       48386632        seq-rs797045547 T       TG      .       PASS    .       GT:GQ   0:0.474965
X       48386632        seq-rs587783616 T       G       .       PASS    .       GT:GQ   0/1:0.409332

The 0/1 genotype call doesn't make sense to me for a male sample. It seems that there are many calls on the X chromosome that are haploid, and many that are diploid. My guess is that this piece of the code is part of the issue:

def format_vcf_genotype(vcf_allele1_char, vcf_allele2_char, ploidy):
    """
    Create a VCF representation of the genotype based on alleles. Format appropriately if haploid.
    Args:
        vcf_allele1_char (string): 0,1,2 etc.
        vcf_allele2_char (string): 0,1,2, etc.
        vcf_record (vcf._Record): Record for the entry analyzed
        ploidy(int): Expected ploidy.
    Returns
        string: String representation of genotype (e.g., "0/1"), adjusted for haploid calls if applicable
    """
    assert ploidy < 3

   if ploidy == 1:
        if vcf_allele1_char == vcf_allele2_char:
            return str(vcf_allele1_char)

    vcf_genotype = ""
    if vcf_allele2_char < vcf_allele1_char:
        vcf_genotype = str(vcf_allele2_char) + "/" + str(vcf_allele1_char)
    else:
        vcf_genotype = str(vcf_allele1_char) + "/" + str(vcf_allele2_char)
    return vcf_genotype

Since indels are printed out as either 0/1 or 1/0 (see the convert_indel_genotype_to_vcf function) the check if vcf_allele1_char == vcf_allele2_char: never evaluates to true, and therefore falls through and gets printed as a diploid call.

My thought was that something like this might make more sense:

if ploidy == 1:
        if vcf_allele1_char == vcf_allele2_char:
            return str(vcf_allele1_char)
        else: 
            return str(max(vcf_allele1_char, vcf_allele2_char))

Does that seem reasonable? If so, I can open a pull request.

Thanks, Brian

KelleyRyanM commented 6 years ago

@brianhill11 , The existing behavior is intentional, although I suppose it is debatable whether it is appropriate. For these cases, you need to either decide to

neither of which is desirable. I ultimately went with the latter, as the information is still available to filter these after the fact. I don't think I would want to arbitrarily force one of the homozygous genotypes, as that has the potential to further mask the issue downstream and introduce erroneous calls. Other options might include

brianhill11 commented 6 years ago

Hi Ryan,

Thanks for the reply. It sounds like this is best handled downstream, and not during conversion.

I'm still trying to wrap my head around what a 0/1 (or 1/0) call would mean in the haploid situation. How could you have both the reference and the alternate when only one should be possible?

Thanks, Brian