AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License
2 stars 11 forks source link

Fix the bug of writeCalledPhase() #82

Closed XingerTang closed 8 months ago

XingerTang commented 1 year ago

When the probabilities of having both alleles are equal and the calling threshold is less than 0.5, the function writeCalledPhase() would directly choose the reference allele, which hides the possibility of having the alternative allele.

def writeCalledPhase(pedigree, genoProbFunc, outputFile, thresh):
    with open(outputFile, 'w+') as f:
        for idx, ind in pedigree.writeOrder():
            matrix = genoProbFunc(ind.idn)

            # Paternal            
            paternal_probs = np.array([matrix[0,:] + matrix[1,:], matrix[2,:] + matrix[3,:]], dtype=np.float32)
            paternal_haplotype = np.argmax(paternal_probs, axis = 0)
            setMissing(paternal_haplotype, paternal_probs, thresh)
            f.write(ind.idx + ' ' + ' '.join(map(str, paternal_haplotype)) + '\n')

            #Maternal
            maternal_probs = np.array([matrix[0,:] + matrix[2,:], matrix[1,:] + matrix[3,:]], dtype=np.float32)
            maternal_haplotype = np.argmax(maternal_probs, axis = 0)
            setMissing(maternal_haplotype, maternal_probs, thresh)
            f.write(ind.idx + ' ' + ' '.join(map(str, maternal_haplotype)) + '\n')

If the input genotype probabilities are the following

0
0.5
0.5
0

then both paternal_probs and maternal_probs would become [0.5, 0.5]. np.argmax(paternal_probs, axis=0) would return the index of the highest elements of pathernal_probs and maternal_probs, and when there are two or more equal maximum values occur, it would return the first index of the maximum number. So in this case it would directly return the index 0, which corresponds to the reference allele. If the calling threshold is less than 0.5, the setMissing() would keep the 0 produced by the np.argmax() instead of setting it missing.

The reason why a low calling threshold is allowed is that it ought to be used for the called genotypes. For called genotypes, it is still possible that the genotype with the highest probability is still around 0.3. For example,

0.31
0.46
0.23

But the genotype would be marked missing if the calling threshold is higher than 0.31.

For called haplotype output, there are only two possibilities for each locus and each haplotype. It is impossible that the highest possibility of any allele is less than 0.5, so a calling threshold lower than 0.5 makes no sense.

One way to get around this is by separating the calling threshold input for called genotypes and called haplotypes. And put a lower bound of 0.5 for the calling threshold for the called haplotypes (similarly, we should also put a lower bound of 0.33 for the called genotypes).

I'm also thinking about maybe evaluating the percentage of the difference between the highest possibility and the next possibility to the highest possibility is a better way to select the reliable output. But it would take a longer time to compute the called output.

XingerTang commented 1 year ago

@gregorgorjanc As we are going to introduce a new option -haps_call_threshold for the called haplotypes output, can we remove the -call_phase option? Since -call_phase is only used for letting AlphaPeel know that the user wants the called haplotype output, while -haps_call_threshold is only useful when the user wants the called haplotypes output. We can let -haps_call_threshold do both that the AlphaPeel is going to output the called haplotypes if and only if a haplotype calling threshold is input. It would be more consistent with the way we use the genotype calling threshold as well, as the called genotype files are outputted if and only if a genotype calling threshold is given.

gregorgorjanc commented 1 year ago

@XingerTang let's leave both arguments for now, though I agree with you that I it feels a bit redundant.

XingerTang commented 11 months ago

We want -geno, -geno_threshold with output names .geno_x.txtand -hap, -hap_threshold with output names .hap_x.txt.

AprilYUZhang commented 8 months ago

@XingerTang Hi, Evie, I guess we have dealt with this issue. Could you please help close it? I'm checking the table that records our task and will close all issues that we have finished.

XingerTang commented 8 months ago

@AprilYUZhang Sure!