cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 50 forks source link

LD r greater than 1, then fixes if I rerun it #308

Open raqueldias opened 4 years ago

raqueldias commented 4 years ago

whenever I try to run the rogers_huff_r_between function, or the inner function called by it, sometimes I get values greater than 1 in my results. The most interesting thing is that I get a different values sometimes if I try to rerun it, for example, I get the following output when I run the code bellow in jupyterhub notebook (unfortunately I cannot share the genotypes). Any clues on why rerunning the same code with same inputs would change the values?

Output: python versus cython 5.006416 5.006416 recalculated, python 0.07222055

code I running after loading VCF: CAD_g = allel.GenotypeArray(CAD_GT,dtype='i1') GA_g = allel.GenotypeArray(GA_GT,dtype='i1')

            ac = GA_g.count_alleles()
            af = ac.to_frequencies()
            variant_selection = (af[:,1]>=0.001) & (af[:,1]<=0.999)
            GA_g = GA_g.subset(variant_selection)
            CAD_gn = CAD_g.to_n_alt(fill=-1)
            GA_gn = GA_g.to_n_alt(fill=-1)

            r = allel.rogers_huff_r_between(CAD_gn,GA_gn)
            rB = allel.stats.ld.gn_pairwise2_corrcoef_int8(CAD_gn,GA_gn)
            r2 = r ** 2
            rB2 = rB ** 2
            my_max = np.max(np.nan_to_num(r2,0))

            my_maxB = np.max(np.nan_to_num(rB2,0))

            max_r2.append(my_maxB)

            if(my_max>1.0 or my_maxB>1.0):
                print("python versus cython", my_max, my_maxB)
                r = allel.rogers_huff_r_between(CAD_gn,GA_gn)
                r2 = r ** 2
                my_max = np.max(np.nan_to_num(r2,0))
                print("recalculated, python", my_max)
alimanfoo commented 4 years ago

Hi @raqueldias, sorry for slow reply. I'm surprised that the same outputs would give different results, outside the bounds of floating point error. Would it be possible to share a minimal dataset where this occurs?

raqueldias commented 4 years ago

I think I figured out what was going on. GA_g and CAD_g in the example above accidentally have different numbers of variants and samples. So if rogers_huff_r_between function doesn't expect inputs with different shapes, I can predict that many artifacts might have been generated when making the haplotypes for calculating LD. But no warning was printed out, so I didn't notice the problem until I start to calculate LD by hand in a peace of paper haha... I recommend that a little sanity check could provide a helpful warning if the data shapes don't match.

alimanfoo commented 4 years ago

Ah, OK, starting to make sense. It's a while since I used those functions but IIRC the idea for rogers_huff_r_between is to calculate LD between variants from two different regions of the genome, but with the same samples. So the numbers of variants in the two input arrays can be different, but the numbers of samples should be the same, and in fact they should be data from the same samples otherwise it's a bit meaningless. (I used this function to calculate LD decay curves.) Definitely some checks and some better documentation needed here.