Open raqueldias opened 4 years ago
Hi @raqueldias, sorry for slow reply. I'm surprised that the same outputs would give different results, outside the bounds of floating point error. Would it be possible to share a minimal dataset where this occurs?
I think I figured out what was going on. GA_g and CAD_g in the example above accidentally have different numbers of variants and samples. So if the rogers_huff_r_between function doesn't expect inputs with different shapes, I suspect many artifacts were generated when building the haplotypes for calculating LD. But no warning was printed, so I didn't notice the problem until I started calculating LD by hand on a piece of paper haha... A small sanity check that warns when the data shapes don't match would be really helpful.
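For what it's worth, a minimal sketch of such a check (pure NumPy, function name hypothetical, not part of scikit-allel) could look like:

```python
import numpy as np

def check_ld_inputs(gna, gnb):
    """Raise if two (n_variants, n_samples) alt-allele count arrays
    cannot be meaningfully compared for LD.

    Variant counts may differ between the two arrays, but the sample
    dimension must match, since r is computed across the same samples.
    """
    gna, gnb = np.asarray(gna), np.asarray(gnb)
    if gna.ndim != 2 or gnb.ndim != 2:
        raise ValueError("expected 2-D (variants x samples) arrays")
    if gna.shape[1] != gnb.shape[1]:
        raise ValueError(
            "sample count mismatch: %d vs %d" % (gna.shape[1], gnb.shape[1])
        )
```

This would have caught my case immediately instead of silently producing garbage.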
Ah, OK, starting to make sense. It's a while since I used those functions, but IIRC the idea for rogers_huff_r_between is to calculate LD between variants from two different regions of the genome, but with the same samples. So the numbers of variants in the two input arrays can differ, but the numbers of samples should be the same; in fact they should be data from the same samples, otherwise the comparison is a bit meaningless. (I used this function to calculate LD decay curves.) Definitely some checks and some better documentation needed here.
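To make that concrete: my understanding is that the Rogers-Huff r between two variants is essentially the Pearson correlation of their alt-allele counts across the shared samples, so a pure-NumPy sketch of the between-regions version (illustrative only, not scikit-allel's actual implementation) might look like:

```python
import numpy as np

def rh_r_between_sketch(gna, gnb):
    """Sketch: pairwise r between variants of two genome regions.

    gna: (m, n) alt-allele counts for region A;
    gnb: (k, n) alt-allele counts for region B.
    The sample axis (n) must be shared. Returns an (m, k) matrix of r,
    computed as the Pearson correlation of allele counts across samples.
    """
    gna = np.asarray(gna, dtype=float)
    gnb = np.asarray(gnb, dtype=float)
    assert gna.shape[1] == gnb.shape[1], "inputs must share samples"
    # center each variant's counts across samples
    a = gna - gna.mean(axis=1, keepdims=True)
    b = gnb - gnb.mean(axis=1, keepdims=True)
    # correlation = covariance / product of standard deviations
    num = a @ b.T
    den = np.sqrt((a ** 2).sum(axis=1)[:, None] * (b ** 2).sum(axis=1)[None, :])
    return num / den
```

Because this is just a correlation, |r| can never exceed 1 for well-formed inputs, which is one way to sanity-check results like the >1 values reported above.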
Whenever I run the rogers_huff_r_between function, or the inner function it calls, I sometimes get values greater than 1 in my results. The most interesting thing is that I sometimes get different values if I rerun it; for example, I get the following output when I run the code below in a JupyterHub notebook (unfortunately I cannot share the genotypes). Any clues on why rerunning the same code with the same inputs would change the values?
Output:

```
python versus cython 5.006416 5.006416 recalculated, python 0.07222055
```
Code I'm running after loading the VCF:

```python
CAD_g = allel.GenotypeArray(CAD_GT, dtype='i1')
GA_g = allel.GenotypeArray(GA_GT, dtype='i1')
```