bvilhjal / ldpred

MIT License
96 stars 57 forks source link

Bug in opposite strand checking in coord #92

Closed piotor87 closed 4 years ago

piotor87 commented 5 years ago

I'm running a bunch of PRS for many datasets and I ran into a probable bug in the coord phase. I noticed that this debug print seemed suspicious:

Coordinating data for chromosome chrom_9
Found 397889 SNPs on chrom 9 that were common across all datasets
Ordering SNPs by genomic positions (based on LD reference genotypes).
Nucleotide concordance counts out of 397889 genotypes, rg-ss: 362903
Nucleotides don't match after all?: g_sid=chr9_42729141_C_T, ss_sid=chr9_42729141_C_T, g_i=217616, ss_i=160113, g_nt=['T', 'C'], ss_nt=['A' 'G']

This is just the head of the output but this happens systematically. As you can see TC is not mapped to AG, although it clearly should be.

Looking at the code it seems that there is a bug in the logical operations with python at line 340 of coord_genotypes.py. After the opposite strand variant is calculated:

os_g_nt = sp.array([util.opp_strand_dict[g_nt[0]], util.opp_strand_dict[g_nt[1]]])
flip_nts = False

Then the script checks if either the original or the opposite strand variant match the variant from the sumstat file:

if not sp.all(g_nt == ss_nt) or sp.all(os_g_nt == ss_nt):

However, the first logical operation seems to be wrong. The not operation only applies to the first boolean evaluation, thus causing the aforementioned output. It should be

if not (sp.all(g_nt == ss_nt) or sp.all(os_g_nt == ss_nt)):

In fact, the following step is built this way and it works fine. First the or boolean operation is defined and then it's used in the if statement:

flip_nts = (g_nt[1] == ss_nt[0] and g_nt[0] == ss_nt[1]) or (os_g_nt[1] == ss_nt[0] and os_g_nt[0] == ss_nt[1])
if flip_nts:
   ...
babak-ra commented 5 years ago

This is fixed in https://github.com/bvilhjal/ldpred/pull/88.