fgvieira / ngsLD

Calculation of pairwise Linkage Disequilibrium (LD) under a probabilistic framework
GNU General Public License v2.0
45 stars 7 forks source link

LD values are NA for r2 EM and D prime EM #44

Open aguilar-gomez opened 1 year ago

aguilar-gomez commented 1 year ago

Hi,

I ran ngsLD and am interested in the r2 from the EM algorithm. When I am looking at the values for some positions, the D is negative, Dprime and r2 are nan. Do you know what is happening or how to fix it? I printed the extended output to see if that helps.

pos1 pos2 distance r2p D Dp r2 nsamples mafpos1 mafpos2 hap00 hap01 hap10 hap11 hap_maf1 hap_maf2 chi2 loglike nIter
P_RNA_scaffold_10726:56595 P_RNA_scaffold_10726:58826 2231 0.103631 -0.045278 NA NA 49 0.664285 0.155807 0.206242 0.069226 0.706824 0.017708 0.724532 0.086934 0.129409 0 11
P_RNA_scaffold_10726:56585 P_RNA_scaffold_10726:58826 2241 0.118901 -0.045282 NA NA 49 0.667382 0.155807 0.206464 0.069283 0.706497 0.017757 0.724254 0.08704 0.129204 0 11
P_RNA_scaffold_10726:35711 P_RNA_scaffold_10726:58826 23115 0.091501 -0.028652 NA NA 46 0.657558 0.155807 0.232471 0.051776 0.686177 0.029577 0.715753 0.081353 0.053992 0 11
P_RNA_scaffold_10726:23779 P_RNA_scaffold_10726:58826 35047 0.12464 -0.044908 NA NA 49 0.687809 0.155807 0.199039 0.06824 0.713665 0.019055 0.73272 0.087295 0.129248 0 12
fgvieira commented 1 year ago

Can you send the ngsLD command you used?

aguilar-gomez commented 1 year ago

ngsLD --n_threads 10 --geno CMScaffoldsofInterestMaf5.beagle.gz --n_ind 63 --n_sites 1145 --outH CMLDMaf5probs_thre --pos scaffoldsOfInterestmaf5.sites --max_kb_dist 0 --max_snp_dist 0 --extend_out --probs --call_geno --call_thresh .9 --min_maf 0.1 --ignore_miss_data

fgvieira commented 1 year ago

The problem is that site P_RNA_scaffold_10726:58826 has a lower freq (when calculated through the haplotypes) than --min_maf 0.1. The idea was to exclude LD from sites with very low freq, but I guess it might be better to leave that filtering to the user.

aguilar-gomez commented 1 year ago

Why is it differing the maf calculated by angsd vs the haplotype based calculation? I think it would be useful that position was completely excluded from the output, having the nan just makes it harder for plotting and interpreting. Or maybe having also a column with the calculated maf by haplotypes to understand what is happening.

fgvieira commented 1 year ago

The positions where maf is lower than --min_maf are excluded, but there is no filter on haplot freqs because these are calculated pairwise between all pairs of SNPs.

And you do have the haplot freqs in the extended output (hap_maf1 and hap_maf2).