bioXiaoheng / BalLeRMix

Software package for BalLeRMix and scripts used in the study "Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection" (Cheng & DeGiorgio 2020)
4 stars 1 forks source link

Interpreting lines with LR=0, numSites=0 #12

Open deboraycb opened 1 year ago

deboraycb commented 1 year ago

Hi Xiaoheng,

I have quite a few lines on my output that look like this:

physPos  genPos   LR xhat      Ahat  numSites 
10000033 33.39164 0  0.95,0.05 1e+16 0 

I thought this was because of the LR threshold of -100, but I changed it to -1000 and still get the same results. How should I interpret those? Should results from these SNPs be excluded because they indicate some problem in the data or should they count as evidence of no balancing selection at those sites?

Thank you! Debora

deboraycb commented 1 year ago

(sorry, forgot to add something to my previous message)

I also found it quite surprising that the site with LR=0 (which I thought meant LR < -100) is only 102 bp away from a site with LR=3.98, so I don't think I really understand what those LR=0 lines in the output mean. Any help is much appreciated! Thanks!

physPos   genPos           LR      xhat        Ahat numSites 
10000033 33.39164  0.000000000 0.95,0.05 1.00000e+16        0 
10000135 33.39247  3.981366165 0.85,0.15 3.16228e+04       22 
bioXiaoheng commented 1 year ago

Hi Debora,

Nice that you noticed this behavior! In fact, it's entirely expected that the lowest value of (maximized) LR be 0, because it's the (log) ratio of two (composite) likelihoods (from alternative & null hypothesis, respectively) that are both included in the model. In other words, when we search the parameter space to maximize the LR, we know for sure there at least exists one fixed value of LR=0 when the alternative model takes the same parameter as the null. So rest assured that it's nothing abnormal =). You may have noticed negative LRs from a previous version because its parameter space wasn't optimized.

Hope that helps!

Best, Xiaoheng

deboraycb commented 6 months ago

Hi Xiaoheng,

I am coming back to this after a long break. Thank you for your reply!

So, from your response and from looking at the calcBaller function, I understood that those lines with LR=0 and numSites=0 are cases where there was no LR>0 with any of the parameter combinations tested. Therefore, I should just interpret those as reliable sites having no evidence of balancing selection. Is that correct?

However, I was still puzzled by

1) the fact that I have lots of line like this (LR=0,numsites=0) in my output (about 50%), and 2) some cases where a site with LR=0,numsites=0 was surrounded by very close sites with LR>0, such as this (note these sites are separated by only 12bp):

physPos genPos  LR      xhat    Ahat    numSites
9921419 33.27157287885831       17.978688147283947      0.75,0.25       31622.8 142
9921429 33.27158812465831       0       0.95,0.05       1e+16   0
9921431 33.27159117381831       8.315634381664495       0.75,0.25       31622.8 143

It is very possible that I am misunderstanding something about how Ballermix works, but I thought it was weird that LR>0 was found for the first and last sites, but not for the middle one, given that they are so close. Does that surprise you too?

Thanks for all your help! Best, Debora

deboraycb commented 5 months ago

Hi Xiaoheng,

I don't know if you had time to look at my previous question, but I have explored that issue a bit more in the mean time, and I'll include my findings here - although I'm not completely sure how to interpret them.

I have printed out the optimized windows for the sites I mentioned above:

physPos genPos             LR                 xhat      Ahat    numSites winStart winEnd
9921419 33.27157287885831  17.978688147283947 0.75,0.25  31622.8 142     9921036 9921802
9921429 33.27158812465831  0                  0.95,0.05  1e+16   0       -       -
9921431 33.27159117381831  8.315634381664495  0.75,0.25  31622.8 143     9921048 9921819

and I also looked at the input lines for those sites:

physPos genPos          x   n
9921419 33.27157287885831       8       196
9921429 33.27158812465831       144     196
9921431 33.27159117381831       1       196

and I plotted the SFS for the optimized windows for sites ending in 19 and 31: SFS9921419 SFS9921431

I am still puzzled by the fact that the first site (position ending in 19) and third site (position ending in 31) have very much overlapping windows, which also overlap with the middle site (ending in 29), and that Ballermix finds a LR of 17 and 8 for both sites, but does not find any LR>0 for the middle site. Is that behaviour expected? As I said before, I get lots of sites in my results with such LR=0 lines (around 50%), and I just want to make sure these results are valid.

Thanks again! Best, Débora

deboraycb commented 5 months ago

I am attaching here the ballermix input and output files for this example, as well as the command I ran: python3.10 BalLeRMix/software/BalLeRMix_v2.5.py -i chr2L_10kbaround9921kb.balmixder.minrec.txt -o chr2L_10kbaround9921kb_w1000.B2.txt --spect Zambia_concat.B2spect_DAF.txt -w 1000

Zambia_concat.B2spect_DAF.txt chr2L_10kbaround9921kb.balmixder.minrec.txt chr2L_10kbaround9921kb_w1000.B2.txt