bioXiaoheng / BalLeRMix

Software package for BalLeRMix and scripts used in the study "Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection" (Cheng & DeGiorgio 2020)
4 stars 1 forks source link

Error in scan_alpha Tmax object (TypeError: 'int' object is not iterable) #9

Closed deboraycb closed 1 year ago

deboraycb commented 1 year ago

Hello,

Thanks for the great method and software! I have been getting this error message when I try to run B2_maf (full output copied at the bottom of this message)

File "BalLeRMix/software/BalLeRMix_v2.4.py", line 653, in scan_alpha
    phys_pos[i], pos_list[i], Tmax[0], ','.join(['%g' % (_) for _ in Tmax[1]]), Tmax[2], optWinSize))
TypeError: 'int' object is not iterable

I noticed there was a try/except commented out in that part of the code:

651             # try:
  1             scores.write('%s\t%s\t%s\t%s\t%g\t%d\n' % (
  2                 phys_pos[i], pos_list[i], Tmax[0], ','.join(['%g' % (_) for _ in Tmax[1]]), Tmax[2], optWinSize))
  3             # except:
  4             #    print(Tmax)
  5             #    sys.exit()

and when I uncomment those lines, the Tmax that is printed out is [-100,0,0] (in case that helps). Any hints about what could be going wrong?

This is my command line: python BalLeRMix/software/BalLeRMix_v2.4.py -i Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec -o Zambia_imiss0.4_lmiss0.2_chr3R.B2 --spect Zambia_imiss0.4_lmiss0.2_chr3R.B2spect_MAF --MAF and here is my input file: https://www.dropbox.com/s/lgclhkilpc13sgg/Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec?dl=0 and the spec file: https://www.dropbox.com/s/7j6x9bqna0guncq/Zambia_imiss0.4_lmiss0.2_chr3R.B2spect_MAF?dl=0

Best, Débora


Here is the full output:

Optimizing over x= (0.5,0.5); (0.55,0.45); (0.6,0.4); (0.65,0.35); (0.7,0.3); (0.75,0.25); (0.8,0.2); (0.85,0.15); (0.9,0.1); (0.95,0.05).                                                                         

Optimizing over A= 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000, 10000, 1000000.0, 1000000000.0.  

2023-06-22 16:06:02.814370. Initializing...                                                                                                                                                                        
Total prob is 1.000000 in the spectrum file.                                                                                                                                                                       
Sample size: 196                                                                                                                                                                                                   
Using minor allele frequencies...                                                                                                                                                                                  

2023-06-22 16:06:14.217754. Reading input file: ../data/Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec                                                                                                                

2023-06-22 16:06:19.217762. Start computing likelihood ratios...                                                                                                                                                   
Computing LR on every 1 site/s, using all the data with alpha >= 1e-8.                                                                                                                                             
writing output to ../results/ballermix/Zambia_imiss0.4_lmiss0.2_chr3R.B2                                                                                                                                           
Traceback (most recent call last):                                                                                                                                                                                 
  File "/home/debora/Dropbox/private/projects/uclpostdoc/balsel_scan/scripts/BalLeRMix/software/BalLeRMix_v2.4.py", line 855, in <module>                                                                          
    main()                                                                                                                                                                                                         
  File "/home/debora/Dropbox/private/projects/uclpostdoc/balsel_scan/scripts/BalLeRMix/software/BalLeRMix_v2.4.py", line 847, in main
    scan(xGrid, AGrid, AdGrid, aGrid, opt.outfile, phys_pos, pos_list, Ks, Ns, numSites, Spec, logSpec, Fx, N, opt.size,
  File "/home/debora/Dropbox/private/projects/uclpostdoc/balsel_scan/scripts/BalLeRMix/software/BalLeRMix_v2.4.py", line 695, in scan
    scan_alpha(xGrid, AGrid, AdGrid, aGrid, outfile, phys_pos, pos_list, Ks, Ns, numSites, Spec, logSpec, Fx, N, s,
  File "/home/debora/Dropbox/private/projects/uclpostdoc/balsel_scan/scripts/BalLeRMix/software/BalLeRMix_v2.4.py", line 653, in scan_alpha
    phys_pos[i], pos_list[i], Tmax[0], ','.join(['%g' % (_) for _ in Tmax[1]]), Tmax[2], optWinSize))
TypeError: 'int' object is not iterable
bioXiaoheng commented 1 year ago

Hi Debora,

Thanks for using the software. I noticed that in your input, all the genetic positions (second column from the left) are 0.0. Can you try running your command again with --physPos tag?

The script uses genetic positions (in cM) by default so everyone having the same position might have confused it. By using the --physPos tag, you'd instruct it to use physical positions instead. That said, it'd still assume a 1e-6 nt/cM recombination rate to convert physical distance to mapping distance. If your organism-of-interest has a different average recombination rate, it'd be better if you feed this value (in unit of "nt/cM" ofc) to the program through --rec flag.

Let me know how it goes!

Cheers, Xiaoheng

deboraycb commented 1 year ago

Hi Xiaoheng,

Thanks for your quick response! It does work with --physPos. However, since I do have a recombination map, I would like to use that information. The second column of my input file begins with lots of 0.0 because in the recombination map for my species (Drosophila melanogaster) the recombination rate in that part of the chromosome is truly zero. Around physical position 700kb there are larger recombination rates. How should I calculate genetic position for those regions of zero recombination? I also tried defining the genetic position for those lines as 1.0 instead of 0.0, but got the same error.

Cheers! Debora

bioXiaoheng commented 1 year ago

Hi Debora,

Good to know --physPos works! At least now we know the issue is indeed that different SNPs have the same position.

I noticed that for your input file of 2,905,170 SNPs, there are only 59 unique values for genetic positions. May I ask how you parsed the recombination map?

I have only worked with humans so far so it might not apply to Drosophila, but my approach was to assume uniform rates within each gap between checkpoints where we have mapping data. For example, for SNPs A and B whose genetic positions are mapped to be $L_1$ and $L_2$, then for SNP C between them (with physical distance $d_A$ and $d_B$. Note that $d_A + d_B$ and $L_2 - L_1$ describe the same gap but in different units), its genetic position could be imputed as $L_3 = L_1 + \frac{d_A }{ d_A + d_B}(L_2 - L_1) $.

A side note, maybe you already know this, but here are some parsing scripts I posted for a spin-off project for reference: https://github.com/bioXiaoheng/BallerMixPlus/tree/main/parsing_scripts.

Hope that helps!

Xiaoheng

deboraycb commented 1 year ago

Hi Xiaoheng,

It looks like the recombination map that I have from Drosophila is in a different format. The first few lines of the file looks like this:

1   0
100001  0
200001  0
300001  0
400001  0
500001  0
600001  0
700001  0.108814139

First column is physical position, second column is recombination rate (in cM/Mb). Recombination rate was estimated for windows of 100kb.

I first converted these to cM/bp, then I assign the first genetic position in my data to be 1. After that, I go to each position in my ballermix file, and find the corresponting recombination rate bin for that position in the file above. Then, I basically calculate the genetic position as: genpos = prevgenpos + (pos-prevpos)*rr where prevgenpos is the genetic position of the previous variant in the file, pos is the physical position of the current variant, prevpos is the physical position of the previous variant and rr is the recombination rate. (I also cover the possibility that the previous genetic position falls in a different recombination rate bin, but I don't think this is relevant here).

After your first reply, I realized there were other problems with the input file I sent you. I fixed those and I also tried assuming a tiny recombination rate (1e-12) whenever the recombination rate was zero, just so that I got slightly different genetic positions for each site. But I got the same error with this new input file. It is here, if you want to have a look: https://www.dropbox.com/s/77h08q8ao572q2d/Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec.v2?dl=0

This is how I generated the genetic positions in this file:

if rr==0:                                                                                                                                                                                     
    rr=1e-12
# if previous position was in same recombination rate 100kb window as this one
if prevpos >= rrstart and prevpos <= rrend:
    genpos = prevgenpos + (pos-prevpos)*rr
# if previous position was in a previous recombination rate 100kb window:
elif prevpos < rrstart:
    genpos = prevgenpos + (rrstart - prevpos)*prevrr + (pos - rrstart)*rr

Thank you so much for your help!

bioXiaoheng commented 1 year ago

Hi Debora,

Thanks for your prompt follow-up! I think I've found the bug: If you change line 470 in the script into Tmax = [-100, (1.,0.), 0.], the codes should work now. I'll push this fix to the repo soon along with other minor changes. Thank you for helping me catch this bug!

Cheers, Xiaoheng

bioXiaoheng commented 1 year ago

Hi Xiaoheng,

Thanks for your quick response! It does work with --physPos. However, since I do have a recombination map, I would like to use that information. The second column of my input file begins with lots of 0.0 because in the recombination map for my species (Drosophila melanogaster) the recombination rate in that part of the chromosome is truly zero. Around physical position 700kb there are larger recombination rates. How should I calculate genetic position for those regions of zero recombination? I also tried defining the genetic position for those lines as 1.0 instead of 0.0, but got the same error.

Cheers! Debora

Oh I was just reminded of another tangential issue; have you masked your input with mappability and repeatMasker-like filters? If you have GT calls for each individual, it's also a good idea to have a one-tail Fisher test for HWE and remove outliers for excess heterozygotes (e.g. when paralogs got mapped to the same region).

deboraycb commented 1 year ago

Hi Xiaoheng,

Thanks for the super quick response and fix! It looks like it's working now if I assign a very low recombination rate whenever recombination rate in my map is zero. This avoids having identical genetic positions for several sites.

However, if I don't do this, all the lines in the output are:

genPos  LR      xhat    Ahat    numSites
1.0     -100    1,0     0       0

I'm not sure how common it would be for humans or other organisms to have recombination maps with regions of zero recombination, but if it is common, I think it would be helpful to have a warning or error message about this.

Regarding mappability, repeats and heterozygosity filters, I think those are all directly or indirectly applied to my data already, because these are isogenic lines (there are no heterozygous sites), and inversions and other complicated regions of the Drosophila genome were excluded as well. But thanks for this tip anyway!

Best, Debora

bioXiaoheng commented 1 year ago

Hi Xiaoheng,

Thanks for the super quick response and fix! It looks like it's working now if I assign a very low recombination rate whenever recombination rate in my map is zero. This avoids having identical genetic positions for several sites.

However, if I don't do this, all the lines in the output are:

genPos  LR      xhat    Ahat    numSites
1.0     -100    1,0     0       0

I'm not sure how common it would be for humans or other organisms to have recombination maps with regions of zero recombination, but if it is common, I think it would be helpful to have a warning or error message about this.

Regarding mappability, repeats and heterozygosity filters, I think those are all directly or indirectly applied to my data already, because these are isogenic lines (there are no heterozygous sites), and inversions and other complicated regions of the Drosophila genome were excluded as well. But thanks for this tip anyway!

Best, Debora

The output you showed likely occurred because too many sites had to be included in a window (which is the case when they have the same position), leading to overly large (in absolute value) composite LRs (i.e. even lower than -100) across the entire parameter space. One could potentially remove such output lines by expanding the parameter space for A and/or lowering the default LR value (here it's -100). In theory, the lowest value should always be 0 (when parameter space allows) as our null model is nested within the alternative model.

As for genetic regions with zero recombination, I don't think they are common---one would think genetic regions of super tight LDs would be quite noticeable & notable. Here, I believe what is more often the case is the lack of data on those regions, and the absence of evidence is not evidence for absence. That said, I don't work with empirical data as much so I don't have much say on this. That said, it's certainly a good idea to inform the users when this happens. Thanks for the suggestion!