Closed deboraycb closed 1 year ago
Hi Debora,
Thanks for using the software. I noticed that in your input, all the genetic positions (second column from the left) are 0.0. Can you try running your command again with --physPos
tag?
The script uses genetic positions (in cM) by default so everyone having the same position might have confused it. By using the --physPos
tag, you'd instruct it to use physical positions instead. That said, it'd still assume a 1e-6 nt/cM recombination rate to convert physical distance to mapping distance. If your organism-of-interest has a different average recombination rate, it'd be better if you feed this value (in unit of "nt/cM" ofc) to the program through --rec
flag.
Let me know how it goes!
Cheers, Xiaoheng
Hi Xiaoheng,
Thanks for your quick response! It does work with --physPos
. However, since I do have a recombination map, I would like to use that information. The second column of my input file begins with lots of 0.0 because in the recombination map for my species (Drosophila melanogaster) the recombination rate in that part of the chromosome is truly zero. Around physical position 700kb there are larger recombination rates. How should I calculate genetic position for those regions of zero recombination?
I also tried defining the genetic position for those lines as 1.0 instead of 0.0, but got the same error.
Cheers! Debora
Hi Debora,
Good to know --physPos
works! At least now we know the issue is indeed that different SNPs have the same position.
I noticed that for your input file of 2,905,170 SNPs, there are only 59 unique values for genetic positions. May I ask how you parsed the recombination map?
I have only worked with humans so far so it might not apply to Drosophila, but my approach was to assume uniform rates within each gap between checkpoints where we have mapping data. For example, for SNPs A and B whose genetic positions are mapped to be $L_1$ and $L_2$, then for SNP C between them (with physical distance $d_A$ and $d_B$. Note that $d_A + d_B$ and $L_2 - L_1$ describe the same gap but in different units), its genetic position could be imputed as $L_3 = L_1 + \frac{d_A }{ d_A + d_B}(L_2 - L_1) $.
A side note, maybe you already know this, but here are some parsing scripts I posted for a spin-off project for reference: https://github.com/bioXiaoheng/BallerMixPlus/tree/main/parsing_scripts.
Hope that helps!
Xiaoheng
Hi Xiaoheng,
It looks like the recombination map that I have from Drosophila is in a different format. The first few lines of the file looks like this:
1 0
100001 0
200001 0
300001 0
400001 0
500001 0
600001 0
700001 0.108814139
First column is physical position, second column is recombination rate (in cM/Mb). Recombination rate was estimated for windows of 100kb.
I first converted these to cM/bp, then I assign the first genetic position in my data to be 1. After that, I go to each position in my ballermix file, and find the corresponting recombination rate bin for that position in the file above. Then, I basically calculate the genetic position as:
genpos = prevgenpos + (pos-prevpos)*rr
where prevgenpos
is the genetic position of the previous variant in the file, pos
is the physical position of the current variant, prevpos
is the physical position of the previous variant and rr
is the recombination rate. (I also cover the possibility that the previous genetic position falls in a different recombination rate bin, but I don't think this is relevant here).
After your first reply, I realized there were other problems with the input file I sent you. I fixed those and I also tried assuming a tiny recombination rate (1e-12) whenever the recombination rate was zero, just so that I got slightly different genetic positions for each site. But I got the same error with this new input file. It is here, if you want to have a look: https://www.dropbox.com/s/77h08q8ao572q2d/Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec.v2?dl=0
This is how I generated the genetic positions in this file:
if rr==0:
rr=1e-12
# if previous position was in same recombination rate 100kb window as this one
if prevpos >= rrstart and prevpos <= rrend:
genpos = prevgenpos + (pos-prevpos)*rr
# if previous position was in a previous recombination rate 100kb window:
elif prevpos < rrstart:
genpos = prevgenpos + (rrstart - prevpos)*prevrr + (pos - rrstart)*rr
Thank you so much for your help!
Hi Debora,
Thanks for your prompt follow-up! I think I've found the bug: If you change line 470 in the script into Tmax = [-100, (1.,0.), 0.]
, the codes should work now. I'll push this fix to the repo soon along with other minor changes. Thank you for helping me catch this bug!
Cheers, Xiaoheng
Hi Xiaoheng,
Thanks for your quick response! It does work with
--physPos
. However, since I do have a recombination map, I would like to use that information. The second column of my input file begins with lots of 0.0 because in the recombination map for my species (Drosophila melanogaster) the recombination rate in that part of the chromosome is truly zero. Around physical position 700kb there are larger recombination rates. How should I calculate genetic position for those regions of zero recombination? I also tried defining the genetic position for those lines as 1.0 instead of 0.0, but got the same error.Cheers! Debora
Oh I was just reminded of another tangential issue; have you masked your input with mappability and repeatMasker-like filters? If you have GT calls for each individual, it's also a good idea to have a one-tail Fisher test for HWE and remove outliers for excess heterozygotes (e.g. when paralogs got mapped to the same region).
Hi Xiaoheng,
Thanks for the super quick response and fix! It looks like it's working now if I assign a very low recombination rate whenever recombination rate in my map is zero. This avoids having identical genetic positions for several sites.
However, if I don't do this, all the lines in the output are:
genPos LR xhat Ahat numSites
1.0 -100 1,0 0 0
I'm not sure how common it would be for humans or other organisms to have recombination maps with regions of zero recombination, but if it is common, I think it would be helpful to have a warning or error message about this.
Regarding mappability, repeats and heterozygosity filters, I think those are all directly or indirectly applied to my data already, because these are isogenic lines (there are no heterozygous sites), and inversions and other complicated regions of the Drosophila genome were excluded as well. But thanks for this tip anyway!
Best, Debora
Hi Xiaoheng,
Thanks for the super quick response and fix! It looks like it's working now if I assign a very low recombination rate whenever recombination rate in my map is zero. This avoids having identical genetic positions for several sites.
However, if I don't do this, all the lines in the output are:
genPos LR xhat Ahat numSites 1.0 -100 1,0 0 0
I'm not sure how common it would be for humans or other organisms to have recombination maps with regions of zero recombination, but if it is common, I think it would be helpful to have a warning or error message about this.
Regarding mappability, repeats and heterozygosity filters, I think those are all directly or indirectly applied to my data already, because these are isogenic lines (there are no heterozygous sites), and inversions and other complicated regions of the Drosophila genome were excluded as well. But thanks for this tip anyway!
Best, Debora
The output you showed likely occurred because too many sites had to be included in a window (which is the case when they have the same position), leading to overly large (in absolute value) composite LRs (i.e. even lower than -100) across the entire parameter space. One could potentially remove such output lines by expanding the parameter space for A
and/or lowering the default LR value (here it's -100). In theory, the lowest value should always be 0 (when parameter space allows) as our null model is nested within the alternative model.
As for genetic regions with zero recombination, I don't think they are common---one would think genetic regions of super tight LDs would be quite noticeable & notable. Here, I believe what is more often the case is the lack of data on those regions, and the absence of evidence is not evidence for absence. That said, I don't work with empirical data as much so I don't have much say on this. That said, it's certainly a good idea to inform the users when this happens. Thanks for the suggestion!
Hello,
Thanks for the great method and software! I have been getting this error message when I try to run B2_maf (full output copied at the bottom of this message)
I noticed there was a try/except commented out in that part of the code:
and when I uncomment those lines, the Tmax that is printed out is [-100,0,0] (in case that helps). Any hints about what could be going wrong?
This is my command line:
python BalLeRMix/software/BalLeRMix_v2.4.py -i Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec -o Zambia_imiss0.4_lmiss0.2_chr3R.B2 --spect Zambia_imiss0.4_lmiss0.2_chr3R.B2spect_MAF --MAF
and here is my input file: https://www.dropbox.com/s/lgclhkilpc13sgg/Zambia_imiss0.4_lmiss0.2_chr3R.balmixderrec?dl=0 and the spec file: https://www.dropbox.com/s/7j6x9bqna0guncq/Zambia_imiss0.4_lmiss0.2_chr3R.B2spect_MAF?dl=0Best, Débora
Here is the full output: