This repository hosts the software package for BalLeRMix+, an extension of BalLeRMix that can jointly detect recent positive selection and long-term balancing selection.
To cite this software: Xiaoheng Cheng, Michael DeGiorgio (2021) BalLeRMix+: Mixture model approaches for robust joint identification of both positive selection and long-term balancing selection. Bioinformatics btab720
In BalLeRMix+, we introduce the optional --findPos
and --findBal
, as well as --fixAlpha <abeta>
, arguments to help specify the disperson parameter in beta-binomial distributions, with a value between 0 and 1 indicating positive selection, and a value larger than 1 indicating balancing selection. Based on feedback from BalLeRMix users, we also added a --minCount
argument to indicate the smallest number of allele counts in the input data, in case the user chooses to remove rare variants from the data.
To help visualize the output, we also added an R script in the test/
folder in case you need.
Note that BalLeRMix+ does not consider multi-allelic balancing selection as does BalLeRMix_v2.2.
usage: BalLeRMix+_v1.py [-h] -i INFILE [-o OUTFILE] --spect SPECTFILE [--minCount MINCOUNT] [--getSpect] [--getConfig]
[--noFreq] [--MAF] [--findBal] [--findPos] [--usePhysPos] [--rec RRATE]
[--fixWinSize] [-w W] [--noCenter] [-s STEP] [--fixX X] [--fixAlpha ABETA] [--rangeA SEQA] [--listA LISTA]
You can use python BalLeRMix+_v1.py -h
to see a more detailed help page.
To install BalLeRMix+
, simply navigate to your working directory and clone this repository:
git clone https://github.com/bioXiaoheng/BallerMixPlus.git
The current release supports Python version 3.6 and above, and needs to have numpy (>=1.19.1) and scipy (>=1.5.0) installed.
To run the program, the user can build commands based on python BalLeRMix+_v1.py -i <input file> [other arguments]
, with no particular order required for each argument. In particular, useful argument combinations are
B variant | Input data type | To generate helper file | To run scan |
---|---|---|---|
B2 | Derived allele frequency (DAF) plus substitutions (x==n) | --getSpect --spect <spect file> |
-o <output file> |
B2,MAF | Minor allele frequency (MAF) plus substitutions (x==0) | --getSpect --MAF``--spect <spect file> |
--MAF -o <output file> |
B1 | Polymorphism (0<x<n) or substitution (x==n) calls | --getConfig --spect <config file> |
--noFreq -o <output file> |
B0 | Derived allele frequency (DAF) | --getSpect --noSub --spect <spect file> |
--noSub -o <output file> |
B0,MAF | Minor allele frequency (MAF) | --getSpect --noSub --MAF --spect <spect file> |
--noSub --MAF -o <output file> |
Note that when generating helper files, the input must be concatenated from all sequences in the genome. One command-line option to concatenate them is (suppose your input files are named input_chr<n>.txt
):
cat input_chr*.txt | awk '(NR == 1 || $1 != "position")' > Concatenated_input_for_helper_file.txt
There are two examples in test/
folder, including their input files as well as some helper files. Both examples are simulated using SLiM3.3 (Haller & Messer 2019), and their allele frequencies are polarized by the sequence from an outgroup that diverged from the main population 5 million years ago (similar to the human-chimpanzee divergence). Example 1 is an 100kb sequence evolving under the CEU demography (inferred by SMC++; Terhorst et al. 2017), carrying a de novo beneficial mutation that occurred 10,000 generations ago, with a selective advantage of s=0.01. Example 2 is a 50kb sequence carrying an ancient locus under over-dominance balancing selection (s=0.001, dominance h=20) occuring 5 million years before the outgroup diverges.
Note that the sequence in Example 2 actually evolved under a constant population size and, in reality, should not use the same set of helper files as does Example 1. However, because 1) footprints of ancient balancing selection are in general not sensitive to recent population size changes, and 2) the two examples here are only for demonstration purpose, we will use the set of helper files for Example 1 here throughout for convenience.
To obtain helper files from the concatenated input (test/HC_CEU_Neut_Concatenated-DAF.txt
, concatenated from all the replicates evoling neutrally under the same demographic history as Example 1), you can run:
The resulting files should be the same as test/HC_CEU_Neut_config_for_B1.txt
, test/HC_CEU_Neut_DAF_spect_for_B2.txt
, test/HC_CEU_Neut_MAF_spect_for_B2maf.txt
, test/HC_CEU_Neut_DAF-noSub_spect_for_B0.txt
, and test/HC_CEU_Neut_MAF-noSub_spect_for_B0maf.txt
, respectively. Past BalLeRMix
users may need to be aware that the output's header of BalLeRMix+
is named slightly differently from before.
You will find that the output should be close, if not identical, to the files test/output/Example1_B1.txt
, test/output/Example1_B2.txt
, and test/output/Example1_B2maf.txt
, respectively. Example1 is simulated to have undergone a selective sweep of s=0.01. Because B0 and B0,MAF have little power for positive selection, we do not recommend the user to use them for inferring positive selection.
To visualize the B1, B2, and B2,MAF scores, you can try:
Rscript test/plotScores.r <output from BalLeRMix+_v1.py> <image name.png>
With the above command, you should be able to generate images resembling test/ScorePlot_example1_B2.png
. Note that this R script requires the ggplot2
package and will output images in PNG format.
The same operations can be repeated on example 2, and you can check your output with test/output/Example2_B1.txt
, test/output/Example2_B2.txt
, and test/output/Example2_B2maf.txt
accordingly.
By default, the BalLeRMix+_v1.py
script uses all informative sites provided in the input. With additional arguments, users can customize the size and centers of their sliding windows for the analyses. Details on these areguments are listed in section 4 As an example, one can run
# compute B0maf (--noSub --MAF) while:
## using physical positions as coordinates (--usePhysPos)
## scan with sliding windows of a fixed physical size (--fixWinSize) of 1000 nt (-w 1000)
## centered on every other (--step 2) informative sites
python BalLeRMix+_v1.py --noSub --MAF --spect test/HC_CEU_Neut_MAF-nosub_spect_for_B0maf.txt -i test/Example2_balancing_10MYA_MAF_nosub.txt -o testout_B0maf_1kb-500b.txt --usePhysPos --fixWinSize -w 1000 --step 2
The output should be close, if not identical, to test/output/Example2_B0maf_1kb-2sites.txt
.
All arguments besides the aforementioned ones are for customizing the scan. Their usage is largely based on the previous BalLeRMix
(find out more in its detailed manual here ). Note that these arguments are not necessary for the scan to complete.