This repository hosts the software package for BalLeRMix and scripts used in the study "Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection" (Cheng & DeGiorgio 2020).
BalLeRMix/software/
BalLeRMix/Simulation_scripts/
BalLeRMix/Empirical_analysis/
Please cite the following manuscript if using this software:
Xiaoheng Cheng, Michael DeGiorgio (2020) Flexible mixture model approaches that accommodate footprint size variability for robust detection of balancing selection. Molecular Biology and Evolution, 37(11): 3267--3291
In BalLeRMix v2, we introduce the -m <m> argument to customize the presumed number of alleles being balanced at the selected sites, in case you want to look for multi-allelic balancing selection. The default value is 2.
2020.6.22-Update: Updated the model for multi-allelic balancing selection in v2.2.
2020.2.5-Update: Fixed a minor bug in the initialization module.
usage: BalLeRMix.py [-h] -i INFILE --spect SPECTFILE [-o OUTFILE] [-m M]
[--getSpect] [--getConfig] [--nofreq] [--nosub] [--MAF]
[--physPos] [--rec RRATE] [--fixSize] [-w R]
[--noCenter] [-s STEP] [--fixX X] [--rangeA SEQA]
[--listA LISTA]
Run python BalLeRMix.py -h to see the more detailed help page.
For B0 and B2 statistics, the user should first generate the tab-delimited site frequency spectrum file, without header, e.g.:
<k>    <sample size n>    <proportion in the genome>
1      50                 0.03572
2      50                 0.02024
...
or the configuration file with polymorphism/substitution ratio, without header, e.g.:
<sample size n>    <% of substitutions>    <% of polymorphisms>
50                 0.7346                  0.2654
Each input file should have four columns: physical position, genetic position, number of derived (or minor) alleles observed, and total number of alleles observed (i.e., sample size). The file should be tab-delimited and should have a header, e.g.:
physPos    genPos      x     n
16         0.000016    50    50
35         0.000035    12    50
...
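As a rough illustration of the input format described above, the following sketch parses a tab-delimited table with a header line and four columns, and checks that each allele count lies between 0 and the sample size. The function name `read_input` is hypothetical and not part of BalLeRMix; treating physical positions as integers is an assumption based on the example rows.

```python
import csv
import io


def read_input(handle):
    """Parse a tab-delimited input table into (physPos, genPos, x, n) tuples.

    Hypothetical helper for illustration only; not part of BalLeRMix.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line (its column names are not checked)
    sites = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(row)}")
        # Assumption: physical positions are integers, as in the example rows.
        phys, gen, x, n = int(row[0]), float(row[1]), int(row[2]), int(row[3])
        if not 0 <= x <= n:
            raise ValueError(f"line {line_no}: allele count {x} outside 0..{n}")
        sites.append((phys, gen, x, n))
    return sites


# The two example rows shown above, written with real tab characters.
example = "physPos\tgenPos\tx\tn\n16\t0.000016\t50\t50\n35\t0.000035\t12\t50\n"
sites = read_input(io.StringIO(example))
print(sites)
```

To check a file on disk, open it and pass the file handle to the function in place of the `io.StringIO` object.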
To perform B2 scans on your input data, use
python BalLeRMix.py -i <input> --spect <derived allele frequency spectrum> -o <output>
To perform B2,MAF scans on your input data, use
python BalLeRMix.py -i <input> --spect <minor allele frequency spectrum> -o <output> --MAF
To perform B1 scans on your input data, use
python BalLeRMix.py -i <input> --spect <sub/poly configuration file> -o <output> --nofreq
To perform B0 scans on your input data, use
python BalLeRMix.py -i <input> --spect <derived allele frequency spectrum> -o <output> --nosub
To perform B0,MAF scans on your input data, use
python BalLeRMix.py -i <input> --spect <minor allele frequency spectrum> -o <output> --nosub --MAF
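The five scan commands differ only in their trailing flags, which can be summarized in a small lookup table. The sketch below is a hypothetical convenience wrapper, not part of BalLeRMix. It assumes the spectrum or configuration file is always passed through --spect, as the usage synopsis lists. The statistic labels and file names are made up for illustration.

```python
# Flags that select each statistic, following the scan commands listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
STAT_FLAGS = {
    "B2": [],
    "B2maf": ["--MAF"],
    "B1": ["--nofreq"],
    "B0": ["--nosub"],
    "B0maf": ["--nosub", "--MAF"],
}


def scan_command(stat, infile, helper_file, outfile, script="BalLeRMix.py"):
    """Build the argument list for one scan (hypothetical helper).

    Assumption: the spectrum/configuration file is given with --spect,
    as in the usage synopsis.
    """
    if stat not in STAT_FLAGS:
        raise ValueError(f"unknown statistic: {stat}")
    base = ["python", script, "-i", infile, "--spect", helper_file, "-o", outfile]
    return base + STAT_FLAGS[stat]


# File names below are placeholders.
cmd = scan_command("B0maf", "chr1_input.txt", "maf_spect.txt", "chr1_B0maf.txt")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the scan.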
To generate the spectrum file for B2:
python BalLeRMix.py -i <concatenated input> --getSpect --spect <spectrum file name>
To generate the spectrum file for B2,MAF:
python BalLeRMix.py -i <concatenated input> --getSpect --MAF --spect <spectrum file name>
To generate the configuration file for B1:
python BalLeRMix.py -i <concatenated input> --getConfig --spect <config file name>
To generate the spectrum file for B0:
python BalLeRMix.py -i <concatenated input> --getSpect --nosub --spect <spectrum file name>
To generate the spectrum file for B0,MAF:
python BalLeRMix.py -i <concatenated input> --getSpect --nosub --MAF --spect <spectrum file name>
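To make the contents of these helper files more concrete, here is a minimal sketch of how a derived allele frequency spectrum and a substitution/polymorphism configuration could be tallied from (x, n) pairs. It is not the implementation used by --getSpect or --getConfig. It assumes that sites with x equal to n count as substitutions, that sites with 0 < x < n count as polymorphisms, that sites with x = 0 are ignored, and that proportions are normalized within each sample size. The function names are hypothetical.

```python
from collections import Counter


def derived_spectrum(counts):
    """Return {(k, n): proportion} for sites with k >= 1 derived alleles.

    Assumptions (not taken from BalLeRMix itself): sites with x == 0 are
    skipped, and proportions are normalized within each sample size n.
    """
    tally = Counter((x, n) for x, n in counts if x > 0)
    totals = Counter()
    for (_, n), c in tally.items():
        totals[n] += c
    return {(k, n): c / totals[n] for (k, n), c in sorted(tally.items())}


def sub_poly_config(counts):
    """Return {n: (proportion of substitutions, proportion of polymorphisms)}."""
    subs, polys = Counter(), Counter()
    for x, n in counts:
        if x <= 0:
            continue  # uninformative site under the assumptions above
        if x == n:
            subs[n] += 1  # all sampled alleles are derived
        else:
            polys[n] += 1  # 0 < x < n
    config = {}
    for n in sorted(set(subs) | set(polys)):
        total = subs[n] + polys[n]
        config[n] = (subs[n] / total, polys[n] / total)
    return config


# Toy data: two substitutions and two polymorphic sites, all with n = 50.
counts = [(50, 50), (12, 50), (50, 50), (1, 50)]
spectrum = derived_spectrum(counts)
config = sub_poly_config(counts)
print(spectrum)
print(config)
```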
All arguments besides the aforementioned ones are for customizing the scan.
[--physPos] [--rec RRATE]:
BalLeRMix uses genetic distances (in cM) to compute likelihoods. To direct the software to use physical positions instead, use --physPos and indicate the uniform recombination rate (cM/nt) for your species of interest with --rec. The default value is 10^-6 cM/nt.
Physical positions are used automatically if you choose to fix the window size (e.g., 1000 bp, 5 kb), in which case you should make sure the software is correctly informed of the recombination rate. Using physical positions also changes how you define window sizes and step sizes, should you customize the scanning window.
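Under a uniform recombination rate, a genetic position is simply the physical position multiplied by the rate. The sketch below illustrates that arithmetic with the default of 10^-6 cM/nt. The function names are hypothetical, and this is not necessarily how BalLeRMix performs the conversion internally. The example input rows shown earlier (physPos 16 with genPos 0.000016) are consistent with this rate.

```python
DEFAULT_RATE = 1e-6  # cM per nucleotide, the documented default for --rec


def physical_to_genetic(phys_pos, rate=DEFAULT_RATE):
    """Convert a physical position (nt) to a genetic position (cM),
    assuming a uniform recombination rate. Hypothetical helper."""
    return phys_pos * rate


def window_in_cm(window_nt, rate=DEFAULT_RATE):
    """Genetic length (cM) spanned by a fixed physical window (nt)."""
    return window_nt * rate


print(physical_to_genetic(16))  # genetic position of the first example site
print(window_in_cm(1000))       # a 1000 bp window at the default rate
```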
[--fixX X] [--rangeA SEQA] [--listA LISTA]:
These arguments allow you to specify the parameter space that the software optimizes over. The presumed equilibrium frequency is x, and the rate of decay in linkage disequilibrium is A. If you choose to look for multi-allelic balancing selection where more than two alleles are being balanced, x should be a vector of descending equilibrium frequencies, and should match the number of balanced alleles you chose (via -m) to scan for.
[--fixSize] [-w R] [--noCenter] [-s STEP] [--physPos]:
These arguments are for customizing the scanning window. You probably won't need them because BalLeRMix is robust to window sizes. For more details on how these arguments work, check the v1 software manual.