conda install -c conda-forge gimble
gIMble workflow. `gimbleprep` (0) assures input data conforms to requirements; `parse` (1) reads data into a gIMble store, the central data structure that holds all subsequent analysis. The modules `blocks` (2) and `windows` (3) partition the data, which is summarised as a tally (4) of blockwise mutation configurations (bSFSs), either across all pair-blocks (blocks tally) or for pair-blocks in windows (windows tally). Tallies may be used either in a bounded search of parameter space via the module `optimize` (5) or to evaluate likelihoods over a parameter grid (precomputed using `makegrid` (6)) via the `gridsearch` (7) module. The `simulate` (8) module allows coalescent simulation of tallies (simulate tally) based on inferred parameter estimates (either global estimates or gridsearch results of window-wise data). Simulated data can be analysed to quantify the uncertainty (and potential bias) of parameter estimates. The results held within a gIMble store can be described, written to column-based output files, or removed using the modules `info` (9), `query` (10), and `delete` (11).
usage: gimble <module> [<args>...] [-V -h]
[Input]
preprocess Install gimbleprep instead
parse Parse files into GimbleStore
blocks Generate blocks from parsed data in GimbleStore (requires 'parse')
windows Generate windows from blocks in GimbleStore (requires 'blocks')
tally Tally variation for inference (requires 'blocks' or 'windows')
[Simulation]
simulate Simulate data based on specific parameters or gridsearch results
[Inference]
optimize Perform global parameter optimisation on tally/simulation
makegrid Precalculate grid of parameters
gridsearch Evaluate tally/simulation against a precomputed grid (requires 'makegrid')
[Info]
info Print metrics about data in GimbleStore
list List information saved in GimbleStore
query Extract information from GimbleStore
delete Delete information in GimbleStore
[Experimental]
partitioncds Partition CDS sites in BED file by degeneracy in sample GTs
[Options]
-h, --help Show this screen
-V, --version Show version
Note: The `preprocess` module has been replaced by `gimbleprep`. Everything else is identical.
The `preprocess` module assures that input files are adequately filtered and processed so that the gimble workflow can be completed successfully. While this processing of input files could be done more efficiently by other means, it has the advantage of generating a VCF file that complies with gimble data requirements but which can also be used in alternative downstream analyses.
conda install -c bioconda gimbleprep
gimbleprep -f FASTA -b BAM_DIR/ -v RAW.vcf.gz -k
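The raw FreeBayes VCF must be supplied compressed and indexed (see `-v` below). A minimal preparation sketch, assuming `bgzip` and `tabix` from htslib and a hypothetical raw callset `RAW.vcf`:

```bash
# compress the raw FreeBayes callset and index it for random access
bgzip RAW.vcf            # writes RAW.vcf.gz
tabix -p vcf RAW.vcf.gz  # writes RAW.vcf.gz.tbi
```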
Based on the supplied input files:

- `-f`: FASTA of the reference
- `-b`: directory of BAM files, composed of readgroup-labelled reads mapped to the reference
- `-v`: compressed and indexed FreeBayes VCF file

the module produces the following output files:
Note: after running, the output files require manual user input (see "Manually modify files").

During VCF processing, the following variant sets are defined:
- {RAW_VARIANTS} := all variants in VCF
- {NONSNP} := non-SNP variants
- {SNPGAP} := all variants within +/- X b of {NONSNP} variants
- {QUAL} := all variants with QUAL below --min_qual
- {BALANCE} := all variants with any un-balanced allele observation (-e 'RPL<1 | RPR<1 | SAF<1 | SAR<1')
- {FAIL} := {{NONSNP} U {SNPGAP} U {QUAL} U {BALANCE}}
- {VARIANTS} := {RAW_VARIANTS} - {FAIL}
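These sets are assembled internally by `gimbleprep`. Purely as an illustration (not the exact commands `gimbleprep` runs; the SnpGap distance and QUAL cutoff below are placeholders), roughly comparable filters can be expressed with `bcftools`:

```bash
# drop variants near non-SNP sites (SnpGap), low-QUAL calls and unbalanced
# allele observations, then keep SNPs only -- analogous to removing {FAIL}
bcftools filter --SnpGap 2 \
    -e 'QUAL<1 | RPL<1 | RPR<1 | SAF<1 | SAR<1' \
    RAW.vcf.gz \
  | bcftools view -v snps -O z -o variants.filtered.vcf.gz -
```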
The processed VCF file
- only contains variants from set `{VARIANTS}`
- only contains sample genotypes with sample read depths within coverage thresholds (others are set to missing, i.e. `./.`)
- {CALLABLE_SITES} := for each BAM, regions along which sample read depths lie within coverage thresholds.
- {CALLABLES} := bedtools multiintersect of all {CALLABLE_SITES} regions across all BAMs/samples
- {FAIL_SITES} := sites excluded due to {FAIL} variant set (during VCF processing)
- {SITES} := {CALLABLES} - {FAIL_SITES}
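The callable-site set operations can be mimicked with `bedtools` (a sketch with hypothetical per-sample callable BED files and a BED of failed sites; `gimbleprep` computes these internally):

```bash
# {CALLABLES}: intervals callable in all samples (column 4 of bedtools
# multiinter holds the number of input files covering each interval)
bedtools multiinter -i sampleA.callable.bed sampleB.callable.bed sampleC.callable.bed \
  | awk '$4 == 3' > callables.bed

# {SITES} = {CALLABLES} - {FAIL_SITES}
bedtools subtract -a callables.bed -b fail_sites.bed > sites.bed
```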
The resulting BED file contains the regions in set `{SITES}`.

Together with the processed VCF (`gimble.vcf.gz`), the files `gimble.genomefile`, `gimble.samples.csv`, and `gimble.bed` are the input for `gimble parse`. The BED file can first be intersected with an annotation of interest (e.g. intergenic regions):
bedtools intersect -a gimble.bed -b my_intergenic_regions.bed > gimble.intergenic.bed
gimble parse -v gimble.vcf.gz -b gimble.intergenic.bed -g gimble.genomefile -s gimble.samples.csv -z analysis
Blocks are defined by the parameters `--block_length` (number of callable sites in each block) and `--block_span` (maximum distance between the first and last site in a block).

gimble blocks -z analysis.z -l 64
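To gauge how many blocks a chosen `--block_length` can yield at most, one can sum the interval lengths in the callable BED file (a sketch assuming standard BED start/end columns):

```bash
# total callable sites and an upper bound on the number of 64 b blocks
awk '{sum += $3 - $2} END {print sum, int(sum / 64)}' gimble.intergenic.bed
```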
The parameter `--blocks` controls how many blocks are incorporated into each window, and the parameter `--steps` by how many blocks the next window is shifted. `--blocks` should be chosen so that, given the number of interspecific pairs, enough blocks from each pair can be placed in a window.

gimble windows -z analysis.z -w 500 -s 100
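With `-w 500` and `-s 100`, each window holds 500 consecutive blocks and the next window is shifted by 100 blocks. A back-of-the-envelope count of windows on a single sequence, assuming a hypothetical number of blocks and ignoring edge effects:

```bash
BLOCKS_ON_SEQ=25000  # hypothetical number of blocks on one sequence
echo $(( (BLOCKS_ON_SEQ - 500) / 100 + 1 ))  # -> 246 windows
```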
At any point, `gimble info` prints metrics about the data and results held in the GimbleStore:

gimble info -z analysis.z
Tallies are counts of blockwise mutation configurations, i.e. vectors $\underline{k}_i$, which count the four possible mutation types found within a pair-block $i$.

gimble tally -z analysis.z -k 2,2,2,2 -l blocks_kmax2 -t blocks
gimble tally -z analysis.z -k 2,2,2,2 -l windows_kmax2 -t windows
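With `-k 2,2,2,2`, counts above 2 for any mutation type are binned as marginals, so each of the four mutation types is tallied as 0, 1, 2 or >2. A quick check of the size of the resulting configuration space:

```bash
# 4 tallied classes per mutation type, 4 mutation types
echo $(( 4 ** 4 ))  # -> up to 256 distinct bSFS configurations
```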
Bounded optimisation of model parameters against a tally is performed with `optimize`:

gimble optimize -z analysis.z -l IM_BA_optimize -d tally/windows_kmax2 \
-w -m IM_BA -r A -u 2.9e-09 -A 10_000,2_500_000 -B 10_000,1_000_000 \
-C 1_000_000,5_000_000 -M 0,1e-5 -T 0,5_000_000 -g CRS2 -e 19 -i 10_000
Alternatively, likelihoods can be evaluated over a precomputed grid of parameter combinations, either to obtain a global model fit (as with `optimize`) or to fit models to window-wise data. The grid is precomputed with `makegrid`:
gimble makegrid -z analysis.z -m IM_AB \
-b 64 -r A -u 2.9e-9 -k 2,2,2,2 \
-A=200_000,3_000_000,12,lin \
-B=100_000,2_000_000,12,lin \
-C 100_000,2_000_000,12,lin \
-T 4_256_034 -M 0,2.21E-06,16,lin \
-e 19 -l IM_BA_grid
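Grid sizes multiply up quickly: the command above uses 12 values each for `-A`, `-B` and `-C`, 16 for `-M`, and a fixed `-T`, so the grid to precompute contains:

```bash
echo $(( 12 * 12 * 12 * 16 ))  # -> 27648 parameter combinations
```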
Once the grid has been computed with `makegrid`, it can be evaluated against the observed data (here the windows tally) with `gridsearch`:
gimble gridsearch -z analysis.z \
-g makegrid/IM_BA_grid -d tally/windows_kmax2 -p 50
Finally, `simulate` generates coalescent simulations of tallies, either based on fixed parameter estimates or on gridsearch results:

# based on fixed parameters
gimble simulate -z analysis.z \
--seed 19 --replicates 100 --windows 11217 --blocks 500 \
--block_length 64 -a 10 -b 10 \
-k 2,2,2,2 -s IM_BA_sims_100reps -p 55 -m IM_BA \
-A 1_407_027 -B 545_041 -C 923_309 -M 7.3778010583E-07 -u 2.9e-9 -T 4_256_034
# based on gridsearch results
gimble simulate -z analysis.z \
--seed 19 --replicates 100 --windows 11217 --blocks 500 \
--block_length 64 -a 10 -b 10 \
--gridsearch_key gridsearch/windows_kmax2/IM_BA_grid \
-k 2,2,2,2 -s IM_BA_grid -p 55 -u 2.9e-9