This repo contains scripts and utilities for analyzing tandem repeats (TRs).
Installation
To install the latest version using pip, run:
python3 -m pip install --upgrade https://github.com/broadinstitute/str-analysis/archive/refs/heads/main.zip
or use the Docker image (though it may not have the latest version of the code):
docker run -it weisburd/str-analysis:latest
Tools
- call_non_ref_motifs (docs) - takes a bam/cram file and, optionally, an ExpansionHunter variant catalog. Then, for each
locus, it determines which STR motifs are supported by reads overlapping that locus before running ExpansionHunter on the motif(s) it detected.
- filter_vcf_to_STR_variants - takes a single-sample VCF file and filters it to the INS/DEL variants that represent
tandem repeat expansions or contractions by performing a brute-force k-mer search on each variant's inserted or deleted
bases. This tool was a core part of Weisburd, B., Tiao, G. & Rehm, H. L. Insights from a genome-wide truth set of tandem repeat variation. (2023)
- merge_loci - takes one or more STR catalogs and combines them into a single catalog while removing
duplicates based on overlap and repeat motif.
- annotate_and_filter_str_catalog - takes an STR catalog and annotates the loci based on their overlap with genes
and known disease-associated STRs. It then allows filtering by motif size, gene region, and various other criteria.
- compute_catalog_stats - takes an annotated catalog output by the annotate_and_filter_str_catalog script and
computes various summary statistics about it.
- add_offtarget_regions - takes an ExpansionHunter variant catalog and adds a list of off-target regions to each
locus definition by querying a database of off-target regions that have been precomputed for each TR motif.
This database was generated by using wgsim to simulate fully-repetitive reads for each motif, and then recording
where these reads mapped on hg19 and hg38 after aligning them using bwa.
- add_adjacent_loci_to_expansion_hunter_catalog - takes an ExpansionHunter variant catalog and a bed file containing
all simple repeats in the reference genome. Outputs a new catalog with updated LocusStructures and ReferenceRegions
that include any adjacent repeats found near each locus in the input catalog.
- check_trios_for_mendelian_violations - takes a table of combined ExpansionHunter calls generated by the
combine_str_json_to_tsv script, as well as a FAM or PED file with parent/child relationships, and outputs a table of
Mendelian violations in the callset.
- simulate_str_expansions - uses wgsim to generate .bam files with simulated read data containing STR expansions
at a given locus, with a given number of repeats, motif, zygosity, etc.
- filter_out_loci_with_Ns_in_flanks - removes loci from an ExpansionHunter catalog if their flanks contain enough Ns to trigger an ExpansionHunter error.
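
To illustrate the brute-force k-mer search mentioned for filter_vcf_to_STR_variants, here is a minimal sketch. It is not the tool's actual implementation: the function name `find_repeat_unit` and the `min_repeats` parameter are hypothetical, and it assumes the simplest case, where the inserted or deleted bases consist entirely of whole copies of a single motif. The real tool may also handle partial repeats and flanking reference sequence.

```python
def find_repeat_unit(sequence, min_repeats=2):
    """Return (motif, repeat_count) if `sequence` is made up entirely of
    whole copies of one motif, trying the shortest motif first.
    Return (None, 0) if no such motif exists.

    Hypothetical helper for illustration only.
    """
    n = len(sequence)
    # Try every candidate motif size that allows at least `min_repeats` copies.
    for motif_size in range(1, n // min_repeats + 1):
        if n % motif_size != 0:
            continue  # the sequence cannot be tiled by motifs of this size
        motif = sequence[:motif_size]
        repeat_count = n // motif_size
        if motif * repeat_count == sequence:
            return motif, repeat_count
    return None, 0


if __name__ == "__main__":
    print(find_repeat_unit("CAGCAGCAG"))  # inserted bases of a CAG expansion
    print(find_repeat_unit("ACGT"))       # not a tandem repeat
```

Because candidate sizes are tried in increasing order, a sequence such as ATATATAT is reported with the motif AT rather than ATAT.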
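
The idea behind a Mendelian violation check, as done by check_trios_for_mendelian_violations, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the script's actual logic: the function name `is_mendelian_violation` is hypothetical, genotypes are assumed to be pairs of exact repeat counts, and no allowance is made for genotyping uncertainty such as confidence intervals.

```python
def is_mendelian_violation(child, mother, father):
    """Return True if the child's genotype cannot be explained by inheriting
    one allele from the mother and the other from the father.

    Each argument is a pair of repeat counts, one per allele.
    Hypothetical helper for illustration only; assumes exact allele matches.
    """
    allele1, allele2 = child
    consistent = (
        (allele1 in mother and allele2 in father)
        or (allele2 in mother and allele1 in father)
    )
    return not consistent


if __name__ == "__main__":
    # The child has 10 repeats (present in the mother) and 12 (present in the father).
    print(is_mendelian_violation((10, 12), (10, 15), (12, 20)))
    # The child's 30-repeat allele is found in neither parent.
    print(is_mendelian_violation((10, 30), (10, 15), (12, 20)))
```

In practice, repeat-count estimates are noisy, so a check that requires exact matches will likely flag more violations than one that tolerates small differences.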