ChaissonLab / danbing-tk

Toolkit for VNTR genotyping and repeat-pan genome graph construction
BSD 3-Clause "New" or "Revised" License
23 stars 3 forks source link
genome-graph genotyping graph-aligner tandem-repeats vntr

danbing-tk

A toolkit for variable number tandem repeats (VNTRs) analysis, which enables:

  1. building repeat-pan genome graphs (RPGG) given haplotype-resolved assemblies for genome-wide profiling or simply VNTR alleles for targeted genotyping (referred to as danbing-tk build),
  2. genotyping each VNTR as a set of (k-mer, count) given short-read sequencing (SRS) data (referred to as danbing-tk align), and
  3. estimating VNTR or motif dosage from the genotype with bias correction (referred to as danbing-tk predict).

Our initial manuscript illustrates the key concept of this tool. The latest update details improvements on QC and bias correction, and extended applications on eQTL discoveries with motif compositions.

Download Releases

The latest release v1.3.1-manuscript comes with the lastest version of VNTR set, RPGG, and QC statistics.

File Input of Output of
VNTR set tr.good.bed danbing-tk build
RPGG pan.tr.kmers, pan.kmerDBi.umap, pan.kmerDBi.vv, pan.graph.umap danbing-tk align danbing-tk build or vntr2kmers_thread

Building on Linux

git clone --recursive https://github.com/ChaissonLab/danbing-tk
cd danbing-tk && make -j 5

danbing-tk align

Decompress the RPGG RPGG.tar.gz in your working directory.

An example usage to genotype SRS sample using the RPGG:

samtools fasta -@2 -n $SRS.bam |
/$PREFIX/danbing-tk/bin/danbing-tk -gc 85 3 -ae -kf 4 1 -cth 45 -o $OUT_PREF -k 21 -qs pan -fa /dev/stdin -p $THREADS | gzip >$OUT_PREF.aln.gz

danbing-tk align takes ~12 cpu hours to genotype a 30x SRS sample. This will generate $OUT_PREF.tr.kmers and $OUT_PREF.aln.gz output with format specified in File Format.

Important note: If outputs of danbing-tk align are intended to be compared across individuals e.g. association studies, please check the bias_correction notebook before running.

danbing-tk build

Install Dependencies

For users intend to use danbing-tk align or the Scenario 1 of danbing-tk build, this step is not required.

The danbing-tk build pipeline and danbing-tk predict require several external packages. It is recommended to install all requirements using conda as follows:

conda create -n $MY_ENVIRONMENT -c conda-forge -c bioconda \
    python=3.11.4 snakemake=7.30.1 minimap2=2.26 samtools=1.17 bedtools=2.31.0 statsmodels=0.14.0 matplotlib=3.7.2
conda activate $MY_ENVIRONMENT

To check if everything is configured properly (tested on v1.3.2):

  1. Go to /$PREFIX/danbing-tk/test/
  2. Replace $PREFIX in goodPanGenomeGraph.json and input/genome.bam.tsv with the path to danbing-tk
  3. Run snakemake -p -s ../pipeline/GoodPanGenomeGraph.snakefile -j 4 --forceall --output-wait 3

Running danbing-tk build

Scenario 1: building an RPGG for a single TR locus given VNTR alleles

Scenario 2: building an RPGG for a VNTR set given assemblies

Submitting jobs to cluster is preferred as danbing-tk build is compute-intensive, ~1200 cpu hours for the original dataset. Otherwise, remove --cluster and its parameters to run jobs locally.

danbing-tk predict

Locus- and sample-specific biases are critical for normalizing the sum of k-mer counts to VNTR dosage (as a proxy for predicted length) and normalizing the average of k-mer counts to motif dosage. The bias for each locus in each sample is computed from the deviation of local read depth from the global mean given a set of invariant k-mers. Examples of this analysis can be found here

Automated bias correction has been added since v1.3.2. Invariant k-mer metadata ikmer.meta (human readable version ikmer.meta.txt) and example trkmers.meta.txt can be found in Assets.

Example usage:

/$PREFIX/danbing-tk/bin/danbing-tk-pred trkmers.meta.txt ikmer.meta corrected.gt.tsv bias.tsv

Caveat: Estimated k-mer dosage could be inaccurate if the bias term is too close to zero.

Miscellaneous

Leave-one-out analysis

To evaluate the quality of custom RPGG on matching SRS dataset, copy /$PREFIX/danbing-tk/pipeline/leaveOneOut.snakefile to your working directory and edit accordingly. Run the snakemake pipleine with:

snakemake -p -s /$PREFIX/pipeline/LeaveOneOut.snakefile -j 40 --cluster \
    "{params.copts} -c {resources.cores} --mem={resources.mem}G -k" \
    --rerun-incomplete --restart-times 1 --output-wait 30

Submitting jobs to cluster is preferred as this analysis is compute-intensive; otherwise, remove --cluster and its parameters to run jobs locally.

File Format

*.graph.kmers

>locus i
kmer0   out_edges0
kmer1   out_edges1
...
>locus i+1
...

out_edges denotes the presence of T/G/C/A as the next nucleotide encoded with 4 bits.

*.(tr|ntr).kmers

>locus i
kmer0   kmer_count0
kmer1   kmer_count1
...
>locus i+1
...

The second field is optional.

Important Note: the output of danbing-tk align do not contain locus info and the first field for minimal disk usage. The table can be reconstructed using the danbing_aln_output.tr_kmers.metadata.txt.gz from metadata.tar.gz on Zenodo

Alignment output (-a option)