XuewenWangUGA / TRcaller

A breakthrough bioinformatics tool TRcaller in Java: the fastest and most accurate STR/SSR/TR genotyping tool
https://www.trcaller.com/index.aspx

TRcaller 2.0

Features

Updates

Newest version: v2.0. TRcaller has been updated to version 2.0 for better performance, higher accuracy, and easier use.

TRcaller aims to be the fastest and most accurate tandem repeat genotyping tool for both short and long sequencing reads from Illumina, PacBio, and Nanopore platforms.

Install TRcaller

The software can be downloaded and used directly; no compilation or installation is required. Get it from GitHub:

git clone https://github.com/XuewenWangUGA/TRcaller

or download the ZIP archive and unzip it into a folder named TRcaller.

Update the Java runtime environment if necessary. TRcaller runs on the Java runtime environment (Java SE) version 17 or higher.

If your computer has an older Java runtime, please install the newest Java or Java SE 17 or higher from https://www.oracle.com/java/technologies/downloads/. Either Java or Java SE should work. If the newly installed Java is not the default one on your system, put its path before java in the command. For example, if the downloaded Java binary is in c:/java21/bin, run TRcaller as follows:

javaPath=c:/java21/bin

$javaPath/java -jar TRcaller.jar
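To confirm that a given Java runtime meets the Java SE 17 requirement before running TRcaller, a small check like the following may help. This is only a sketch and not part of TRcaller: the function names parse_java_major and java_is_new_enough are hypothetical, and the parsing assumes the usual `java -version` banner format.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_281" means Java 8.
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def java_is_new_enough(java_cmd: str = "java", minimum: int = 17) -> bool:
    """Run `<java_cmd> -version` and compare the major version with `minimum`."""
    # `java -version` writes its banner to stderr, so both streams are inspected.
    result = subprocess.run([java_cmd, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout) >= minimum
```

For the example path above, the call would be java_is_new_enough("c:/java21/bin/java").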

Quick start

After downloading the tool, run it on the test data set that comes with TRcaller. The jar file name may carry the version number after TRcaller (e.g. TRcallerv2.0.jar); you can remove the version number as needed. Only two files are required: the BED file with the positions of the TRs in a reference genome, and the read alignment file. If you use the default BED file "ForensicCODIS_v1.1.bed", the alignment file is the only mandatory input. Type one of the following commands in your terminal:

java -jar TRcaller.jar -i HG002.GRCh38.2x250.subset.bam

or

java -jar TRcaller.jar -b ForensicCODIS_v1.1.bed -i HG002.GRCh38.2x250.subset.bam

or

java -Xmx2G -jar TRcaller.jar -b ForensicCODIS_v1.1.bed -i HG002.GRCh38.2x250.subset.bam

The memory option (-Xmx) is optional. For a very large file, you can increase the memory, e.g. -Xmx10G.

The following files will be generated. The test output files from human sample HG002 are available on GitHub.
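When TRcaller is run on many BAM files from a script, the quick-start commands above can be assembled programmatically. The sketch below is illustrative only: build_trcaller_command and run_trcaller are hypothetical helper names, and the only flags used are -b, -i, and -Xmx as they appear in this README.

```python
import subprocess


def build_trcaller_command(bam, bed=None, max_heap=None,
                           jar="TRcaller.jar", java="java"):
    """Return the TRcaller command line as an argument list."""
    cmd = [java]
    if max_heap:
        # JVM options such as -Xmx must come before -jar.
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", jar]
    if bed:
        # Without -b, the README says the default BED file is used.
        cmd += ["-b", bed]
    cmd += ["-i", bam]
    return cmd


def run_trcaller(bam, **kwargs):
    """Run TRcaller and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_trcaller_command(bam, **kwargs), check=True)
```

For example, build_trcaller_command("HG002.GRCh38.2x250.subset.bam", bed="ForensicCODIS_v1.1.bed", max_heap="2G") reproduces the third command above as a list of arguments.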

Result files:
Output raw haplotype: HG002.GRCh38.2x250.subset.bam.TRcaller_Hap.raw.txt
Output report: HG002.GRCh38.2x250.subset.bam.TRcaller_Hap.rept.txt
Output statistical summary: HG002.GRCh38.2x250.subset.bam.TRcaller_Hap.stat.txt
All in one Excel file: HG002.GRCh38.2x250.subset.bam.TRcaller.xlsx
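In the example above, each result file name is the input BAM file name plus a fixed suffix. Assuming that pattern holds for other inputs when no output prefix is given (an assumption based only on this one example), the expected names can be derived as in the sketch below. The function names are hypothetical.

```python
from pathlib import Path

# Suffixes copied from the result file names listed above.
OUTPUT_SUFFIXES = {
    "raw": ".TRcaller_Hap.raw.txt",
    "report": ".TRcaller_Hap.rept.txt",
    "stat": ".TRcaller_Hap.stat.txt",
    "excel": ".TRcaller.xlsx",
}


def expected_outputs(bam_name: str) -> dict:
    """Map each result type to the file name expected for `bam_name`."""
    return {kind: bam_name + suffix for kind, suffix in OUTPUT_SUFFIXES.items()}


def missing_outputs(bam_name: str, folder: str = ".") -> list:
    """List the result types whose expected file is absent from `folder`."""
    return [kind for kind, name in expected_outputs(bam_name).items()
            if not (Path(folder) / name).exists()]
```

Calling missing_outputs after a run gives a quick check that all four files were written.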

Understand the results' data format

The "_Hap.raw.txt" file contains all TR allele candidates found in the input read alignment file.

The "_Hap.rept.txt" file and the Microsoft Excel ".xlsx" file have the same content: the TR alleles that remain after noisy candidates are filtered out. The data are tab-separated and look like this:

#Report is generated by TRcaller v2.0
#The general report with details
#Marker Count   Read_proportion Sample_hap_length   Ref_hap_length  Sample_allele   Ref_allele  Haplotype   Validation
D1S1656 31  0.6078431372549019  52  68  13  17  CCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
D1S1656 19  0.37254901960784315 56  68  14  17  CCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
TPOX    53  0.8688524590163934  32  32  8   8   AATGAATGAATGAATGAATGAATGAATGAATG    PASS
D2S441  25  0.49019607843137253 44  48  11  12  TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
D2S441  21  0.4117647058823529  60  48  15  12  TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATTTATCTATCTA    PASS
D2S1338 15  0.4411764705882353  88  92  22  23  GGAAGGAAGGACGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGCAGGCAGGCAGGCAGGCAGGCA    PASS
D2S1338 14  0.4117647058823529  96  92  24  23  GGAAGGAAGGACGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGCAGGCAGGCAGGCAGGCAGGCA    PASS
D3S1358 34  0.5074626865671642  64  64  16  16  TCTATCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
D3S1358 30  0.44776119402985076 60  64  15  16  TCTATCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
FGA 19  0.4634146341463415  80  88  20  22  GGAAGGAAGGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAGAAAAAAGAAAGAAAGAAA    PASS
FGA 18  0.43902439024390244 92  88  23  22  GGAAGGAAGGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAGAAAAAAGAAAGAAAGAAA    PASS
D5S818  34  0.5483870967741935  48  44  12  11  ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT    PASS
D5S818  26  0.41935483870967744 44  44  11  11  ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT    PASS
CSF1PO  33  0.6346153846153846  48  52  12  13  ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT    PASS
CSF1PO  17  0.3269230769230769  40  52  10  13  ATCTATCTATCTATCTATCTATCTATCTATCTATCTATCT    PASS
D7S820  24  0.46153846153846156 48  52  12  13  TATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATC    PASS
D7S820  23  0.4423076923076923  44  52  11  13  TATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATC    PASS
D8S1179 27  0.5 52  52  13  13  TCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
D8S1179 24  0.4444444444444444  64  52  16  13  TCTATCTATCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA    PASS
D10S1248    24  0.4444444444444444  64  52  16  13  GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA    PASS
D10S1248    22  0.4074074074074074  56  52  14  13  GGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAA    PASS
TH01    30  0.6122448979591837  36  28  9   7   AATGAATGAATGAATGAATGAATGAATGAATGAATG    PASS
TH01    17  0.3469387755102041  39  28  9.3 7   AATGAATGAATGAATGAATGAATGATGAATGAATGAATG PASS
vWA 26  0.49056603773584906 72  68  18  17  TAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGATAGA    PASS
vWA 25  0.4716981132075472  64  68  16  17  TAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGATAGA    PASS
D12S391 18  0.46153846153846156 88  76  22  19  AGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGACAGACAGAC    PASS
D12S391 18  0.46153846153846156 88  76  22  19  AGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGACAGACAGACAGACAGACAGACAGACAGACAGAT    PASS
D13S317 24  0.46153846153846156 52  44  13  11  TATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATC    PASS
D13S317 24  0.46153846153846156 44  44  11  11  TATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATC    PASS
D16S539 64  0.9142857142857143  44  44  11  11  GATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATA    PASS
D18S51  29  0.5686274509803921  52  72  13  18  AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA    PASS
D18S51  21  0.4117647058823529  64  72  16  18  AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA    PASS
D19S433 16  0.48484848484848486 56  64  14  14  CCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTACCTTCTTTCCTT    PASS
D19S433 10  0.30303030303030304 66  64  16.2    14  CCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTACCTTTTCCTT  PASS
D21S11  22  0.6875  126 127 31.2    29  TCTATCTATCTATCTATCTATCTGTCTGTCTGTCTGTCTGTCTGTCTATCTATCTATATCTATCTATCTATCATCTATCTATCCATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATATCTA   PASS
D21S11  6   0.1875  120 127 30  29  TCTATCTATCTATCTATCTGTCTGTCTGTCTGTCTGTCTGTCTATCTATCTATATCTATCTATCTATCATCTATCTATCCATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTA PASS
D22S1045    49  0.9245283018867925  48  51  16  17  ATTATTATTATTATTATTATTATTATTATTATTATTATTACTATTATT    PASS
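The report shown above can be loaded for downstream analysis with a few lines of code. The following is a minimal sketch, assuming the file is tab-separated with the nine columns of the header line and that lines starting with "#" are comments. The names TRAllele, parse_report, and group_by_marker are hypothetical and not part of TRcaller.

```python
from collections import defaultdict
from typing import NamedTuple


class TRAllele(NamedTuple):
    marker: str
    count: int
    read_proportion: float
    sample_hap_length: int
    ref_hap_length: int
    sample_allele: str  # kept as text: "9.3" is allele notation, not a decimal
    ref_allele: str
    haplotype: str
    validation: str


def parse_report(lines):
    """Parse report lines into TRAllele records, skipping '#' comment lines."""
    alleles = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        marker, count, prop, s_len, r_len, s_al, r_al, hap, val = fields
        alleles.append(TRAllele(marker, int(count), float(prop),
                                int(s_len), int(r_len), s_al, r_al, hap, val))
    return alleles


def group_by_marker(alleles):
    """Collect the alleles reported for each marker, in file order."""
    grouped = defaultdict(list)
    for allele in alleles:
        grouped[allele.marker].append(allele)
    return dict(grouped)
```

With open("HG002.GRCh38.2x250.subset.bam.TRcaller_Hap.rept.txt") as handle, parse_report(handle) returns one record per row. A marker with a single record, such as TPOX above, has one reported allele, while most markers in this sample have two.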

The "_Hap.stat.txt" file contains the statistical summary.

Help and options

For help and more advanced options, type the following command:

java -jar TRcallerv2.0.jar

TRcaller v2.0
usage: java -jar -Xmx10G TRcaller.jar [options]
 -b,--bed <arg>      required, .bed format configure file with a path
 -c,--count <arg>    integer, minimum count of supported reads for report TR alleles,default [2]
 -i,--input <arg>    required, input BAM file with a path
 -l,--log <arg>      string, log file name, default [log.txt]
 -o,--output <arg>   prefix of output file name ofr saving result
 -r,--ratio <arg>    float, minimum value of supported read ratio of all reads at each loci, default [0.05]
 -s,--source <arg>   integer, source type of DNA, 1 for single individual, 2 or higher for DNA mixture, default [1]. for XY chromosomes, it should be set to 3
 -t,--thread <arg>   integer, the number of computing threads, default [2]
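The -c and -r options set a minimum read count and a minimum read ratio for reporting an allele. The sketch below illustrates that kind of thresholding on rows taken from the example report. It is not TRcaller's internal code: the function name is hypothetical, and treating both limits as inclusive is an assumption.

```python
def filter_alleles(rows, min_count=2, min_ratio=0.05):
    """Keep (marker, count, read_proportion) rows that meet both minimums.

    The defaults mirror the documented defaults of -c and -r; the inclusive
    comparison (>=) is an assumption, not confirmed TRcaller behavior.
    """
    return [row for row in rows
            if row[1] >= min_count and row[2] >= min_ratio]


# The two D21S11 rows from the example report above.
d21s11 = [("D21S11", 22, 0.6875), ("D21S11", 6, 0.1875)]

kept_default = filter_alleles(d21s11)               # both rows pass
kept_strict = filter_alleles(d21s11, min_count=10)  # only the 22-read row passes
```

Stricter thresholds like these can also be applied after the run to an already generated report, without running TRcaller again.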

Scripts for the paper

The scripts and settings files for the TRcaller paper (https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1227176/full) are in the folder TRcaller_paper_files, scripts for TRcaller versions between 2.0

TRcaller generates identical results for the 20 core CODIS STRs regardless of which forensic sequencing kit is used.

TRcaller online version

A free test account is available at www.trcaller.com/index.aspx

Abstract of the paper:

Calling tandem repeat (TR) variants from DNA sequences is of both theoretical and practical significance. Some bioinformatics tools have been developed for detecting or genotyping TRs. However, little study has been done to genotyping TR alleles from long-read sequencing data, and the accuracy of genotyping TR alleles from next generation sequencing data still needs to be improved. Herein, a novel algorithm is described to retrieve TR regions from sequence alignment, and a software program TRcaller has been developed and integrated into a web portal to call TR alleles from both short- and long-read sequences, both whole genome and targeted sequences generated from multiple sequencing platforms. All TR alleles are genotyped as haplotypes and the robust alleles will be reported, even multiple alleles in a DNA mixture. TRcaller could provide substantially higher accuracy (> 99% in 289 human individuals) in detecting TR alleles with magnitudes faster (e.g., ~2 seconds for 300x human sequence data) than the mainstream software tools. The web portal preselected 119 TR loci from forensics and disease plus customer giving TR loci. TRcaller is validated to be scalable in various applications, such as DNA forensics and disease diagnosis, which can be expanded into other fields like breeding programs.

Availability: TRcaller is available at www.trcaller.com/index.aspx.

More detailed information is on https://github.com/Ge-Lab/TRcaller

Manual: Manual_TRcaller_aug28_2023.pdf


Colorful visualization of TR alleles

Fig 1. D8S1179 colorful STR alleles

Fig 2. D19S433 colorful STR alleles

Citation:

X Wang, H Meng, B Budowle, J Ge. 2023. TRcaller: a novel tool for precise and ultrafast tandem repeat variant genotyping in massively parallel sequencing reads. Frontiers in Genetics. https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1227176/full