SayakaMiura / TopHap

TopHap_v1.2.1

Description

TopHap infers bootstrap-supported phylogenies of common haplotypes in the given data. See Caraballo-Ortiz et al. (ref. 1) for details. The TopHap program was developed by Sudhir Kumar and is written in Python. You are free to download, modify, and expand this code under a permissive license similar to the BSD 2-Clause License (see below).

Dependencies

  1. python 3 (v3.7.9 and v3.8.3 were tested)

    python packages:

    numpy

    biopython

    Note: If installing these Python packages is difficult, consider using Anaconda for Python 3 (https://www.anaconda.com/distribution/), or try python3 -m pip install [package name].

  2. R (v3.5.3 and v4.0.3 were tested)

    R packages:

    ape

    phangorn

    Please make sure the “Rscript” command is functional.

  3. MEGA

    Please download the latest version from https://www.megasoftware.net/.

  4. RAxML (optional)

    It can be downloaded at https://cme.h-its.org/exelixis/web/software/raxml/.
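Before running TopHap, it may help to confirm that the dependencies above are visible from your environment. The following is a minimal sketch, not part of TopHap: the function names are hypothetical, and it only checks that the Python packages can be located and that an executable named Rscript is on PATH. It does not verify that the R packages ape and phangorn are installed, nor does it look for MEGA or RAxML.

```python
import importlib.util
import shutil

# pip package name -> import name (biopython is imported as "Bio").
PYTHON_PACKAGES = {"numpy": "numpy", "biopython": "Bio"}


def missing_python_packages(packages=PYTHON_PACKAGES):
    """Return the pip names of packages whose module cannot be located."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]


def missing_executables(names=("Rscript",)):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for pkg in missing_python_packages():
        print(f"Missing Python package: {pkg} "
              f"(try: python3 -m pip install {pkg})")
    for exe in missing_executables():
        print(f"Executable not found on PATH: {exe}")
```

If the script prints nothing, both Python packages and Rscript were found.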

How to use TopHap

1. Run vcf_json_parse.py. (optional)

Given the alignment of all genomes (i.e., aligned with the Wuhan1 reference genome sequence), common nucleotides (positions that exceed the desired minor allele frequency (MAF) threshold, e.g., > 5%) are extracted, and haplotype alignments that contain only these genomic positions are generated for each spatiotemporal slice of the dataset (country and sampling month).

python3 vcf_json_parse.py [input full genome alignment] --reference Wuhan1Gnome.fasta --min_subgroup_size 500 --one_based --skip_mismatches --min_freq 0.05 -o [output directory]
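To illustrate the idea behind this step, here is a minimal sketch of selecting alignment columns by minor allele frequency and building haplotypes from them. It is not the code of vcf_json_parse.py: the function names are hypothetical, the MAF is assumed to be the frequency of the second most common base among A, C, G, and T, and the strict inequality mirrors the "> 5%" example above. Positions are 0-based.

```python
from collections import Counter


def common_variant_positions(sequences, min_freq=0.05):
    """Return 0-based columns whose minor allele frequency exceeds min_freq."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    keep = []
    for col in range(length):
        # Count unambiguous bases only; gaps and ambiguity codes are ignored.
        counts = Counter(seq[col].upper() for seq in sequences)
        counts = {base: n for base, n in counts.items() if base in "ACGT"}
        total = sum(counts.values())
        if total == 0 or len(counts) < 2:
            continue  # invariant or uninformative column
        # Assumption: MAF is the frequency of the second most common base.
        maf = sorted(counts.values(), reverse=True)[1] / total
        if maf > min_freq:
            keep.append(col)
    return keep


def extract_haplotypes(sequences, positions):
    """Reduce each sequence to the bases at the selected positions."""
    return ["".join(seq[p] for p in positions) for seq in sequences]


if __name__ == "__main__":
    genomes = ["ACGT", "ACGA", "ACGT", "TCGA"]
    positions = common_variant_positions(genomes, min_freq=0.05)
    print(positions, extract_haplotypes(genomes, positions))
```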

If you have a FASTA file, you can use the following command to make a JSON file. In the output JSON file, positions are counted from 0. The output file is created in the same directory as the input FASTA file.

python3 Fas2Json.py [input fasta file without outgroup sequence] [Outgroup sequence without sequence ID]

input

options

output files

example

"Example.json" can be processed with:

python3 vcf_json_parse.py Example.json --reference Wuhan1Gnome.fasta --min_subgroup_size 500 --one_based --skip_mismatches --min_freq 0.05 -o ExampleHap

The output alignment files, together with the extracted genomic positions ("Haplotypes.txt"), are stored in the "ExampleHap" directory.

2. Run TopHap.py.

TopHap.py is the main program. It infers bootstrap-supported phylogenies of common haplotypes in the given data.

python3 TopHap.py [haplotype frequency cutoff] [number of bootstrap replicates] -Hap [path to the directory of the haplotype alignments]

Please provide Haplotypes.txt, which lists the genomic positions (counted from position 0) that are used for the haplotype alignment. Haplotypes.txt should be placed in the same directory as the haplotype alignments. An example Haplotypes.txt can be found in the Alignment directory.

options

output files

example

Example datasets can be found in Alignment.

python3 TopHap.py 0.05 100 -Hap Alignment
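The first argument of TopHap.py is the haplotype frequency cutoff (0.05 in the example above). As a rough illustration of what such a cutoff means, the sketch below counts identical haplotypes and keeps those whose relative frequency exceeds the cutoff. This is an assumption-laden sketch, not TopHap's implementation: the function name is hypothetical, and whether TopHap applies a strict or non-strict comparison, or computes frequencies per slice, is not stated here.

```python
from collections import Counter


def common_haplotypes(haplotypes, cutoff=0.05):
    """Return {haplotype: frequency} for haplotypes more frequent than cutoff."""
    total = len(haplotypes)
    if total == 0:
        return {}
    counts = Counter(haplotypes)
    # Assumption: a haplotype is "common" if its frequency is strictly
    # greater than the cutoff.
    return {hap: n / total for hap, n in counts.items() if n / total > cutoff}


if __name__ == "__main__":
    sample = ["AT"] * 8 + ["AA"] + ["TA"]
    print(common_haplotypes(sample, cutoff=0.2))
```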

3. Run TopHap_Attach.py (optional)

TopHap_Attach.py attaches minor haplotype sequences of interest to an existing TopHap phylogeny.

python3 TopHap_Attach.py [TopHap alignment] [TopHap tree] [minor haplotype] [path to raxml]

inputs

output files

example

Example datasets can be found in Example_attach. The TopHap alignment and tree are TopHap_prune.fasta and TopHap_bootstrap1.nwk, respectively. To attach the haplotypes in MinorHap.fasta, run:

python3 TopHap_Attach.py Example_attach\TopHap_prune.fasta Example_attach\TopHap_bootstrap1.nwk Example_attach\MinorHap.fasta raxmlHPC-AVX.exe
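The example command above uses Windows-style backslashes and a Windows RAxML executable name. On other systems the path separator and the executable name will differ. One way to stay portable is to assemble the command in Python with pathlib, as in the sketch below. The helper names are hypothetical; the argument order follows the usage line for TopHap_Attach.py, and the RAxML executable name must be replaced with whatever is installed on your machine.

```python
import subprocess
from pathlib import Path


def build_attach_command(data_dir, raxml, python="python3"):
    """Build the TopHap_Attach.py command line for the example dataset."""
    data_dir = Path(data_dir)
    return [
        python, "TopHap_Attach.py",
        str(data_dir / "TopHap_prune.fasta"),     # TopHap alignment
        str(data_dir / "TopHap_bootstrap1.nwk"),  # TopHap tree
        str(data_dir / "MinorHap.fasta"),         # minor haplotypes to attach
        str(raxml),                               # path to the RAxML executable
    ]


def run_attach(data_dir, raxml):
    """Run TopHap_Attach.py and raise CalledProcessError if it fails."""
    subprocess.run(build_attach_command(data_dir, raxml), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(build_attach_command("Example_attach", "raxmlHPC-AVX.exe")))
```

Call run_attach("Example_attach", "raxmlHPC-AVX.exe") from the TopHap directory to execute the command once the printed line looks right.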

Reference:

[1] Marcos A. Caraballo-Ortiz, Sayaka Miura, Sergei L. K. Pond, Qiqing Tao, and Sudhir Kumar. TopHap: Rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity (2021). Submitted to Bioinformatics.

Copyright 2022, Authors and Temple University

BSD 3-Clause "New" or "Revised" License, which is a permissive license similar to the BSD 2-Clause License except that it prohibits others from using the name of the project or its contributors to promote derived products without written consent. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.