BGI-shenzhen / VCF2Dis

VCF2Dis: a new, simple and efficient tool to calculate a p-distance matrix and construct a population phylogeny from Variant Call Format (VCF) input
MIT License

VCF2Dis

VCF2Dis: a new, simple and efficient tool to calculate a p-distance matrix and construct a phylogeny from Variant Call Format (VCF) input

1) Installation and Parameters


The new version is updated and maintained at hewm2008/VCF2Dis; please use the link below to download the latest version.

hewm2008/VCF2Dis

Download


Run sh make.sh to compile. The executable VCF2Dis can then be found at bin/VCF2Dis.
For Linux/Unix and macOS:

        tar -zxvf  VCF2DisXXX.tar.gz            # if linking fails, try re-installing the zlib library
        cd VCF2DisXXX;                          # and copy it to the library directory:
        sh make.sh;                             # VCF2Dis-xx/src/include/zlib
        ./bin/VCF2Dis
  

Note: If linking fails, try re-installing the zlib library.
Note: R with the ape, dplyr and ggtree packages is recommended.

For more details, please use <b>-help</b> and see the [example](https://github.com/hewm2008/VCF2Dis/blob/main/example)
```php
        -InFormat      <str>   Input File is [VCF/FA/PHY] Format,default: [VCF]
        -InSampleGroup <str>   InFile of sample Group info,format(sample groupA)
        -TreeMethod    <int>   Construct Tree Method,1:NJ-tree 2:UPGMA-tree [1]
        -KeepMF                Keep the Middle File diff & Use matrix
```

2) Example


Three examples are provided in the directory example/Example*

1) An example of an NJ-tree without bootstrap


# 1.1) To create the p_distance matrix and Newick tree for all samples from a VCF, run VCF2Dis directly
      ./bin/VCF2Dis -InPut  in.vcf.gz   -OutPut p_dis.mat
      #  ./bin/VCF2Dis     -InPut  in.fa.gz -OutPut p_dis.mat -InFormat  FA

# 1.2) To create the p_distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
      ./bin/VCF2Dis -InPut  chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat  -SubPop  sample.list
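The bootstrap example in this README passes the matrix to PHYLIP's fneighbor with `-matrixtype s`, which suggests a PHYLIP-style square distance matrix. The following is a minimal sketch for loading such a file in Python, assuming whitespace-separated fields (a count line, then one `name d1 d2 ... dN` row per sample). The function name `read_square_matrix`, the sample names and the values are hypothetical; check your own p_dis.mat before relying on this layout.

```python
def read_square_matrix(text):
    """Parse a PHYLIP-style square distance matrix (assumed layout, see lead-in)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])          # first line: number of samples
    names, matrix = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        if len(fields) < n + 1:
            raise ValueError(f"expected {n} distances in row: {ln!r}")
        names.append(fields[0])           # sample name
        matrix.append([float(v) for v in fields[1:n + 1]])
    if len(names) != n:
        raise ValueError(f"expected {n} rows, found {len(names)}")
    return names, matrix


# Hypothetical three-sample matrix, for illustration only.
example = """\
    3
S1   0.000000 0.125000 0.250000
S2   0.125000 0.000000 0.375000
S3   0.250000 0.375000 0.000000
"""

names, matrix = read_square_matrix(example)

# Find the closest pair of distinct samples.
closest = min(
    (matrix[i][j], names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
)
print(closest)
```

For a real run, replace `example` with the file contents, e.g. `open("p_dis.mat").read()`.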

2) An example of an NJ-tree with bootstrap

#!/bin/bash
# Step 1: run VCF2Dis NN times with -Rand 0.25, writing one output per run.
# NN defaults to 100; pass a different number as the first argument.
NN=100
if [ "$#" -eq 1 ]; then
    NN=$1
fi
for X in $(seq 1 "$NN")
do
    ./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
    # PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
done

#!/bin/bash
# Step 2: merge the trees from step 1, build the consensus tree with fconsense,
# then run percentageboostrapTree.pl on the result.
NN=100
if [ "$#" -eq 1 ]; then
    NN=$1
fi

cat  p_*.nwk  >    alltree_merge.tre   #  cat  tree*.tre  > alltree_merge.tre
PHYLIPNEW-3.69.650/bin/fconsense   -intreefile   alltree_merge.tre  -outfile out  -treeprint Y
perl  ./bin/percentageboostrapTree.pl    alltree_merge.treefile    "$NN"    Final_boostrap.tre  # NN is the number of runs used in step 1

For how to install PHYLIPNEW, please click here or click here (Chinese).


4) Introduction


The formula for calculating the p-distance between individuals from VCF SNP datasets is given below:

            D_ij=(1/L) * [(sum(d(l)_ij))]


where L is the length of the regions in which SNPs can be identified and, given that the alleles at position l are A/C, the distance between sample i and sample j at that position is:

            d(l)_ij=0.0     if the genotypes of the two individuals are AA and AA;
            d(l)_ij=0.5     if the genotypes of the two individuals are AA and AC;
            d(l)_ij=0.0     if the genotypes of the two individuals are AC and AC;
            d(l)_ij=1.0     if the genotypes of the two individuals are AA and CC;
            d(l)_ij=0.0     if the genotypes of the two individuals are CC and CC;

To learn more about the p_distance matrix computed from a VCF file, please refer to this website.
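As a worked illustration of the formula, the sketch below scores each site by half the absolute difference in C-allele counts, which reproduces the five cases listed above. The AC versus CC case is not listed in this README; the value 0.5 it gets here is an assumption by symmetry with AA versus AC. The toy also treats L as the number of sites compared and ignores missing genotypes. The names `site_distance` and `p_distance` are hypothetical and this is not the VCF2Dis source code.

```python
# Alternate-allele dosage for a biallelic A/C site (toy encoding).
DOSAGE = {"AA": 0, "AC": 1, "CA": 1, "CC": 2}


def site_distance(g1, g2):
    """d(l)_ij: half the absolute difference in C-allele counts."""
    return abs(DOSAGE[g1] - DOSAGE[g2]) / 2


def p_distance(sample_i, sample_j):
    """D_ij = (1/L) * sum of d(l)_ij, with L taken as the number of sites compared."""
    if len(sample_i) != len(sample_j):
        raise ValueError("samples must cover the same sites")
    if not sample_i:
        raise ValueError("no sites to compare")
    total = sum(site_distance(a, b) for a, b in zip(sample_i, sample_j))
    return total / len(sample_i)


# Toy genotypes for two samples at four sites.
sample_i = ["AA", "AA", "CC", "AC"]
sample_j = ["AA", "AC", "AA", "AC"]

# Per-site distances are 0, 0.5, 1 and 0, so D_ij = 1.5 / 4 = 0.375.
print(p_distance(sample_i, sample_j))
```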

5) Results


VCF2Dis has been cited more than 150 times according to Google Scholar.
Below are some NJ-tree images that I drew for earlier papers.

example1.png

6) Discussion


######################swimming in the sky and flying in the sea ########################### ##