C++ version: C++11
R version: 3.2.2 or above, but not exceeding 4.0
Python version: 3.6 or higher
KMC (https://github.com/refresh-bio/KMC.git)
Attention: compiling with C++17 may produce errors. When naming files, use '_' as the separator instead of '.'; a '.' in a file name will cause program errors.
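The naming rule above can be applied before running anything. The following is a minimal sketch, assuming the rule concerns the part of the name before the extension (the example inputs in this README end in .fq.gz). The helper name sanitize_name and the list of extensions are hypothetical and not part of mike.

```python
from pathlib import Path

# Extensions left untouched; this tuple is an assumption for illustration.
# Longer extensions come first so ".fq.gz" wins over ".fq".
KNOWN_SUFFIXES = (
    ".fastq.gz", ".fasta.gz", ".fq.gz", ".fa.gz",
    ".fastq", ".fasta", ".fq", ".fa", ".txt",
)


def sanitize_name(filename):
    """Replace '.' with '_' in the stem and keep the extension as it is."""
    for suffix in KNOWN_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            return stem.replace(".", "_") + suffix
    # Unknown extension: treat only the last suffix as the extension.
    path = Path(filename)
    return path.stem.replace(".", "_") + path.suffix


print(sanitize_name("sample.A.rep1.fq.gz"))  # sample_A_rep1.fq.gz
```

Names that already follow the rule, such as E200008917_L01_171_1.fq.gz, are returned unchanged.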
If you get an error message when installing the R package, you can skip its installation and build the tree manually in the last step.
git clone https://github.com/Argonum-Clever2/mike.git
cd mike/src
make
# install R package
Rscript install.r
Install KMC in advance and add kmc to your PATH. Then run the commands below to process each input file into a kmer file. If you already have kmer files, you can use them directly and skip this step.
# help
python kmc.py --help
# run
python kmc.py -f file1 file2 file3 file4 file5 file6 -d dirpath
For example, given multiple fastq files, use kmc.py to process them into kmer files:
python ../mike-master/src/kmc.py -f /data0/stu_wangfang/tmptmp/E200008917_L01_171_1.fq.gz /data0/stu_wangfang/tmptmp/E200008917_L01_171_2.fq.gz -d /data0/stu_wangfang/tmp -t 10
This produces a kmer file (E200008917_L01_171.txt) in plain-text format; its format is described further below.
If the kmc.py script fails, you can run the kmc commands directly to process any KMC-supported input format into kmer files.
# kmc--first step
## single-end sequencing file
kmc -k21 -t10 INPUT.fastq OUTPUT_PREFIX DIRPATH
## paired-end sequencing files: write the paths of both files to a file list (INPUT.fastq.list)
kmc -k21 -t10 @INPUT.fastq.list OUTPUT_PREFIX DIRPATH
# kmc--second step
kmc_tools transform OUTPUT_PREFIX sort . dump -s OUTPUT_PREFIX.txt
If your input files are in FASTA format, add the -fm parameter to the kmc command.
# example: list the absolute paths of all files whose names start with E200008917
for f in E200008917*; do echo "$(pwd)/${f}" >> list; done
kmc -k21 -fq -t10 @list E200008917 .
kmc_tools transform E200008917 sort . dump -s E200008917.txt
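If kmc.py is not usable, the two kmc steps above can also be driven from a small script of your own. The following is a minimal sketch that only assembles the command lines shown in this README. The function name build_kmc_commands and its parameters are hypothetical, and it does not reproduce what kmc.py does internally.

```python
def build_kmc_commands(inputs, prefix, workdir=".", k=21, threads=10, fmt_flag=None):
    """Return (kmc argv, kmc_tools argv, list-file text or None) for one sample."""
    count = ["kmc", "-k{}".format(k), "-t{}".format(threads)]
    if fmt_flag is not None:
        count.append(fmt_flag)  # for example "-fq" or "-fm", as in the text above
    if len(inputs) == 1:
        # single-end: the file is passed directly
        source, list_text = inputs[0], None
    else:
        # several files: kmc reads their paths from a list file given as @name
        source = "@{}.list".format(prefix)
        list_text = "\n".join(inputs) + "\n"
    count += [source, prefix, workdir]
    dump = ["kmc_tools", "transform", prefix, "sort", ".",
            "dump", "-s", prefix + ".txt"]
    return count, dump, list_text


count, dump, list_text = build_kmc_commands(
    ["/data/reads_1.fq.gz", "/data/reads_2.fq.gz"], "sample", fmt_flag="-fq")
print(" ".join(count))
print(" ".join(dump))
```

To actually run the steps, write list_text to the list file (here sample.list) and pass each argv list to subprocess.run(argv, check=True), with kmc and kmc_tools on your PATH.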
A kmer file should look like the following. Each line consists of a 21-mer and a number giving the frequency of occurrence of that 21-mer, separated by a tab ('\t').
AAAAAAAAAAAAAAAAAAAAA 255
AAAAAAAAAAAAAAAAAAAAC 255
AAAAAAAAAAAAAAAAAAAAG 255
AAAAAAAAAAAAAAAAAAAAT 255
AAAAAAAAAAAAAAAAAAACA 255
AAAAAAAAAAAAAAAAAAACC 255
AAAAAAAAAAAAAAAAAAACG 255
... ...
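If you supply your own kmer files, it can help to check them against the format described above before sketching. The following is a minimal sketch, assuming upper-case A/C/G/T kmers of length 21 and a single tab per line. The function name read_kmer_counts is hypothetical and not part of mike.

```python
def read_kmer_counts(lines, k=21):
    """Parse 'KMER<TAB>COUNT' lines into a dict, raising on malformed lines."""
    counts = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        kmer, tab, count = line.partition("\t")
        if not tab:
            raise ValueError("line {}: no tab separator".format(line_number))
        if len(kmer) != k or set(kmer) - set("ACGT"):
            raise ValueError("line {}: not a valid {}-mer".format(line_number, k))
        counts[kmer] = int(count)
    return counts


example = [
    "AAAAAAAAAAAAAAAAAAAAA\t255\n",
    "AAAAAAAAAAAAAAAAAAAAC\t255\n",
]
counts = read_kmer_counts(example)
print(len(counts))  # 2
```

For a real file, pass an open file handle instead of the list, for example read_kmer_counts(open("E200008917.txt")).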
The filelist is a file that lists the kmer files, one per line. Each entry must be the absolute path of a kmer file, including the file name.
ABSOLUTE_PATH/kmer_name_file_1
ABSOLUTE_PATH/kmer_name_file_2
ABSOLUTE_PATH/kmer_name_file_3
... ...
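Since every entry must be an absolute path, the filelist is easy to generate rather than type by hand. The following is a minimal sketch; the function name write_filelist and the sample file names are hypothetical.

```python
import os
import tempfile


def write_filelist(kmer_files, filelist_path):
    """Write one absolute path per line and return the paths written."""
    abs_paths = [os.path.abspath(path) for path in kmer_files]
    with open(filelist_path, "w") as handle:
        for path in abs_paths:
            handle.write(path + "\n")
    return abs_paths


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    filelist = os.path.join(tmp, "filelist")
    written = write_filelist(["sample_1.txt", "sample_2.txt"], filelist)
    with open(filelist) as handle:
        entries = handle.read().splitlines()

print(entries == written)  # True
```

Relative names are resolved against the current working directory, so run the script from the directory that holds the kmer files or pass full paths.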
The second step is to process the kmer files in the filelist into sketched files. Note that you need to use absolute paths.
./mike sketch -t 10 -l ABSOLUTE_PATH/filelist -d DIRPATH
Before running this command, create the filelist containing all the kmer files that need to be processed. The output is a sketched file ending with 'jac'.
The sketched filelist is a file that lists the sketched files (the files ending with 'jac' obtained in the previous step), with absolute paths.
ABSOLUTE_PATH/sketched_file_1.jac
ABSOLUTE_PATH/sketched_file_2.jac
ABSOLUTE_PATH/sketched_file_3.jac
... ...
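The sketched filelist can be collected automatically once the sketch step has finished. The following is a minimal sketch, assuming the sketched files were written to a single directory and carry the .jac extension shown above. The function name write_sketched_filelist and the demo file names are hypothetical.

```python
import glob
import os
import tempfile


def write_sketched_filelist(sketch_dir, filelist_path):
    """List every *.jac file in sketch_dir by absolute path, one per line."""
    pattern = os.path.join(os.path.abspath(sketch_dir), "*.jac")
    jac_files = sorted(glob.glob(pattern))
    if not jac_files:
        raise FileNotFoundError("no .jac files found in " + sketch_dir)
    with open(filelist_path, "w") as handle:
        handle.write("\n".join(jac_files) + "\n")
    return jac_files


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a_1.jac", "b_2.jac", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = write_sketched_filelist(tmp, os.path.join(tmp, "sketched_filelist"))

print([os.path.basename(path) for path in found])  # ['a_1.jac', 'b_2.jac']
```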
Compute the pairwise Jaccard coefficients. This generates a file named jaccard.txt in the destination directory (DIRPATH).
./mike compute -l ABSOLUTE_PATH/sketched_filelist -L ABSOLUTE_PATH/sketched_filelist -d DIRPATH
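For readers unfamiliar with the Jaccard coefficient, it is the size of the intersection of two sets divided by the size of their union. The following is a minimal sketch of that definition on exact kmer sets. It is only an illustration: mike works on sketched files, and how it treats kmer frequencies is not described here, so this is not a reimplementation of ./mike compute.

```python
def jaccard(kmers_a, kmers_b):
    """Exact Jaccard coefficient |A & B| / |A | B| of two kmer collections."""
    a, b = set(kmers_a), set(kmers_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets (assumption)
    return len(a & b) / len(union)


# Toy 3-mers for readability; real inputs would be 21-mers.
sample_a = {"AAA", "AAC", "ACG"}
sample_b = {"AAC", "ACG", "CGT", "GTT"}
print(jaccard(sample_a, sample_b))  # 2 shared / 5 distinct = 0.4
```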
Compute the evolutionary distances. This generates a file named dist.txt in the destination directory (DIRPATH).
./mike dist -l ABSOLUTE_PATH/sketched_filelist -L ABSOLUTE_PATH/sketched_filelist -d DIRPATH
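To give a feel for how a Jaccard coefficient can be turned into an evolutionary distance, the sketch below uses the Mash distance formula, D = -(1/k) * ln(2j / (1 + j)). This is a swapped-in, well-known formula used purely for illustration; this README does not state which model ./mike dist applies, so do not expect the values to match dist.txt. The function name mash_distance is hypothetical.

```python
import math


def mash_distance(j, k=21):
    """Mash-style distance from a Jaccard coefficient j and kmer length k."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard coefficient must lie in [0, 1]")
    if j == 0.0:
        return 1.0  # no shared kmers: cap the distance at 1
    # log((1 + j) / (2j)) equals -log(2j / (1 + j)) and is never negative
    return min(1.0, math.log((1.0 + j) / (2.0 * j)) / k)


print(mash_distance(1.0))  # 0.0 for identical kmer sets
print(mash_distance(0.4))
```

The distance grows as the Jaccard coefficient shrinks, which is the general behaviour any such conversion is expected to show.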
Use the evolutionary distances (dist.txt, generated in the previous step) to construct the phylogenetic tree without branch lengths.
Rscript draw.r -f dist.txt -o dist.nwk
If the final step fails, you can construct the phylogenetic tree manually: open RStudio, install the ape package, and load the file dist.txt.
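install.packages("ape")
library(ape)
# read the tab-separated distance matrix; the first column holds the row names
dist_table <- read.csv("absolute_path/dist.txt", sep='\t', header = TRUE, row.names = 1)
treedist <- as.dist(as.matrix(dist_table))
# option 1: bionj
tree <- bionj(treedist)
# option 2: nj (this overwrites the bionj tree; keep only the method you want)
tree <- nj(treedist)
# output
write.tree(tree, "tree.nwk")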
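When the tree step fails, a quick check of dist.txt can rule out a malformed matrix. The following is a minimal sketch, assuming dist.txt is a tab-separated square matrix with a header row of sample names and the sample name in the first column, which is how the R code above reads it. The function names and the two-sample demo matrix are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Read a tab-separated matrix with a header row and row names."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    # The header may or may not start with a corner cell above the row names.
    names = header[1:] if len(header) == len(body[0]) else header
    matrix = {}
    for row in body:
        matrix[row[0]] = dict(zip(names, (float(value) for value in row[1:])))
    return names, matrix


def is_symmetric(names, matrix, tol=1e-9):
    """True if d(a, b) equals d(b, a) for every pair of samples."""
    return all(abs(matrix[a][b] - matrix[b][a]) <= tol
               for a in names for b in names)


demo = io.StringIO("\tA\tB\nA\t0.0\t0.1\nB\t0.1\t0.0\n")
names, matrix = read_distance_matrix(demo)
print(names, is_symmetric(names, matrix))  # ['A', 'B'] True
```

For the real file, replace the demo handle with open("absolute_path/dist.txt").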