MeShClust 2.0 is developed by Hani Z. Girgis, PhD.
This program clusters DNA sequences using identity scores obtained without alignment.
Copyright (C) 2021-2022 Hani Z. Girgis, PhD
Academic use: Affero General Public License version 1.
Any restrictions to use for profit or non-academics: Alternative commercial license is required.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Please contact Dr. Hani Z. Girgis (hzgirgis@buffalo.edu) if you need more information.
Please cite the following papers:
MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
and alignment-free identity scores (2022). Hani Z. Girgis, BMC Genomics, 23(1):423.
Identity: Rapid alignment-free prediction of sequence alignment identity scores using
self-supervised general linear models (2021). Hani Z. Girgis, Benjamin T. James, and
Brian B. Luczak. NAR Genom Bioinform, 13(1), lqab001.
A survey and evaluations of histogram-based statistics in alignment-free sequence
comparison (2019). Brian B. Luczak, Benjamin T. James, and Hani Z. Girgis. Briefings
in Bioinformatics, 20(4):1222–1237.
MeShClust: An intelligent tool for clustering DNA sequences (2018). Benjamin T. James,
Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83.
Database file: low_diversity_day.fasta
Output file: low_diversity_day.txt
Cores: 16
Estimating the threshold ...
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Number of standard deviations: 2
Preparing data ...
Positive examples: 9990
Training size: 4995
Validation size: 4995
Better performance of: 0.000105995
jeffrey_divergence
Better performance of: 4.05578e-05
jeffrey_divergence
correlation x correlation^2
Better performance of: 3.28931e-05
jeffrey_divergence
simMM
euclidean x cosine
euclidean x correlation
correlation x correlation^2
Better performance of: 2.8555e-05
jeffrey_divergence
simMM
chi_squared^2
minkowski^2
euclidean x cosine
euclidean x correlation
chi_squared x cosine^2
correlation x correlation^2
Selected statistics:
jeffrey_divergence
simMM
chi_squared^2
minkowski^2
euclidean x cosine
euclidean x correlation
chi_squared x cosine^2
correlation x correlation^2
Finished training.
MAE: 0.00398844
MSE: 2.8555e-05
Optimizing ...
Validating ...
MAE: 0.00411611
MSE: 3.08339e-05
Mean = 0.992131
STD = 3.33067e-16
Min = 0.992131
============================================
0.992131
Final threshold: 0.992131
Calculated threshold: 0.992131
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Floating point exception (core dumped)
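For context on the `K: 7` and `Histogram size: 16384` lines in the log above: 16384 equals 4^7, which is consistent with one counter per possible DNA 7-mer. At 32 bits per entry, a densely stored histogram would occupy 16384 × 4 bytes = 64 KiB. The sketch below illustrates that relationship in Python. It is not MeShClust's code: the function name `kmer_histogram`, the 2-bit base encoding, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration.

```python
def kmer_histogram(sequence, k=7):
    """Count k-mers of a DNA sequence into a dense list of 4**k counters.

    Illustrative only: the 2-bit encoding (A=0, C=1, G=2, T=3) and the
    handling of ambiguous bases are assumptions, not MeShClust's method.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    histogram = [0] * (4 ** k)   # 4**7 == 16384 entries for k = 7
    mask = 4 ** k - 1            # keeps only the last k bases in the index
    index = 0
    valid = 0                    # consecutive valid bases seen so far
    for base in sequence.upper():
        if base not in code:
            # Ambiguous base (e.g. N): restart the current window.
            index = 0
            valid = 0
            continue
        index = ((index << 2) | code[base]) & mask
        valid += 1
        if valid >= k:
            histogram[index] += 1
    return histogram
```

A sequence of length n with no ambiguous bases contributes n - k + 1 counts in total, so a sequence shorter than k yields an all-zero histogram under these assumptions.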
I am running MeShClust on the following FASTA file: fasta.zip
The command I am using is:
../meshclust -d file.fasta -o file.txt
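To help narrow down the crash, a quick summary of the input file may be useful alongside the log. The following is a minimal sketch in Python, independent of MeShClust. The helper names `read_fasta` and `summarize_fasta` are hypothetical, and I have not confirmed that any of the reported properties (empty records, duplicate sequences, non-ACGT characters) is what triggers the floating point exception.

```python
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(path):
    """Return basic statistics that may be worth attaching to a bug report."""
    lengths = []
    seen = Counter()   # sequence -> number of records carrying it
    non_acgt = 0       # records containing characters other than A, C, G, T
    for _, seq in read_fasta(path):
        lengths.append(len(seq))
        seen[seq] += 1
        if set(seq) - set("ACGT"):
            non_acgt += 1
    n = len(lengths)
    return {
        "sequences": n,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "mean_length": sum(lengths) / n if n else 0.0,
        "empty": sum(1 for length in lengths if length == 0),
        "distinct": len(seen),
        "non_acgt": non_acgt,
    }
```

For example, `summarize_fasta("file.fasta")` returns a dictionary that can be pasted into the report; the file name here is simply the one from the command above.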