BioinformaticsToolsmith / Identity

Floating point exception (core dumped) #16

Open matnguyen opened 1 year ago

matnguyen commented 1 year ago

I am running MeShClust on the attached FASTA file (fasta.zip) with the following command:

../meshclust -d file.fasta -o file.txt

The output is:

MeShClust 2.0 is developed by Hani Z. Girgis, PhD.

This program clusters DNA sequences using identity scores obtained without alignment.

Copyright (C) 2021-2022 Hani Z. Girgis, PhD

Academic use: Affero General Public License version 1.

Any restrictions to use for profit or non-academics: Alternative commercial license is required.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please contact Dr. Hani Z. Girgis (hzgirgis@buffalo.edu) if you need more information.

Please cite the following papers:
        MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
        and alignment-free identity scores (2022). Hani Z. Girgis, BMC Genomics, 23(1):423.

        Identity: Rapid alignment-free prediction of sequence alignment identity scores using
        self-supervised general linear models (2021). Hani Z. Girgis, Benjamin T. James, and
        Brian B. Luczak. NAR Genom Bioinform, 13(1), lqab001.

        A survey and evaluations of histogram-based statistics in alignment-free sequence
        comparison (2019). Brian B. Luczak, Benjamin T. James, and Hani Z. Girgis. Briefings
        in Bioinformatics, 20(4):1222–1237.

        MeShClust: An intelligent tool for clustering DNA sequences (2018). Benjamin T. James,
        Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83.

Database file: low_diversity_day.fasta
Output file: low_diversity_day.txt
Cores: 16
Estimating the threshold ...
Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Number of standard deviations: 2
Preparing data ...
        Positive examples: 9990
        Training size: 4995
        Validation size: 4995
Better performance of: 0.000105995
        jeffrey_divergence
Better performance of: 4.05578e-05
        jeffrey_divergence
        correlation x correlation^2
Better performance of: 3.28931e-05
        jeffrey_divergence
        simMM
        euclidean x cosine
        euclidean x correlation
        correlation x correlation^2
Better performance of: 2.8555e-05
        jeffrey_divergence
        simMM
        chi_squared^2
        minkowski^2
        euclidean x cosine
        euclidean x correlation
        chi_squared x cosine^2
        correlation x correlation^2
Selected statistics:
        jeffrey_divergence
        simMM
        chi_squared^2
        minkowski^2
        euclidean x cosine
        euclidean x correlation
        chi_squared x cosine^2
        correlation x correlation^2
Finished training.
        MAE: 0.00398844
        MSE: 2.8555e-05
Optimizing ...
Validating ...
        MAE: 0.00411611
        MSE: 3.08339e-05
Mean = 0.992131
STD = 3.33067e-16
Min = 0.992131
============================================
0.992131
Final threshold: 0.992131
Calculated threshold: 0.992131
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No

Average: 29846
K: 7
Histogram size: 16384
A histogram entry is 32 bits.
Generating data.
Floating point exception (core dumped)

ashleyp1 commented 3 days ago

I got the same error and finally figured it out! It throws this error because the estimated threshold is greater than 0.99. Going back through the README, I found the threshold parameter and realized its allowed range was the problem:

-t: Optional. Threshold identity score (between 0 & 0.99) for determining cluster membership.

You can get around this by running meshclust -d file.fasta -o file.txt -t 0.99 in cases where the estimated threshold comes back as >0.99.
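
If you run this on many files, the workaround can be scripted. Below is a rough Python sketch, not something from the MeShClust repo: it runs the command once, and if the run fails and the log shows a "Calculated threshold" above 0.99, it reruns with -t 0.99. The function names are made up for illustration, and it assumes the log lines reach stdout before the crash (output may be lost if the program buffers it when piped).

```python
import re
import subprocess

# Upper bound for -t, as quoted from the README above.
MAX_THRESHOLD = 0.99


def parse_calculated_threshold(log_text):
    """Return the value on the 'Calculated threshold:' log line, or None if absent."""
    match = re.search(
        r"^Calculated threshold:\s*([0-9.eE+-]+)\s*$", log_text, re.MULTILINE
    )
    return float(match.group(1)) if match else None


def build_command(binary, fasta, output, threshold=None):
    """Build the meshclust argument list; -t is added only when a threshold is given."""
    cmd = [binary, "-d", fasta, "-o", output]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    return cmd


def run_with_fallback(binary, fasta, output):
    """Run once; on failure with an estimated threshold > 0.99, rerun with -t 0.99."""
    first = subprocess.run(
        build_command(binary, fasta, output), capture_output=True, text=True
    )
    if first.returncode == 0:
        return first
    estimated = parse_calculated_threshold(first.stdout)
    if estimated is not None and estimated > MAX_THRESHOLD:
        # Hypothetical fallback: cap the threshold at the documented maximum.
        return subprocess.run(
            build_command(binary, fasta, output, MAX_THRESHOLD),
            capture_output=True,
            text=True,
        )
    return first
```

Usage would be something like run_with_fallback("../meshclust", "file.fasta", "file.txt"). If the first run fails for any other reason, the original result is returned unchanged so you can inspect its output.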