BioinformaticsToolsmith / Identity


How to get MeShClust v3.0.0 #10

Open nvucic opened 1 year ago

nvucic commented 1 year ago

Sorry, there must be some resource I'm missing, but I could not find the latest v3.0.0.

hani-girgis commented 1 year ago

This is the right repository. Please follow the posted instructions. Once compilation is done, you will find identity and meshclust v3.0.

sguizard commented 1 year ago

@hani-girgis Thanks for this tool. I think the confusion comes from the version the program displays when it runs. It shows MeShClust v2.0.

MeShClust 2.0 is developed by Hani Z. Girgis, PhD.

This program clusters DNA sequences using identity scores obtained without alignment.

Copyright (C) 2021-2022 Hani Z. Girgis, PhD

Academic use: Affero General Public License version 1.

Any restrictions to use for profit or non-academics: Alternative commercial license is required.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please contact Dr. Hani Z. Girgis (hzgirgis@buffalo.edu) if you need more information.

Please cite the following papers: 
    MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
    and alignment-free identity scores (2022). Hani Z. Girgis, BMC Genomics, 23(1):423.

    Identity: Rapid alignment-free prediction of sequence alignment identity scores using
    self-supervised general linear models (2021). Hani Z. Girgis, Benjamin T. James, and
    Brian B. Luczak. NAR Genom Bioinform, 13(1), lqab001.

    A survey and evaluations of histogram-based statistics in alignment-free sequence
    comparison (2019). Brian B. Luczak, Benjamin T. James, and Hani Z. Girgis. Briefings
    in Bioinformatics, 20(4):1222–1237.

    MeShClust: An intelligent tool for clustering DNA sequences (2018). Benjamin T. James,
    Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83.

Database file: mono.fasta
Output file: test.txt
Cores: 16
Provided threshold: 0.8
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No

Average: 2273
K: 5
Histogram size: 1024
A histogram entry is 16 bits.
Generating data.
Preparing data ...
    Positive examples: 10000
    Training size: 5000
    Validation size: 5000
Better performance of: 0.00155154
    jeffrey_divergence x simMM
Better performance of: 0.0012286
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.00110226
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.0010716
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.00103351
    jeffrey_divergence
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.000955872
    minkowski
    jeffrey_divergence
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.000905404
    minkowski
    jeffrey_divergence
    chi_squared x sim_ratio
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
Better performance of: 0.000880007
    minkowski
    jeffrey_divergence
    chi_squared x sim_ratio
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
    squared_chord^2 x simMM^2
Better performance of: 0.000835517
    minkowski
    jeffrey_divergence
    chi_squared x sim_ratio
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
    squared_chord^2 x sim_ratio^2
    squared_chord^2 x simMM^2
Better performance of: 0.000806042
    minkowski
    jeffrey_divergence
    chi_squared x sim_ratio
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
    sim_ratio x d2_s_r^2
    chi_squared^2 x d2_s_r^2
    squared_chord^2 x sim_ratio^2
    squared_chord^2 x simMM^2
Selected statistics:
    minkowski
    jeffrey_divergence
    chi_squared x sim_ratio
    minkowski x sim_ratio^2
    minkowski x simMM^2
    correlation x d2_s_r^2
    jeffrey_divergence x simMM
    sim_ratio x d2_s_r^2
    chi_squared^2 x d2_s_r^2
    squared_chord^2 x sim_ratio^2
    squared_chord^2 x simMM^2
Finished training.
    MAE: 0.0177417
    MSE: 0.000806042
Optimizing ...
Validating ...
    MAE: 0.0231115
    MSE: 0.00118335

Clustering ... 

Data run 1 ...
    Processed sequences: 13486
    Unprocessed sequences: 0
    Found centers: 149

Assigning ...
Finished.

Thanks for using MeShClust v2.0. Please post any questions or problems on GitHub: 
https://github.com/BioinformaticsToolsmith/Identity or email Dr. Hani Z. Girgis.
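
The mismatch in the output above can also be checked mechanically. The following is only a minimal sketch, not part of the tool: `banner_versions` and `VERSION_RE` are hypothetical names, and the rule that a version immediately followed by a colon belongs to a citation title is an assumption based on the pasted output.

```python
import re

# Matches "MeShClust 2.0" or "MeShClust v3.0" and captures the version number.
VERSION_RE = re.compile(r"MeShClust\s+v?(\d+(?:\.\d+)+)")


def banner_versions(output: str) -> dict:
    """Separate versions the program reports about itself from versions
    that only appear in citation titles (hypothetical helper)."""
    reported, cited = set(), set()
    for line in output.splitlines():
        for match in VERSION_RE.finditer(line):
            # Assumption: citation titles look like "MeShClust v3.0: <title>".
            rest = line[match.end():]
            target = cited if rest.startswith(":") else reported
            target.add(match.group(1))
    return {"reported": sorted(reported), "cited": sorted(cited)}


# A few lines copied from the output pasted above.
sample = """\
MeShClust 2.0 is developed by Hani Z. Girgis, PhD.
Please cite the following papers:
    MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
    MeShClust: An intelligent tool for clustering DNA sequences (2018).
Thanks for using MeShClust v2.0. Please post any questions or problems on GitHub:
"""

print(banner_versions(sample))
```

On these lines the sketch separates the version the program reports (2.0) from the version named in the citation (3.0), which is the discrepancy described above.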

simonorozcoarias commented 4 months ago

Hi @hani-girgis, thank you for this amazing tool. I am trying to get MeShClust 3.0 from the latest release (v2.0). However, after successfully compiling identity, MeShClust does not appear. In fact, I looked for the MeShClust source code and it is not even there.

I also tried to get MeShClust 3.0 from the master branch, but that builds identity v1.2 and MeShClust v2.0.

So, how can I get MeShClust version 3.0?

Thank you for the help.

Best,

Simon.