BioinformaticsToolsmith / Identity


core dump #6

Open linzhi2013 opened 2 years ago

linzhi2013 commented 2 years ago

Hi there,

Thanks for the tool!

When I tried MeShClust 3.0, it crashed with a core dump. Do you have any suggestions? Thank you!

The compute node of the cluster has 56 cores (112 threads) and 1.5 TB of RAM, and we did not limit how much RAM meshclust could use.

Best,
Guanliang

-rw-rw-r-- 1 gmeng 1.5G Jun 13 17:15 combined.fa
-rw-rw-r-- 1 gmeng  112 Jun 14 10:00 meshclust3.sh
-rw-r--r-- 1 gmeng 5.9K Jun 15 20:16 meshclust3.sh.o539214
-rw------- 1 gmeng  18G Jun 15 22:18 core.229599
-rw-r--r-- 1 gmeng  416 Jun 15 22:18 meshclust3.sh.e539214
$ grep -c '>' combined.fa
5652580

meshclust3.sh:

/home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6  -o out.clstr -c 80 -e y -a n -p 10

meshclust3.sh.o539214:

MeShClust v3.0 is developed by Hani Z. Girgis, PhD.

This program clusters DNA sequences using identity scores obtained without alignment.

Copyright (C) 2021-2022 Hani Z. Girgis, PhD

Academic use: Affero General Public License version 1.

Any restrictions to use for profit or non-academics: Alternative commercial license is required.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Please contact Dr. Hani Z. Girgis (hzgirgis@buffalo.edu) if you need more information.

Please cite the following papers:
    1. Identity: Rapid alignment-free prediction of sequence alignment identity scores using
    self-supervised general linear models. Hani Z. Girgis, Benjamin T. James, and Brian B.
    Luczak. NAR GAB, 3(1):lqab001, 2021.
    2. MeShClust: an intelligent tool for clustering DNA sequences. Benjamin T. James,
    Brian B. Luczak, and Hani Z. Girgis. Nucleic Acids Res, 46(14):e83, 2018.
    3. MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm
    and alignment-free identity scores. Hani Z. Girgis. A great journal. 2022.

Database file: combined.fa
Output file: out.clstr
Cores: 80
Provided threshold: 0.6
Block size for all vs. all: 25000
Block size for reading sequences: 100000
Number of data passes: 10
Can assign all: No

Average: 756
K: 4
Histogram size: 256
A histogram entry is 16 bits.
Generating data.
Preparing data ...
    Positive examples: 10000
    Training size: 5000
    Validation size: 5000
Better performance of: 0.00324074
    chi_squared x jeffrey_divergence
Better performance of: 0.00278104
    chi_squared x jeffrey_divergence
    chi_squared^2 x d2_s_r^2
Better performance of: 0.00275123
    chi_squared x jeffrey_divergence
    chi_squared^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
Better performance of: 0.00271437
    chi_squared x jeffrey_divergence
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
Better performance of: 0.00266334
    chi_squared x squared_chord
    chi_squared x jeffrey_divergence
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
    kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00263148
    squared_chord
    chi_squared x squared_chord
    chi_squared x jeffrey_divergence
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
    kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00257594
    squared_chord
    chi_squared x squared_chord
    chi_squared x jeffrey_divergence
    hellinger x hellinger^2
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
    kulczynski_2^2 x d2_s_r^2
Better performance of: 0.00249854
    squared_chord
    manhattan x simMM
    chi_squared x squared_chord
    chi_squared x jeffrey_divergence
    hellinger x hellinger^2
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
    kulczynski_2^2 x d2_s_r^2
Selected statistics:
    squared_chord
    manhattan x simMM
    chi_squared x squared_chord
    chi_squared x jeffrey_divergence
    hellinger x hellinger^2
    chi_squared^2 x d2_s_r^2
    bray_curtis^2 x d2_s_r^2
    squared_chord^2 x hellinger^2
    kulczynski_2^2 x d2_s_r^2
Finished training.
    MAE: 0.036734
    MSE: 0.00249854
Optimizing ...
Validating ...
    MAE: 0.0426102
    MSE: 0.00325363

Clustering ...

Data run 1 ...
    Processed sequences: 25000
    Unprocessed sequences: 0
    Found centers: 772
    Processed sequences: 50000
    Unprocessed sequences: 24657
    Found centers: 770
    Processed sequences: 100478
    Unprocessed sequences: 41448
    Found centers: 1278
    Processed sequences: 166024
    Unprocessed sequences: 32518
    Found centers: 2628
    Processed sequences: 206655
    Unprocessed sequences: 27580
    Found centers: 3034
    Processed sequences: 338846
    Unprocessed sequences: 65658
    Found centers: 3620
    Processed sequences: 348903
    Unprocessed sequences: 50307
    Found centers: 4308
    Processed sequences: 414183
    Unprocessed sequences: 67888
    Found centers: 4653
    Processed sequences: 428889
    Unprocessed sequences: 56801
    Found centers: 5147
    Processed sequences: 473924
    Unprocessed sequences: 66571
    Found centers: 5560
    Processed sequences: 591912
    Unprocessed sequences: 101368
    Found centers: 6457
    Processed sequences: 599863
    Unprocessed sequences: 83946
    Found centers: 6943
    Processed sequences: 682732
    Unprocessed sequences: 112078
    Found centers: 7277
    Processed sequences: 694499
    Unprocessed sequences: 97930
    Found centers: 7757
    Processed sequences: 752209
    Unprocessed sequences: 114752
    Found centers: 8067
    Processed sequences: 767163
    Unprocessed sequences: 94407
    Found centers: 8447
    Processed sequences: 867163
    Unprocessed sequences: 141679
    Found centers: 8792
    Processed sequences: 875812
    Unprocessed sequences: 125026
    Found centers: 9248
    Processed sequences: 950986
    Unprocessed sequences: 155363
    Found centers: 9586
    Processed sequences: 962281
    Unprocessed sequences: 137454
    Found centers: 10001
    Processed sequences: 1050620
    Unprocessed sequences: 173768
    Found centers: 10430
    Processed sequences: 1060816
    Unprocessed sequences: 156809
    Found centers: 10884
    Processed sequences: 1138833
    Unprocessed sequences: 189905
    Found centers: 11240
    Processed sequences: 1219898
    Unprocessed sequences: 191996
    Found centers: 12162
    Processed sequences: 1234377
    Unprocessed sequences: 173682
    Found centers: 12615
    Processed sequences: 1328038
    Unprocessed sequences: 210768
    Found centers: 13095
    Processed sequences: 1338108
    Unprocessed sequences: 194114
    Found centers: 13563
    Processed sequences: 1413309
    Unprocessed sequences: 217638
    Found centers: 13916
    Processed sequences: 1426200
    Unprocessed sequences: 203726
    Found centers: 14366
    Processed sequences: 1482720
    Unprocessed sequences: 217439
    Found centers: 14648
    Processed sequences: 1549592
    Unprocessed sequences: 216905
    Found centers: 15453
    Processed sequences: 1566431
    Unprocessed sequences: 205939
    Found centers: 15909
    Processed sequences: 1610994
    Unprocessed sequences: 211989
    Found centers: 16228

meshclust3.sh.e539214:

Mean 1 (mean1) and Mean 2 (mean2) cannot be zeros. Mean 1 is: 0, mean 2 is: 0.226562

terminate called after throwing an instance of 'std::exception'
  what():  std::exception
/opt/gridengine/default/spool/compute-0-0/job_scripts/539214: line 1: 229599 Aborted                 (core dumped) /home/gmeng/soft/MeShClust_v3/Identity/bin/meshclust -d combined.fa -t 0.6 -o out.clstr -c 80 -e y -a n -p 10

hani-girgis commented 2 years ago

Hi, Guanliang.

Thanks for your interest in MeShClust.

I suspect that one of the sequences in combined.fa is too short or has many uncertain nucleotides, e.g., N. Can you please verify and let me know?

Best regards.

Hani Z. Girgis, PhD
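
One way to run the check suggested above is to scan combined.fa for very short records and for records made up mostly of ambiguous bases. The following is only a minimal sketch and is not part of MeShClust: the function names and the thresholds (`min_len=50`, `max_ambiguous=0.5`) are illustrative assumptions, and the cutoff at which MeShClust actually fails is not stated in this thread. Since the log reports `K: 4`, a sequence shorter than 4 bases cannot contain a single k-mer, so such records are worth looking for first.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def suspicious(seq, min_len=50, max_ambiguous=0.5):
    """Return a reason string if seq looks problematic, else None.

    min_len and max_ambiguous are illustrative thresholds (assumptions),
    not values taken from MeShClust.
    """
    # Treat empty records as short so the division below is always safe.
    if len(seq) < max(min_len, 1):
        return f"short ({len(seq)} bp)"
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    frac = ambiguous / len(seq)
    if frac > max_ambiguous:
        return f"ambiguous ({frac:.0%} non-ACGT)"
    return None


def scan(handle, **thresholds):
    """Return a list of (header, reason) for every suspicious record."""
    flagged = []
    for header, seq in read_fasta(handle):
        reason = suspicious(seq, **thresholds)
        if reason is not None:
            flagged.append((header, reason))
    return flagged


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        for header, reason in scan(fh):
            print(f"{header}\t{reason}")
```

If the block is saved as, say, `check_fasta.py` (a hypothetical file name), it can be run as `python check_fasta.py combined.fa`; it prints one tab-separated line per flagged record. Removing the flagged records and rerunning meshclust would show whether they are what triggers the "Mean 1 ... cannot be zeros" error.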