TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

Windows fatal exception: stack overflow when the sample goes over a certain number of points #19

Open fasensior opened 5 months ago

fasensior commented 5 months ago

Hey!

I just ran into a crash when trying to cluster over 1 million objects with fast_hdbscan. My full file holds around 2.5M objects, and I only use two of its columns. I load it with pandas.read_csv and make sure the columns are float64. If I subsample the dataset to below about 1.1M objects everything works fine (and fast), but past that point it just crashes. The full error it returns is

Windows fatal exception: stack overflow

Thread 0x0000d814 (most recent call first):
  File "C:\Users\fasen\anaconda3\Lib\site-packages\zmq\utils\garbage.py", line 47 in run
  File "C:\Users\fasen\anaconda3\Lib\threading.py", line 1038 in _bootstrap_inner
  File "C:\Users\fasen\anaconda3\Lib\threading.py", line 995 in _bootstrap

Main thread:
Current thread 0x00010ba0 (most recent call first):
  File "C:\Users\fasen\anaconda3\Lib\site-packages\fast_hdbscan\hdbscan.py", line 168 in fast_hdbscan
  File "C:\Users\fasen\anaconda3\Lib\site-packages\fast_hdbscan\hdbscan.py", line 236 in fit
  File "c:\users\fasen\documents\universidad\master\1er_semestre\tecniques\p6\code.py", line 91 in <module>
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\py3compat.py", line 356 in compat_exec
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 473 in exec_code
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 615 in _exec_file
  File "C:\Users\fasen\anaconda3\Lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 528 in runfile
  File "C:\Users\fasen\AppData\Local\Temp\ipykernel_68656\701512081.py", line 1 in <module>

Restarting kernel...
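In case it helps to reproduce this without my file, I would expect a purely synthetic script along these lines to hit the same path (assuming the crash depends only on the number of points, which I have not verified):

import numpy as np
import fast_hdbscan

# Two float64 columns, matching the shape of my real input
rng = np.random.default_rng(42)
X = rng.normal(size=(1_300_000, 2))

# My real data crashes above roughly 1.1M rows and is fine below
clustering = fast_hdbscan.HDBSCAN(min_cluster_size=20).fit(X)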

I also attach the code:

from sklearnex import patch_sklearn
patch_sklearn()

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import time
import matplotlib as mpl
from matplotlib.colors import Normalize, LogNorm
from sklearn.cluster import DBSCAN, HDBSCAN
import fast_hdbscan

mpl.rcParams['figure.dpi'] = 400

dataset = pd.read_csv("C:/Users/fasen/Documents/Universidad/master/1er_semestre/tecniques/p6/data_gaia_edr3_reduced.csv",
                      header=0, dtype=np.float64)
n = 1300000  # subsamples below ~1.1M work fine, larger ones crash
X_train = dataset.sample(n)
X_train = X_train[['VR', 'Vphi']]
# =============================================================================
# clustering = DBSCAN(eps=0.5, min_samples=4,algorithm='ball_tree', metric='haversine').fit(X_train)
# DBSCAN_dataset = X_train.copy()
# DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_ 
# =============================================================================
start = time.time()
clustering = fast_hdbscan.HDBSCAN(min_cluster_size=20).fit(X_train)
finish = time.time()
print('Computation time for ', n, ' samples of the total: ', finish - start, ' s')
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:, 'Cluster'] = clustering.labels_
DBSCAN_dataset.Cluster.value_counts().to_frame()
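Since the deepest frame in the main thread is inside fast_hdbscan itself, I wonder whether the overflow comes from deep recursion in the library. If that guess is right, would running the fit in a worker thread with an enlarged stack be a reasonable workaround? A rough, untested sketch of what I mean:

import threading

result = {}

def fit_in_thread():
    # the worker thread created below inherits the enlarged stack
    result['model'] = fast_hdbscan.HDBSCAN(min_cluster_size=20).fit(X_train)

threading.stack_size(512 * 1024 * 1024)  # 512 MiB, far above the ~1 MiB Windows default
worker = threading.Thread(target=fit_in_thread)
worker.start()
worker.join()
clustering = result['model']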

The file I am using is a copy of the Gaia EDR3 data, in case that is important.

Thanks!!