Closed iago-pssjd closed 10 months ago
Thanks for reporting these issues. Some comments:
As said, apcluster() cannot handle data sets of that size. The problem is that an N x N similarity matrix needs to be created (and computed). 700,000^2 = 4.9e11, which is larger than the largest index you can express with a 32-bit integer (about 2.1e9). That is why simpleDistR() runs into an overflow and why it cannot allocate a vector of that size.
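To make the size argument concrete, the following base R sketch redoes the arithmetic. The 8 bytes per entry assume a dense matrix of doubles, which is an assumption about how the similarity matrix is stored, and the variable names are purely illustrative.

```r
# Number of entries in a full N x N similarity matrix for N = 700,000
n <- 700000
n_entries <- n^2                      # computed in double precision: 4.9e11

# Largest value a 32-bit signed integer can hold (2147483647 in R)
int_max <- .Machine$integer.max
exceeds_int <- n_entries > int_max    # TRUE: the count does not fit

# Pure integer arithmetic overflows; R returns NA (with a warning)
overflowed <- suppressWarnings(700000L * 700000L)

# Rough memory footprint, assuming 8 bytes per entry (dense doubles)
size_tb <- n_entries * 8 / 1e12       # in terabytes

print(c(entries = n_entries, int_max = int_max, size_tb = size_tb))
print(is.na(overflowed))
```

Even ignoring the index overflow, a matrix of that size would not fit into the memory of an ordinary machine.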
apclusterL() uses only random subsets of samples to consider as potential exemplars. In your case, you use frac=0.01 with l=50000 samples, which means that you use 1% of 50,000 samples, i.e. 500 samples, as potential exemplars. You perform this subsampling three times (sweeps=3). In any case, I would increase sweeps to get a better solution; frac seems fine. Depending on how clear the cluster structure is, the algorithm might find a good solution. If the cluster structure is not clear or if the subsample does not contain good exemplars, then apclusterL() might have trouble finding the right solution in the given number of iterations. The options you have are:
- Decrease convits (then the algorithm is satisfied with the solution earlier and does not require the solution to remain stable for 100 iterations, which is the default).
- Increase maxits (then the algorithm has more time to find the right solution).
- Increase frac (then the algorithm has a wider choice of potential exemplars; note that this only works if the data set is small enough; for 50,000, however, that should not be an issue).
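As a rough sketch of how these options map onto a call, the example below first recomputes the number of potential exemplars for frac=0.01 and l=50000, then runs apclusterL() on a small synthetic data set with adjusted settings. The data, the negDistMat(r = 2) similarity, and the specific parameter values are illustrative assumptions, not recommendations for the data set in question. The clustering call is skipped if the apcluster package is not installed.

```r
# Number of potential exemplars per sweep for the settings discussed above
l <- 50000
frac <- 0.01
n_candidates <- frac * l          # 1% of 50,000 samples
print(n_candidates)

# Sketch only: small synthetic data stand in for the real data set
if (requireNamespace("apcluster", quietly = TRUE)) {
  library(apcluster)
  set.seed(1)

  # Two well-separated 2-D Gaussian blobs, 500 points each (hypothetical data)
  x <- rbind(matrix(rnorm(1000, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(1000, mean = 3, sd = 0.3), ncol = 2))

  res <- apclusterL(s = negDistMat(r = 2), x = x,
                    frac = 0.1,     # larger share of samples as potential exemplars
                    sweeps = 5,     # more rounds of subsampling
                    maxits = 2000,  # allow more iterations
                    convits = 50)   # accept a solution after fewer stable iterations

  # Number of clusters (exemplars) found
  print(length(res@exemplars))
}
```

Which of these settings helps most will depend on the data; changing one at a time makes it easier to see its effect.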
Hi,
I was trying to apply apcluster to a dataset of more than 700000 observations. I saw later that for large datasets I should use apclusterL. In any case I get:
This happens when calling CdistR in https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L30
For a dataset of 50000 (debugging just simpleDist), this function works and I get:
but later it fails with https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L33
Although trying the 50000 dataset in apclusterL, it seems that R hangs before executing simpleDist. With frac = 0.01, sweeps = 3 I get:
Is there anything I can do?
Thanks!