UBod / apcluster

R package implementing affinity propagation clustering along with various utilities
https://github.com/UBod/apcluster

Trying to analyze large datasets #6

Closed iago-pssjd closed 10 months ago

iago-pssjd commented 10 months ago

Hi,

I was trying to apply apcluster to a dataset of more than 700000 observations. I saw later that for large datasets I should use apclusterL. In any case, I get

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': negative length vectors are not allowed

This happens when calling CdistR in https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L30

For a dataset of 50000 (debugging just simpleDist), this function works and I get [screenshot: "Untitled"]

but later it fails with https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L33

Error: cannot allocate vector of size 9.3 Gb

However, when I try the 50000 dataset with apclusterL, R seems to hang before executing simpleDist. With frac = 0.01, sweeps = 3 I get

Warning message:
In apclusterL.matrix(s = sim, sel = sel, p = p, q = q, maxits = maxits,  :
  algorithm did not converge; turn on details
and call plot() to monitor net similarity. Consider
increasing 'maxits' and 'convits', and, if oscillations occur,
also increasing damping factor 'lam'.

Is there anything I can do?

Thanks!

UBod commented 10 months ago

Thanks for reporting these issues. Some comments:

As said, apcluster() cannot handle data sets of that size. The problem is that an NxN similarity matrix needs to be created (and computed). 700,000^2 = 4.9e11, which is far larger than the largest value you can express with a 32-bit integer (2^31 - 1, about 2.1e9). That is why simpleDistR() runs into an overflow and why it cannot allocate a vector of that size.
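To make the size argument concrete, here is a rough back-of-the-envelope sketch in base R. It only does the arithmetic; the variable names are made up for illustration, and the assumption that the 9.3 Gb request corresponds to the lower triangle of the distance matrix stored as doubles is a guess based on the numbers, not something checked against the package code.

```r
# Number of cells in a dense N x N similarity matrix for the two
# data set sizes mentioned in this thread (illustrative names).
n_full   <- 700000   # full data set
n_sample <- 50000    # reduced data set

cells_full   <- n_full^2     # computed as double, so no overflow here
cells_sample <- n_sample^2

# Largest value of R's 32-bit integer type (2^31 - 1)
int_max <- .Machine$integer.max

# Both cell counts exceed the 32-bit integer range
exceeds_full   <- cells_full   > int_max
exceeds_sample <- cells_sample > int_max

# The same product in 32-bit integer arithmetic overflows to NA
overflow <- suppressWarnings(700000L * 700000L)

# Memory for double-precision storage (8 bytes per value), in GiB
full_gib <- cells_full * 8 / 2^30

# Assumption: only the lower triangle, n * (n - 1) / 2 doubles,
# is stored for the 50,000-sample case
tri_gib <- n_sample * (n_sample - 1) / 2 * 8 / 2^30

print(c(exceeds_full = exceeds_full, exceeds_sample = exceeds_sample))
print(is.na(overflow))
print(round(c(full_gib = full_gib, tri_gib = tri_gib), 1))
```

Under that assumption, the lower triangle for 50,000 samples comes out at roughly 9.3 GiB, which is in line with the allocation error quoted above, while the full matrix for 700,000 samples would need several thousand GiB.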

apclusterL() considers only random subsets of the samples as potential exemplars. In your case, frac=0.01 with l=50000 samples means that 1% of the 50,000 samples, i.e. 500 samples, are considered as potential exemplars. This subsampling is performed three times (sweeps=3). I would increase sweeps to get a better solution; frac seems fine. Depending on how clear the cluster structure is, the algorithm might find a good solution. If the cluster structure is not clear or if the subsample does not contain good exemplars, then apclusterL() might have trouble finding the right solution in the given number of iterations. The options you have are: