Trying to analyze large datasets

UBod / apcluster

R package implementing affinity propagation clustering along with various utilities


Hi,

I was trying to apply apcluster to a dataset of more than 700000 observations. I later saw that for large datasets I should use apclusterL. Either way, I get

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': negative length vectors are not allowed

This happens when calling CdistR in https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L30

For a dataset of 50000 (debugging just simpleDist), this function works and I get a result (screenshot "Sin título", not preserved)

but later it fails at https://github.com/UBod/apcluster/blob/4109da5cd83678e01c4103de73cc788b7614e193/R/simpleDist.R#L33 with

Error: cannot allocate vector of size 9.3 Gb

However, trying the 50000 dataset in apclusterL ~~it seems that R hangs before executing simpleDist.~~ with frac = 0.01, sweeps = 3, I get

Warning message:
In apclusterL.matrix(s = sim, sel = sel, p = p, q = q, maxits = maxits,  :
  algorithm did not converge; turn on details
and call plot() to monitor net similarity. Consider
increasing 'maxits' and 'convits', and, if oscillations occur,
also increasing damping factor 'lam'.

Is there anything I can do?

Thanks!

Thanks for reporting these issues. Some comments:

As you noted, apcluster() cannot handle data sets of that size. The problem is that an NxN similarity matrix needs to be created (and computed). 700,000^2 = 4.9e11, which is far larger than the largest index you can express with a 32-bit integer (2^31 - 1, about 2.1e9). That is why simpleDistR() runs into an overflow and why it cannot allocate a vector of that size.
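To make the arithmetic above concrete, here is a minimal base-R sketch. The helper name `sim_matrix_cost` is hypothetical, and the figure of 8 bytes per entry assumes the matrix is stored as R doubles; the actual storage used inside apcluster may differ.

```r
# Rough cost of a dense n x n similarity matrix.
# bytes_per_entry = 8 is an assumption (R's double-precision numeric type).
sim_matrix_cost <- function(n, bytes_per_entry = 8) {
  entries <- as.numeric(n)^2  # compute in doubles so the product cannot overflow
  list(
    entries       = entries,
    exceeds_int32 = entries > .Machine$integer.max,  # 2^31 - 1
    gib           = entries * bytes_per_entry / 2^30
  )
}

big   <- sim_matrix_cost(700000)  # the full data set
small <- sim_matrix_cost(50000)   # the 50,000-row subsample

cat("700000 rows:", format(big$entries, scientific = TRUE), "entries\n")
cat("50000 rows: ", format(small$entries, scientific = TRUE), "entries\n")
cat("both exceed the 32-bit index range:",
    big$exceeds_int32 && small$exceeds_int32, "\n")
```

Even the 50,000-row case has more entries than a 32-bit integer can index, so a dense similarity matrix is only feasible for considerably smaller inputs.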

apclusterL() considers only random subsets of the samples as potential exemplars. In your case, you use frac=0.01 with l=50000 samples, which means that 1% of the 50,000 samples, i.e. 500 samples, are considered as potential exemplars. You perform this subsampling three times (sweeps=3). In any case, I would increase sweeps to get a better solution; frac seems fine. Depending on how clear the cluster structure is, the algorithm might find a good solution. If the cluster structure is not clear or if the subsample does not contain good exemplars, then apclusterL() may have trouble finding the right solution in the given number of iterations. The options you have are:

- Decrease convits (then the algorithm is satisfied with the solution earlier and does not require the solution to remain stable for 100 iterations, which is the default).
- Increase maxits (then the algorithm has more time to find the right solution).
- Increase frac (then the algorithm has a wider choice of potential exemplars; note that this only works if the data set is small enough; for 50,000, however, that should not be an issue).
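The options above could be combined in a call like the following sketch. The toy data and the specific values (frac = 0.1, sweeps = 5, maxits = 2000, convits = 50) are arbitrary illustrations, not tuned recommendations, and the use of negDistMat(r = 2) as the similarity measure is an assumption, since the original post does not say which similarity was used. The call is guarded so the script still runs when apcluster is not installed.

```r
# Toy data: two well-separated 2-D blobs of 200 points each (illustrative only).
set.seed(42)
x <- rbind(matrix(rnorm(400, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(400, mean = 2, sd = 0.3), ncol = 2))

frac <- 0.1
n_candidates <- round(frac * nrow(x))  # rough number of candidate exemplars per sweep

if (requireNamespace("apcluster", quietly = TRUE)) {
  library(apcluster)
  res <- apclusterL(s = negDistMat(r = 2), x = x,
                    frac = frac,     # larger pool of candidate exemplars
                    sweeps = 5,      # more rounds of subsampling
                    maxits = 2000,   # more iterations before giving up
                    convits = 50,    # accept a solution that is stable for fewer iterations
                    lam = 0.9)       # damping factor; raise it if oscillations occur
  cat("clusters found:", length(res@exemplars), "\n")
}
```

If the non-convergence warning persists, the warning text itself suggests turning on details and calling plot() on the result to monitor net similarity, which helps tell slow convergence apart from oscillation.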


Trying to analyze large datasets #6