djsutherland / skl-groups

scikit-learn addon to operate on set/"group"-based features
BSD 3-Clause "New" or "Revised" License

inf and nan values when linear and l2 divergences are used #32

Closed kayhan-batmanghelich closed 8 years ago

kayhan-batmanghelich commented 8 years ago

Hi @dougalsutherland ,

Thanks for sharing your code; it is well documented and well written.

I am working on a problem and comparing different divergences. KL and Hellinger already produce good results, but I am interested in computing the linear affinity for a different purpose. Unfortunately, more than 99% of the computed affinity values are infinity and the rest are very large numbers. Do you know why that is and whether it can be resolved? The L2 distance also produces a lot of inf and nan values.

Thanks, Kayhan

djsutherland commented 8 years ago

Hi Kayhan,

Thanks!

The L2 implementation uses the linear results (||p - q||^2 = <p, p> + <q, q> - 2 <p, q>), so if linear is coming out nan/inf then it's not surprising that L2 is too.
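(As a minimal sketch of why that propagation happens, not of the skl-groups internals: once any of the inner-product estimates is inf, combining them produces inf or nan.)

import numpy as np

# Hypothetical helper: assemble the squared L2 distance from estimates of
# <p,p>, <q,q>, <p,q>; an inf inner-product estimate gives inf - inf = nan.
def l2_sq_from_linear(pp, qq, pq):
    return pp + qq - 2.0 * pq

print(l2_sq_from_linear(np.inf, 1.0, np.inf))  # nan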

What kind of dimensionality / numbers of points per set are you working with? It'd also help to have a small dataset that reproduces the problem, to look into it further.

kayhan-batmanghelich commented 8 years ago

Hi Douglas,

The bag sizes vary from 70 to 400, and the feature vectors are 63-dimensional. The results are pretty reasonable for KL and the other alpha divergences (>0.8), but linear and L2 produce nan/inf.

Since my pipeline is a bit complicated, I thought I would replicate the problem on a public dataset (MNIST). Here is the code, but there is something wrong with this example and I cannot figure it out:

from __future__ import print_function

import numpy as np
from matplotlib import pylab
from sklearn.pipeline import Pipeline
from skl_groups.divergences import KNNDivergenceEstimator
from skl_groups.kernels import PairwisePicker, Symmetrize, RBFize, ProjectPSD
from skl_groups.features import Features
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata("MNIST original")
X, y = mnist.data / 255., mnist.target
bags = []
for i in xrange(0, 70000, 1000):
    bags.append(X[i:i+1000])
feats = Features(bags)

knnDiv = KNNDivergenceEstimator(div_funcs=['hellinger'], Ks=[3], n_jobs=-1)
D = knnDiv.fit_transform(feats);D = D.squeeze()

It complains:

/Users/kayhan/anaconda/bin/python /Users/kayhan/Projects/warpingVB/src/test_linearDiv.py
/Users/kayhan/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.py:394: UserWarning: Using 'slow' version of KNNDivergenceEstimator,  because skl_groups_accel isn't available; its 'fast' version is much faster on large problems. Pass version='slow' to suppress this warning. 
  No module named skl_groups_accel.knn_divs
  "warning. \n  {}".format(fast_version_error))
Traceback (most recent call last):
  File "/Users/kayhan/Projects/warpingVB/src/test_linearDiv.py", line 25, in <module>
    D = knnDiv.fit_transform(feats);D = D.squeeze()
  File "/Users/kayhan/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Users/kayhan/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.py", line 310, in fit
    self.indices_ = id = memory.cache(_build_indices)(X, self._flann_args())
  File "/Users/kayhan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 281, in __call__
    return self.func(*args, **kwargs)
  File "/Users/kayhan/anaconda/lib/python2.7/site-packages/skl_groups/divergences/knn.py", line 410, in _build_indices
    idx.build_index(bag)
  File "/Users/kayhan/anaconda/lib/python2.7/site-packages/pyflann/index.py", line 159, in build_index
    raise FLANNException('Cannot handle type: %s' % pts.dtype)
pyflann.exceptions.FLANNException: Cannot handle type: object

When I trace it, apparently the datatype of feats changes to object during the call to fit. Anyway, since this doesn't work and we cannot replicate the nan/inf issue with it, I will have to go back and find a simple dataset that replicates my problem.

Thanks,

djsutherland commented 8 years ago

I'm not sure why this is happening, and will take a look soon.

In any case, you could probably post a subset of your data without having to reproduce the pipeline by doing

subset = feats[:10]  # assuming the problem happens in the first 10 bags
import cPickle as pickle
with open('subset.pkl', 'wb') as f:
    pickle.dump(subset, f)

and then posting the pickle file somewhere (e.g. a gist).
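(For completeness, the receiving end would read the subset back with the matching load call; a tiny sketch assuming the same 'subset.pkl' filename:)

import cPickle as pickle

with open('subset.pkl', 'rb') as f:
    subset = pickle.load(f)  # the Features subset dumped above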

kayhan-batmanghelich commented 8 years ago

Hi Douglas,

Here is an example of data: https://www.dropbox.com/s/kzq8goj6jg505qc/save.p?dl=0

all entries of the kernel are inf:

feats = Features(bags)

In [20]: knnDiv = KNNDivergenceEstimator(div_funcs=['linear'],Ks=[3], n_jobs=-1)

In [21]: K = knnDiv.fit_transform(feats)
/home/batmanghelich/anaconda2/lib/python2.7/site-packages/skl_groups/divergences/knn.py:394: UserWarning: Using 'slow' version of KNNDivergenceEstimator,  because skl_groups_accel isn't available; its 'fast' version is much faster on large problems. Pass version='slow' to suppress this warning.
  No module named skl_groups_accel.knn_divs
  "warning. \n  {}".format(fast_version_error))

In [22]: K
Out[22]:
array([[[[ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf],
         [ inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf,  inf]]]], dtype=float32)

Thanks,

djsutherland commented 8 years ago

Kayhan,

Sorry it took me so long to get to this; I have a lot going on right now. Tracking it down, the problem is due to an overflow: it's trying to raise numbers between (in this case) 0.05 and 0.35 to the power of -63, getting results between roughly 10^30 and 10^80, and overflowing float32.
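A quick way to see the overflow (this just illustrates the float32 limit, not the estimator itself):

import numpy as np

# float32 tops out around 3.4e38, so raising a small kNN distance to the
# power -63 (minus the dimension) overflows to inf; float64 can represent
# the value, but it is astronomically large either way.
print(np.float32(0.05) ** np.float32(-63))   # inf (overflow)
print(np.float64(0.05) ** -63)               # ~9.2e+81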

I have a partial implementation of doing these computations in log space (on master) and of doing the computations in float64s instead of 32s (on a branch, since I'm running into weird problems with it). But in any case, the estimated inner products are still going to be absurdly large and not useful to you.
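As a minimal sketch of the log-space idea (my own simplified stand-in, not the actual skl-groups implementation): instead of averaging r**power directly, keep everything in logs and only exponentiate, if at all, at the end.

import numpy as np
from scipy.special import logsumexp

# Hypothetical helper: log of the mean of r**power over kNN distances r,
# computed without ever forming the huge intermediate values.
def log_mean_of_powers(r, power):
    r = np.asarray(r, dtype=np.float64)
    return logsumexp(power * np.log(r)) - np.log(len(r))

# Distances like those in this issue, raised to the -63rd power:
print(log_mean_of_powers([0.05, 0.2, 0.35], -63))  # ~187.6, i.e. about 10^81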

Basically, the linear kNN estimator is just really bad in high dimensions....

This problem is essentially equivalent to density estimation: it's doing

\int p(x) q(x) dx = E_{X ~ p} q(X) = E_{Y ~ q} p(Y)

High-dimensional density estimation is hard. But if you find a density estimator that makes sense for your problem (perhaps a sparse nonparametric graphical model or an infinite-dimensional exponential family), you can estimate \int p(x) q(x) dx by fitting it to one sample and evaluating it on the samples from the other set.
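As a minimal sketch of that last suggestion, using a plain scikit-learn KDE as the stand-in density model (KDE also struggles in high dimensions, so this is only to illustrate the "fit on one sample, evaluate on the other" idea; the function name and bandwidth are my own choices):

import numpy as np
from sklearn.neighbors import KernelDensity

def linear_affinity_via_density(X_p, X_q, bandwidth=1.0):
    # Estimate \int p(x) q(x) dx = E_{X ~ p} q(X) = E_{Y ~ q} p(Y) by
    # fitting a density model to each sample, averaging it over the
    # other sample, and symmetrizing the two estimates.
    kde_p = KernelDensity(bandwidth=bandwidth).fit(X_p)
    kde_q = KernelDensity(bandwidth=bandwidth).fit(X_q)
    est_pq = np.exp(kde_q.score_samples(X_p)).mean()  # E_{X ~ p} q(X)
    est_qp = np.exp(kde_p.score_samples(X_q)).mean()  # E_{Y ~ q} p(Y)
    return 0.5 * (est_pq + est_qp)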

kayhan-batmanghelich commented 8 years ago

@dougalsutherland Thank you for your reply. Those are excellent references; thank you for sharing.

Yes, that's right. This is why I was curious to see whether I could exploit any structure in the similarity between bags. I guess it was too good to be true :)

BTW, I will be local. If you are still in Pittsburgh and have time and fancy a coffee, we can chat about research.

Thank you for sharing your code, it is a great package.

djsutherland commented 8 years ago

@kayhan-batmanghelich Sure, I'll be around in Pittsburgh through mid-September (then I'm moving to London). Shoot me an email (on my profile page) when you're here and I'd love to chat about what you're doing.