garydoranjr / misvm

Multiple-Instance Support Vector Machines
BSD 3-Clause "New" or "Revised" License

Same sign for all instance-level predictions #16

Open vista-analytics opened 6 years ago

vista-analytics commented 6 years ago

I generated some synthetic data from the 20 Newsgroups corpus to run experiments with mi-SVM and MI-SVM. I noticed that if I predict labels at the instance level (it actually happens at the bag level too, for me), all predictions share the same sign (either positive or negative, depending on the data). The AUC looks good, which means the ranking is correct. I suspect this might be due to a library version issue. Has anyone come across the same problem? Or, which library versions should we be using? Thanks!
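
For context, here is roughly how I run the check. This is just a sketch with a tiny synthetic stand-in for my real bags (not my actual 20 Newsgroups features), but the misvm calls are the ones I use:

import numpy as np
from sklearn.metrics import roc_auc_score
import misvm

# Tiny synthetic stand-in: each bag is a (5 instances x 10 features) array,
# positive bags get a shifted cluster; labels are in {-1, +1} as in example.py.
rng = np.random.RandomState(0)
def make_bag(positive):
    return rng.randn(5, 10) + (1.5 if positive else 0.0)

train_bags = [make_bag(i % 2 == 0) for i in range(20)]
train_labels = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(20)])
test_bags = [make_bag(i % 2 == 0) for i in range(10)]
test_labels = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(10)])

classifier = misvm.miSVM(kernel='linear', C=1.0, max_iters=10)
classifier.fit(train_bags, train_labels)

scores = classifier.predict(test_bags)  # real-valued bag-level scores
print('Sign counts:', dict(zip(*np.unique(np.sign(scores), return_counts=True))))
print('Bag-level AUC:', roc_auc_score(test_labels, scores))  # ranking looks fine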

garydoranjr commented 6 years ago

I have not observed this behavior before; in the past, I have seen bag-level accuracies above chance level, which would not be possible if all bags received the same predicted sign. Maybe it has something to do with how the synthetic data is generated. Are the classes highly imbalanced at the bag level?
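
For example, something like this would show the split (bag_labels is a placeholder for your bag-level label vector):

import numpy as np

bag_labels = np.array([1, 1, 1, -1, 1, 1, -1, 1])  # placeholder labels in {-1, +1}
values, counts = np.unique(np.sign(bag_labels), return_counts=True)
print(dict(zip(values, counts)))  # number of bags per class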

vista-analytics commented 6 years ago

Thank you, Gary, for the comment. I don't think it's the data, since I ran into the same problem when running the example.py code on the musk1 dataset. In that script, I added a new classifier, miSVM(kernel='linear', C=1.0, max_iters=10). It terminates after iteration 1 and reports Class Changes: 0. When I added a print(svm._predictions) statement in the miSVM source code, it showed that all predictions are positive and the values are very close to each other, which explains why there are no class changes and the loop stops after one iteration. The same thing happens on my synthetic data, so I suspect it might be a library version issue. Your insights are highly appreciated!
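
Concretely, the change looks roughly like this (the classifiers dict mirrors how example.py registers its classifiers; the debug print goes inside the library's miSVM training loop, not in example.py):

import misvm

# Added alongside the other classifiers trained by example.py:
classifiers = {}
classifiers['miSVM'] = misvm.miSVM(kernel='linear', C=1.0, max_iters=10)

# Debug line I added inside miSVM's training loop in the library source,
# right after the per-iteration SVM has been fit (svm is that SVM object):
#     print(svm._predictions)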

garydoranjr commented 6 years ago

I tried to replicate the results you are getting. I added miSVM to the example.py script and it runs with the following output:

$ ./example.py
Non-random start...

Iteration 1...
Training SVM...
     pcost       dcost       gap    pres   dres
 0: -4.7135e+01 -1.9465e+00  3e+03  5e+01  7e-09
 1: -6.9802e-01 -1.9425e+00  3e+01  6e-01  7e-09
 2: -2.2093e-01 -1.6692e+00  5e+00  6e-02  8e-10
 3: -1.4628e-01 -1.1486e+00  2e+00  3e-02  3e-10
 4: -9.0601e-02 -6.0664e-01  9e-01  9e-03  1e-10
 5: -4.3735e-02 -3.0183e-01  4e-01  3e-03  4e-11
 6: -2.4291e-02 -1.3206e-01  2e-01  1e-03  2e-11
 7: -1.5611e-02 -4.3816e-02  4e-02  2e-04  1e-11
 8: -1.6996e-02 -2.3812e-02  8e-03  3e-05  8e-12
 9: -1.7919e-02 -1.9351e-02  2e-03  4e-06  8e-12
10: -1.8219e-02 -1.8410e-02  2e-04  4e-07  8e-12
11: -1.8269e-02 -1.8278e-02  9e-06  2e-08  8e-12
12: -1.8272e-02 -1.8272e-02  4e-07  6e-10  9e-12
13: -1.8272e-02 -1.8272e-02  3e-08  4e-11  8e-12
Optimal solution found.
Recomputing classes...
Class Changes: 0
Test labels: [ 1. -1. -1. -1.  1.  1. -1.  1.  1.  1.]
Predictions: [ 1. -1. -1. -1.  1.  1.  1.  1.  1.  1.]

miSVM Accuracy: 90.0%

So it also finishes after one iteration, but I get predictions on the test set that are not all of the same sign. I added the following lines to print those out:

         predictions = classifier.predict(test_bags)
+        print('Test labels: %s' % str(test_labels))
+        print('Predictions: %s' % (np.sign(predictions)))
         accuracies[algorithm] = np.average(test_labels == np.sign(predictions))

Here is the list of packages I have installed, with their versions:

alabaster==0.7.6
altgraph==0.15
appdirs==1.4.3
attrs==17.4.0
Babel==2.5.3
backports==1.0
backports-abc==0.5
backports.functools-lru-cache==1.2.1
backports.ssl-match-hostname==3.5.0.1
basemap==1.0.7
Beaker==1.8.1
beautifulsoup4==4.6.0
certifi==2018.1.18
cffi==1.11.4
chardet==3.0.4
Cheetah3==3.0.0
CherryPy==5.0.1
colored==1.3.5
cuttime==0.1
cvxopt==1.1.8
cycler==0.10.0
Cython==0.27.3
decorator==4.2.1
docutils==0.14
emd==1.0
flux-emd==1.0
flux-kernel==1.0
flux-migraph==1.0
funcsigs==1.0.2
functools32==3.2.3.post2
gdbm==2.7.14
gps==3.17
h5py==2.7.0
httplib2==0.9.2
idna==2.6
imagesize==0.7.1
Jinja2==2.10
libxml2-python==2.9.7
lxml==4.1.1
macholib==1.9
Mako==1.0.7
MarkupSafe==0.23
matplotlib==2.1.1
misvm==1.0
modulegraph==0.16
monotonic==1.4
mpmath==0.19
netCDF4==1.2.9
nose==1.3.7
numpy==1.14.0
oauth2==1.9.0.post1
olefile==0.44
passmash===master
pdfminer==20140328
Pillow==5.0.0
pkgconfig==1.1.0
pluggy==0.6.0
Polygon2==2.0.8
progressbar==2.3
py==1.5.2
py2app==0.14
pycairo==1.15.4
pycparser==2.18
Pygments==2.2.0
pygobject==3.26.1
pyobjc-core==3.0.4
pyobjc-framework-Cocoa==3.0.4
pyopencl==2017.2.2
PyOpenGL==3.1.0
PyOpenGL-accelerate==3.1.0
pyparsing==2.2.0
pytest==3.3.2
python-dateutil==2.6.1
pytools==2017.6
pytz==2017.3
PyYAML==3.12
pyzmq==16.0.4
rdc==1.0
requests==2.18.4
roman==2.0.0
scikit-learn==0.17.1
scipy==1.0.0
simplejson==3.6.5
singledispatch==3.4.0.3
six==1.11.0
snowballstemmer==1.2.0
Sphinx==1.6.6
sphinx-rtd-theme==0.2.4
sphinxcontrib-websupport==1.0.1
StereoVision==1.0.0
subprocess32==3.2.7
sympy==1.0
termcolor==1.1.0
Tkinter==0.0.0
tornado==4.5.2
tsne==0.1.1
typing==3.6.2
Unidecode==1.0.22
urllib3==1.22
virtualenv==15.1.0
wordcloud==1.2.1
wxPython==3.0.2.0
wxPython-common==3.0.2.0
yelp==1.0.2
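
If it helps to compare environments, a convenience snippet like this (using pkg_resources, not anything from the library itself) prints the versions most likely to matter here, i.e. the QP solver and the numerical stack:

import pkg_resources

# Report the installed versions of the packages most relevant to the solver and numerics.
for pkg in ('cvxopt', 'numpy', 'scipy', 'scikit-learn', 'misvm'):
    print(pkg, pkg_resources.get_distribution(pkg).version)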

vista-analytics commented 6 years ago

Thank you, Gary. I will re-run the experiment using your library versions. Thanks.

ventouris commented 4 years ago

Have you been able to find the problem? I have the same issue with my data as well: all predictions are from one class, and the values are too close to each other.
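
For reference, this is roughly how I summarize the scores I get back from predict; the dummy numbers below just illustrate the pattern (in my case every score is positive and nearly identical):

import numpy as np

def describe_scores(scores):
    """Summarize the sign distribution and spread of bag-level scores."""
    scores = np.asarray(scores, dtype=float)
    values, counts = np.unique(np.sign(scores), return_counts=True)
    print('sign counts:', dict(zip(values, counts)))
    print('score spread:', scores.min(), scores.max())

describe_scores([0.41, 0.42, 0.40, 0.41, 0.43])  # dummy scores, not my real output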