This is an experiment in Python using approaches from the ICML '15 paper “An embarrassingly simple approach to zero-shot learning” by Bernardino Romera-Paredes and Philip H. S. Torr:
http://jmlr.org/proceedings/papers/v37/romera-paredes15.pdf
With matrix factorization you can decompose an n*m matrix into an n*a matrix and an a*m matrix, where a is the number of latent features.
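As a minimal sketch of the shapes involved, the following builds an n*m matrix with a latent features and recovers one possible n*a and a*m pair. The sizes, the random data and the use of a truncated SVD are illustrative assumptions; the paper does not prescribe this particular factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, a = 6, 5, 2  # illustrative sizes only

# Build an n*m matrix that has exactly `a` latent features.
M = np.dot(rng.normal(size=(n, a)), rng.normal(size=(a, m)))

# A truncated SVD gives one possible n*a and a*m pair.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
left = U[:, :a] * s[:a]   # n*a
right = Vt[:a, :]         # a*m

print(left.shape, right.shape)
print(np.allclose(np.dot(left, right), M))
```

Because M was built with exactly a latent features, the product of the two factors reproduces it up to floating point error.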
The paper uses this idea to do zero-shot learning.
During the training stage an n*m weight/coefficient matrix is trained, where n is the number of features and m is the number of classes, such that np.argmax( np.dot( x.T, weight_matrix ) ) = predicted class.
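A small worked example of this prediction rule, with a made-up 4*3 weight matrix and a made-up sample (both are assumptions chosen so the arithmetic is easy to check by hand):

```python
import numpy as np

# Hypothetical weight matrix: 4 features (rows) by 3 classes (columns).
weight_matrix = np.array([
    [ 2.0, -1.0,  0.0],
    [ 0.0,  1.5, -1.0],
    [-1.0,  0.0,  2.0],
    [ 0.5,  0.5,  0.5],
])

# Hypothetical sample with 4 feature values.
x = np.array([0.1, 0.9, 0.2, 1.0])

scores = np.dot(x.T, weight_matrix)       # one score per class
predicted_class = int(np.argmax(scores))  # index of the highest score

print(scores)           # [0.5  1.75 0.  ]
print(predicted_class)  # 1
```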
They also use an a*m Signature matrix, where a is the number of (binary or soft) class attributes. These attributes can be found in the dataset, taken from external data, or obtained in an unsupervised manner.
For instance, when the classes are bear and horse and the attributes are [brown, can_ride, domesticated], the Signature matrix may look like:
bear horse
[ 1, 1 ] brown
[ 0, 1 ] can_ride
[ 0, 1 ] domesticated
Using the Signature matrix S and the Weight matrix W we calculate an n*a matrix V, such that np.dot(V, S) = ~W, an approximation of W.
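One way to compute V is ordinary least squares, sketched below with random stand-ins for W and S (the data and sizes are assumptions for illustration). Since np.dot(V, S) = ~W is the same as np.dot(S.T, V.T) = ~W.T, np.linalg.lstsq can solve for V.T directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, m = 64, 4, 6                 # features, attributes, classes
W = rng.normal(size=(n, m))        # stand-in for a trained weight matrix
S = rng.uniform(size=(a, m))       # stand-in for a signature matrix

# Solve np.dot(S.T, V.T) ~= W.T in the least-squares sense.
V_T, *_ = np.linalg.lstsq(S.T, W.T, rcond=None)
V = V_T.T                          # n*a

W_approx = np.dot(V, S)            # ~W, same shape as W
print(V.shape, W_approx.shape)
```

With fewer attributes than classes (a < m) the reconstruction is generally not exact, which is why the result is written ~W rather than W.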
When we want to predict new classes we create a new signature matrix S':
moose donkey tiger
[ 1, 1, 0 ] brown
[ 0, 1, 0 ] can_ride
[ 0, 1, 0 ] domesticated
We use np.dot( V, S' ) to obtain ~W'. For new test samples we do np.argmax( np.dot( x.T, ~W' ) ) to get our class prediction.
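The prediction step for new classes can be sketched with the moose/donkey/tiger signatures above. The matrix V here is a made-up identity matrix (each feature simply detects one attribute), chosen only so the result is easy to verify; a real V would come from the training stage.

```python
import numpy as np

new_classes = ["moose", "donkey", "tiger"]

# S': rows are brown, can_ride, domesticated; columns are the new classes.
S_new = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
])

# Made-up V (n*a with n = a = 3), an assumption for illustration only.
V = np.eye(3)

W_new = np.dot(V, S_new)        # ~W', one column per new class

# Hypothetical sample that looks brown, rideable and domesticated.
x = np.array([1.0, 1.0, 1.0])
scores = np.dot(x.T, W_new)
predicted = new_classes[int(np.argmax(scores))]

print(scores)     # [1. 3. 0.]
print(predicted)  # donkey
```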
We create the Weight matrix using logistic regression.
We create attributes with unsupervised learning.
We take the first 2 components of PCA and of LLE (locally linear embedding) to create 4 class attributes. For every class we average the 4-dimensional PCA_LLE vectors of the train samples belonging to that class to get our Signature matrix. For instance, with 2 classes digit1 and digit3:
digit1 digit3
[ 0.05, 0.06 ] PCA1
[ 0.11, 0.96 ] PCA2
[ 0.45, 0.11 ] LLE1
[ 0.95, 0.13 ] LLE2
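A minimal sketch of how such a Signature matrix could be built with scikit-learn. It assumes the dataset is sklearn's load_digits, picks n_neighbors=10 and the dense eigen solver arbitrarily, and applies no rescaling, so the numbers will not match the table above. The helper class_attributes is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

digits = load_digits()              # assumption: the 64-feature digits dataset
classes = [1, 3]
mask = np.isin(digits.target, classes)
X, y = digits.data[mask], digits.target[mask]

# Fit both unsupervised models on the training samples.
pca = PCA(n_components=2).fit(X)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                             eigen_solver="dense").fit(X)

def class_attributes(samples):
    """Hypothetical helper: PCA1, PCA2, LLE1, LLE2 for each sample."""
    return np.hstack([pca.transform(samples), lle.transform(samples)])

# One column per class: the mean attribute vector of that class's samples.
S = np.column_stack([class_attributes(X[y == c]).mean(axis=0) for c in classes])
print(S.shape)  # (4, 2)
```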
When we want to predict new classes we take at least 1 labeled sample per new class and use the fitted PCA and LLE models to get a 4-dimensional vector. Taking the average of more samples per class improves performance. For instance, for digit7 and digit9:
digit7 digit9
[ 0.04, 0.19 ] PCA1
[ 0.12, 0.76 ] PCA2
[ 0.49, 0.14 ] LLE1
[ 0.85, 0.11 ] LLE2
We use a 10-class digit dataset with 64 features. We use the digits 0, 1, 2, 7, 8, 9 as the seen classes and the digits 3, 4, 5, 6 as the unseen classes.
After fitting logistic regression (0.911 accuracy) on the 6 seen classes our Weight coefficient matrix is 64*6. Our Signature matrix is 4*6. We calculate a 64*4 matrix V.
To create predictions for the unseen classes we take 1 sample per new class and transform them with PCA and LLE to get our Signature matrix S'.
We use V and S' to calculate ~W' with size 64*4. Now for all test samples we calculate np.argmax( np.dot( x.T, ~W' ) ) to get a class prediction.
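The whole procedure can be sketched end to end as follows. This is a rough reconstruction under stated assumptions, not the code behind the reported numbers: it assumes sklearn's load_digits, scales pixels to [0, 1], fits logistic regression without an intercept so that np.dot(x.T, W) is the full score, picks the LLE settings arbitrarily, and uses the first k samples of each unseen class as the labeled samples. The helpers attributes and signature are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import LocallyLinearEmbedding

digits = load_digits()
X_all, y_all = digits.data / 16.0, digits.target   # assumption: scale to [0, 1]
seen, unseen = [0, 1, 2, 7, 8, 9], [3, 4, 5, 6]

seen_mask = np.isin(y_all, seen)
X_seen, y_seen = X_all[seen_mask], y_all[seen_mask]
X_new, y_new = X_all[~seen_mask], y_all[~seen_mask]

# W (64*6): logistic regression coefficients on the seen classes.
clf = LogisticRegression(fit_intercept=False, max_iter=2000).fit(X_seen, y_seen)
W = clf.coef_.T

# Attribute models are fitted on seen-class samples only.
pca = PCA(n_components=2).fit(X_seen)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                             eigen_solver="dense").fit(X_seen)

def attributes(samples):
    """Hypothetical helper: PCA1, PCA2, LLE1, LLE2 for each sample."""
    return np.hstack([pca.transform(samples), lle.transform(samples)])

def signature(X, y, classes):
    """One column per class: mean attribute vector of that class's samples."""
    return np.column_stack([attributes(X[y == c]).mean(axis=0) for c in classes])

S = signature(X_seen, y_seen, seen)               # 4*6
V = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T    # 64*4, np.dot(V, S) = ~W

# Use the first k samples of each unseen class as labeled samples.
k = 10
labeled = np.zeros(len(y_new), dtype=bool)
for c in unseen:
    labeled[np.flatnonzero(y_new == c)[:k]] = True

S_new = signature(X_new[labeled], y_new[labeled], unseen)   # S', 4*4
W_new = np.dot(V, S_new)                                    # ~W', 64*4

# Predict the remaining unseen-class samples and score them.
scores = np.dot(X_new[~labeled], W_new)
predictions = np.array(unseen)[np.argmax(scores, axis=1)]
accuracy = float(np.mean(predictions == y_new[~labeled]))
print(W.shape, S.shape, V.shape, W_new.shape, accuracy)
```

Changing k to 1 or 50 mirrors the different numbers of labeled samples per class; the resulting accuracies depend on the assumptions above and are not expected to match the reported figures.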
We obtain a multi-class accuracy of 0.846 with 50 labeled samples per class, 0.759 with 10 labeled samples per class, and 0.609 with 1 labeled sample per class. By comparison, random guessing over the 4 unseen classes gives 0.25 accuracy.
We used a very simple toy dataset and a non-rigorous method of evaluation. The goal was to replicate the basic idea with an extremely simple baseline, not to obtain (or claim) state-of-the-art performance.
One-shot learning has more constraints than the zero-shot learning approach (we need at least one labeled sample, or another model communicating this as a vector). But we do get to use this approach when no class attributes are available.
We completely gloss over one of the main contributions in the paper: regularization of the V matrix. We calculate V from W and S with least squares. The paper includes a regularizer with more favourable properties.
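For reference, the regularized closed-form solution in the paper, as we understand it, is V = (X X^T + gamma I)^-1 X Y S^T (S S^T + lambda I)^-1. The sketch below implements that formula. Note that it works directly from the training data X and labels Y rather than from a logistic regression weight matrix, and that the function name, argument names and the random demo data are assumptions for illustration.

```python
import numpy as np

def eszsl_fit(X, Y, S, gamma=1.0, lam=1.0):
    """Regularized closed-form V (hypothetical helper name).

    X: n_features * n_samples   (one column per training sample)
    Y: n_samples  * n_classes   (entries in {-1, 1})
    S: n_attrs    * n_classes   (one signature column per seen class)
    Returns V with shape n_features * n_attrs.
    """
    n_features, n_attrs = X.shape[0], S.shape[0]
    A = np.dot(X, X.T) + gamma * np.eye(n_features)   # symmetric
    B = np.dot(S, S.T) + lam * np.eye(n_attrs)        # symmetric
    XYS = np.dot(np.dot(X, Y), S.T)
    # V = A^-1 (X Y S^T) B^-1, using solves instead of explicit inverses.
    left = np.linalg.solve(A, XYS)
    return np.linalg.solve(B, left.T).T

# Random demo data, for shape checking only.
rng = np.random.default_rng(0)
n_features, n_samples, n_classes, n_attrs = 5, 30, 3, 2
X = rng.normal(size=(n_features, n_samples))
labels = rng.integers(0, n_classes, size=n_samples)
Y = -np.ones((n_samples, n_classes))
Y[np.arange(n_samples), labels] = 1.0
S = rng.uniform(size=(n_attrs, n_classes))

V = eszsl_fit(X, Y, S, gamma=0.1, lam=0.1)
print(V.shape)  # (5, 2)
```

The hyperparameters gamma and lambda would normally be chosen by validation on held-out seen classes.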
See the original Matlab code here: https://dl.dropboxusercontent.com/u/5961057/ESZSL_v0.1.zip and a repository with the code for the real data experiments in the paper here: https://github.com/bernard24/Embarrassingly-simple-ZSL