ExaScience / smurff

Bayesian Factorization with Side Information in C++ with Python wrapper
MIT License

Out-of-matrix prediction script issue #119

Closed tangoed2whiskey closed 4 years ago

tangoed2whiskey commented 5 years ago

I'm attempting to use smurff to do out-of-matrix predictions. I've followed the syn_out_matrix_prediction notebook, but am now having difficulty interpreting the output.

I was expecting that if I run the pred_out_of_matrix function with the side information of the training data, this should be the same as using the predict_all() method directly on the predictor. Eventually I'll want to use new side data, but I'm starting off with something we should know the answer for.

However, the script below gives different results using the pred_out_of_matrix function and using the predict_all() method. Both results also don't appear particularly accurate.

Would it be possible to explain what I'm doing wrong with this test?

Code and sample output below.

import data_simulation as sim 
import numpy as np
from scipy.sparse import coo_matrix
import smurff

ds = sim.gen_matrix(1000,400,320,4)
sparse_matrix = sim.sparsify(ds['matrix'], sparsity = 0.2)

train_indices = np.random.choice(sparse_matrix.shape[0], round(0.8*sparse_matrix.shape[0]),replace=False)
test_indices = np.setdiff1d(np.arange(sparse_matrix.shape[0]), train_indices)
train_indices = np.sort(train_indices)
test_indices = np.sort(test_indices)

train_ds = sparse_matrix[train_indices,]
train_fea = ds['sinfo'][train_indices,]
test_ds = sparse_matrix[test_indices,]
test_fea = ds['sinfo'][test_indices,]

print('Actual mean of train data                             : {:.5f}'.format(np.mean(test_ds)))
print('')

sp_train_ds = coo_matrix(train_ds)
sp_train_fea = coo_matrix(train_fea)

sp_train_ds1, sp_train_ds2 = smurff.make_train_test(sp_train_ds, 0.01)

session = smurff.MacauSession(  Ytrain     = sp_train_ds1,
                                Ytest      = sp_train_ds2,
                                side_info  = [sp_train_fea,None],
                                num_latent = 4,
                                burnin     = 100,
                                nsamples   = 1000,
                                save_freq  = 100,
                                save_prefix="test/save",
                                verbose    = 0)

predictions = session.run()

predictor = session.makePredictSession()

def predict_out_of_matrix_1s(side_info_matrix, sample_predictor):
    """Out-of-matrix prediction using one sample

    Args:
        side_info_matrix: numpy side info matrix
        sample_predictor: Smurff sample object

    Returns:
        numpy fully predicted matrix

    """
    U, V = sample_predictor.latents
    Umean, Vmean = sample_predictor.latent_means
    Ubeta, Vbeta = sample_predictor.betas

    wU = side_info_matrix.dot(Ubeta.transpose()) + Umean
    m  = np.matmul(wU, V)

    return m

def pred_out_of_matrix(side_info_matrix, predictor):
    """Out-of-matrix prediction using all of the samples

    Args:
        side_info_matrix: numpy side info matrix
        predictor:        Smurff PredictSession

    Returns:
        numpy fully predicted matrix (obtained by averaging)

    """

    predictions = np.array([predict_out_of_matrix_1s(side_info_matrix, s) for s in predictor.samples])

    return predictions.mean(axis = 0)

pred_test_ds = pred_out_of_matrix(train_fea, predictor)
pred_test_ds2 = predictor.predict_all()[0]
print('Predicted mean of train data using pred_out_of_matrix: {:.5f}'.format(np.mean(pred_test_ds)))
print('Predicted mean of train data using .predict_all      : {:.5f}'.format(np.mean(pred_test_ds2)))

which gives output

Actual mean of train data                            : 0.02434

Predicted mean of train data using pred_out_of_matrix: 0.10887
Predicted mean of train data using .predict_all      : 0.13461

tvandera commented 5 years ago

Hi Tom,

Thanks for your excellent report. I'll need some time to look into it. I'm not the expert on the algorithm, or on out-of-matrix prediction.

@thanhlv @jaak-s: can you cast an eye on the above?

tvandera commented 5 years ago

I got this answer from Thanh:

Hi Tom,

The out-of-matrix prediction in the notebook can indeed be validated using the colors of the implanted biclusters. We created a dataset in which we implanted four diagonal biclusters. Then we randomly sampled 80% of the data for training and 20% for testing. The sampled data for the train set and the test set are sorted by index; hence the structure and colors of the biclusters are more or less preserved in both. That is, if we train SMURFF on the train set and then make the out-of-matrix prediction using the row features of the test set, the resulting predicted matrix should have colors and structure similar to those in the original matrix.

The pred_out_of_matrix() function performs out-of-matrix prediction, while predict_all(), provided by the SMURFF API, performs in-matrix prediction. Hence, they are not the same.

Hope this helps. If there is something unclear, please feel free to let me know.

Best regards, Thanh

tangoed2whiskey commented 5 years ago

Thanks for the reply; I'm afraid I still don't entirely understand the difference between in- and out-of-matrix predictions. As far as I can tell, when making predictions there is effectively a function that maps the side information onto the latent space, and then predictions are made using this reduced-dimensional matrix. I can't see why this is different when the test examples are in or out of the original matrix: whether they have been trained on is the key property (if trained on, they should be well predicted), but surely using the training side data to make in- or out-of-matrix predictions should give the same result?

I know I'm missing something here, if you would be able to clarify further that would be great!

tvandera commented 5 years ago

Hey Tom,

to be honest: I also do not understand the notebook Thanh made.

But what you describe about out-of-matrix predictions is correct and is supported using python in SMURFF (see https://smurff.readthedocs.io/en/latest/notebooks/inference_with_smurff.html#Make-predictions-using-side-information)

The code in SMURFF that implements this is a bit complicated: https://github.com/ExaScience/smurff/blob/master/python/smurff/smurff/predict.py#L90

But the original implementation by Jaak is much clearer: https://macau.readthedocs.io/en/latest/source/saving_models.html#using-the-saved-model-to-predict-new-rows-compounds

If you want, we can do a conference call where I explain how this works.

Cheers, Tom

tangoed2whiskey commented 5 years ago

Hi Tom,

Thanks very much for that, that was really helpful! I think I now understand the problem much better. However, I still have not managed to work out why predict_all is working differently to the out-of-matrix prediction. I have hacked together my own method, modelled on the predict_one method you highlighted, which makes predictions for many examples at a time:

    def predict_many(self, coords, value = float("nan")):
        ret=[]
        for coord in coords:
            p = Prediction(coord, value)
            for s in self.samples:
                p.add_sample(s.predict(p.coords))
            ret.append(p)
        return [r.pred_all for r in ret]

Using this as

places = [(sinfo, col) for sinfo in train_fea for col in range(train_ds.shape[1])]
pred_test_ds3 = np.mean(predictor.predict_many(places),axis=0)

the constructed pred_test_ds3 gives exactly the same as the pred_out_of_matrix function (as one would hope, as they do the same thing!). However this is not the same as the predict_all method gives, and in my tests the predict_all method is considerably more accurate. This makes me wary of using the pred_out_of_matrix (or predict_many) function in anger, as it can't pass the simple test of giving the same answer as what should be a comparable method.
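A side note on the comparison itself: two prediction matrices can have the same mean and still disagree entry by entry, so an element-wise check is a stricter test than comparing means. Below is a minimal NumPy sketch of such a check. The helper name `compare_predictions` is hypothetical (it is not part of SMURFF), and the arrays are toy data standing in for real predictions.

```python
import numpy as np

def compare_predictions(pred_a, pred_b):
    """Return (rmse, pearson_r) between two prediction arrays of equal size.

    Hypothetical helper for illustration; not part of the SMURFF API.
    """
    a = np.asarray(pred_a, dtype=float).ravel()
    b = np.asarray(pred_b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("prediction arrays must have the same number of elements")
    rmse = float(np.sqrt(np.mean((a - b) ** 2)))
    pearson_r = float(np.corrcoef(a, b)[0, 1])
    return rmse, pearson_r

# Toy data: `shuffled` holds the same values as `base` in a different order,
# so both matrices share one mean while the individual entries differ.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 5))
shuffled = rng.permutation(base.ravel()).reshape(base.shape)

rmse, pearson_r = compare_predictions(base, shuffled)
print("means:", base.mean(), shuffled.mean())
print("rmse:", rmse, "pearson r:", pearson_r)
```

With real data, the two arguments would be dense arrays of the same shape, for example the output of pred_out_of_matrix and a dense version of the predict_all result.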

I hope that makes clear what I'm concerned about!

Best wishes, Tom

tvandera commented 5 years ago

Out of matrix prediction is not using the train matrix, only side info. predict_all is using the train data and the side info. Does this make sense?
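To make that distinction concrete, here is a minimal NumPy sketch with made-up arrays (no SMURFF calls). It assumes a Macau-style model in which the latent vector of a row seen during training is the side-info projection plus a mean plus a row-specific residual learned from the train matrix. All names (`F`, `beta`, `mu`, `residual`) are illustrative and are not SMURFF attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_feat, k = 6, 5, 3, 2

F = rng.normal(size=(n_rows, n_feat))  # side info, one row per entity
beta = rng.normal(size=(k, n_feat))    # link matrix: features -> latent space
mu = rng.normal(size=k)                # latent mean
V = rng.normal(size=(k, n_cols))       # column latents, shape (k, n_cols)

# What side info alone can say about each row's latent vector.
prior_mean = F @ beta.T + mu

# Assumption: rows that were in the train matrix also carry a residual
# fitted to their observed values. Here it is just random noise.
residual = 0.1 * rng.normal(size=(n_rows, k))
U = prior_mean + residual              # stand-in for one sampled row-latent matrix

in_matrix = U @ V                      # uses train data and side info
out_of_matrix = prior_mean @ V         # uses side info only

print(np.abs(in_matrix - out_of_matrix).max())
```

Under these assumptions the two predictions differ by exactly `residual @ V`, which is the part that only rows present in the train matrix can have. That is why feeding the training side info into the out-of-matrix path is not expected to reproduce predict_all exactly.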

tangoed2whiskey commented 5 years ago

I'm sorry, I'm still not quite understanding this: exactly what extra information does the predict_all method have access to? What is it using from the matrix of training data? I can't tell from the code, and I especially can't tell why this same information shouldn't be applicable to out-of-matrix predictions (with some caveats, of course).

tvandera commented 5 years ago

[Formula image not preserved in this copy], where R is the rating matrix

tvandera commented 5 years ago

Hi Tom,

I found a bug in the out-of-matrix prediction code. See #120.

T.

tangoed2whiskey commented 5 years ago

Cheers Tom, thanks for the heads-up; I'll definitely take another look at using the out-of-matrix predictions when that's sorted. I assume there isn't an easy fix I could implement quickly?

tvandera commented 5 years ago

The fix has been implemented and I’m currently testing it. I’m also planning on creating a better explanatory notebook.

tangoed2whiskey commented 5 years ago

Brilliant! Looking forward to trying it soon then

tvandera commented 4 years ago

Hi Tom, after some fixes to your code, it seems to work out:

#!/usr/bin/env python
# coding: utf-8

# In[ ]:

def predict_out_of_matrix_1s(side_info_matrix, sample_predictor):
    """Out-of-matrix prediction using one sample

    Args:
        side_info_matrix: numpy side info matrix
        sample_predictor: Smurff sample object

    Returns:
        numpy fully predicted matrix

    """
    U, V = sample_predictor.latents
    Umu, Vmu = sample_predictor.mus
    Ubeta, Vbeta = sample_predictor.betas

    wU = side_info_matrix.dot(Ubeta.transpose()) + Umu
    m  = np.matmul(wU, V)

    return m

def pred_out_of_matrix(side_info_matrix, predictor):
    """Out-of-matrix prediction using all of the samples

    Args:
        side_info_matrix: numpy side info matrix
        predictor:        Smurff PredictSession

    Returns:
        numpy fully predicted matrix (obtained by averaging)

    """

    predictions = np.array([predict_out_of_matrix_1s(side_info_matrix, s) for s in predictor.samples()])

    return predictions.mean(axis = 0)

# In[ ]:

import data_simulation as sim 
import numpy as np
from scipy.sparse import coo_matrix
import smurff

ds = sim.gen_matrix(1000,400,320,4)
sparse_matrix = sim.sparsify(ds['matrix'], sparsity = 0.2)

print("Main matrix: ", sparse_matrix.shape)

train_indices = np.random.choice(sparse_matrix.shape[0], round(0.8*sparse_matrix.shape[0]),replace=False)
test_indices = np.setdiff1d(np.arange(sparse_matrix.shape[0]), train_indices)
train_indices = np.sort(train_indices)
test_indices = np.sort(test_indices)

train_ds = sparse_matrix[train_indices,]
train_fea = ds['sinfo'][train_indices,]
test_ds = sparse_matrix[test_indices,]
test_fea = ds['sinfo'][test_indices,]

sp_train_ds = coo_matrix(train_ds)
sp_train_fea = coo_matrix(train_fea)
sp_test_ds = coo_matrix(test_ds)

sp_train_ds1, sp_train_ds2 = smurff.make_train_test(sp_train_ds, 0.1)

print("Validation matrix:", sp_test_ds.shape)
print("Train matrix:", sp_train_ds1.shape)
print("Test matrix:", sp_train_ds2.shape)

print('Actual mean of validation data                       : {:.5f}'.format(np.mean(sp_test_ds.data)))
print('Actual mean of train data                            : {:.5f}'.format(np.mean(sp_train_ds1.data)))
print('Actual mean of test data                             : {:.5f}'.format(np.mean(sp_train_ds2.data)))

session = smurff.MacauSession(  Ytrain     = sp_train_ds1,
                                Ytest      = sp_train_ds2,
                                side_info  = [sp_train_fea,None],
                                num_latent = 32,
                                burnin     = 100,
                                nsamples   = 400,
                                save_freq  = 1,
                                save_prefix=".",
                                verbose    = 1,
                                direct     = True)

predictions = session.run()
predictor = session.makePredictSession()

# In[ ]:

pred_test_ds = pred_out_of_matrix(train_fea, predictor)
pred_test_ds2 = predictor.predict_all()
print('Predicted mean of train data using pred_out_of_matrix: {:.5f}'.format(np.mean(pred_test_ds)))
print('Predicted mean of train data using .predict_all      : {:.5f}'.format(np.mean(pred_test_ds2)))

tvandera commented 4 years ago

Now I get the output:

Main matrix:  (1000, 400)
Validation matrix: (200, 400)
Train matrix: (800, 400)
Test matrix: (800, 400)
Actual mean of validation data                       : 0.10944
Actual mean of train data                            : 0.11539
Actual mean of test data                             : 0.17850

Predicted mean of train data using pred_out_of_matrix: 0.13512
Predicted mean of train data using .predict_all      : 0.13544

tvandera commented 4 years ago

Feel free to re-open if you want.