ExaScience / smurff

Bayesian Factorization with Side Information in C++ with Python wrapper
MIT License

question on accessing the prediction results from `Macau` module #142

Closed: FanwangM closed this issue 3 years ago

FanwangM commented 3 years ago

Sorry to bother again, but I do appreciate the effort of maintaining this nice package; it has been very helpful in my research.

I am trying to use smurff to predict missing values along with side information, but I am a little confused about the predictions from trainSession; see https://github.com/ExaScience/smurff/blob/master/docs/notebooks/inference_with_smurff.ipynb. In the notebook, when making predictions after training the model, I noticed that we get 10 predictions per element. For example, I would expect a single prediction when predicting just one element, but instead we get a list of 10 values. I know this might be related to save_freq in https://github.com/ExaScience/smurff/blob/29c3859badca49275833024cd77f8ca7fa6f76be/python/smurff/trainsession.py#L49, but I don't understand how.

Besides accessing the predicted mean value, how can we get the predicted variance/std for the training/testing data?

Thank you for taking your valuable time to help me! @tvandera

FanwangM commented 3 years ago

One follow-up question: how can I get the predictions (the fit, to be more precise) on the training data, and the corresponding variance/std, from trainSession? Thank you.

Update: I need this information to tune the hyperparameters with grid search and cross-validation.

I have figured out how to get the test-data-related items:

import numpy as np

# train model based on https://smurff.readthedocs.io/en/release-0.17/notebooks/inference_with_smurff.html#Making-predictions-from-a-TrainSession

# inspired by
# https://github.com/oaoni/sklearn-bpmf/blob/main/sklearn_bpmf/core/wrapper.py
predictions_test = trainSession.getTestPredictions()

# pred_avg_test: predicted mean values on the test data
pred_avg_test = np.array([p.pred_avg for p in predictions_test])

# pred_var_test: predicted variances on the test data
pred_var_test = np.array([np.var(p.pred_all) for p in predictions_test])

# pred_std_test: predicted standard deviations on the test data
pred_std_test = np.sqrt(pred_var_test)

The only concern I have is: how can I access the predictions without saving an HDF5 file? Even if I set save_name=None, an output.hdf5 is still written, according to the smurff code.

I can access the train RMSE and test RMSE, but I don't know how to get the predictions on the training data, or the variance/std of those predictions.

status = trainSession.getStatus()
# train RMSE
rmse_train = status.train_rmse
# test RMSE (**Not sure if this is right**)
rmse_test = status.rmse_avg

Thank you again for your time! @tvandera @thanhlv @jaak-s

tvandera commented 3 years ago

Hi, let me try to answer some of your questions.

SMURFF uses Gibbs sampling to sample from the posterior distribution. You set the number of samples generated with the nsamples parameter in smurff.TrainSession. The save_freq parameter (also in smurff.TrainSession) determines how many of these samples are saved to disk, in an HDF5 file. This HDF5 file is the only data that persists after a train session; samples are not kept in memory. If you do not explicitly set a save_name, a temporary HDF5 file is created.
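For example, a minimal sketch of such a session (the parameter values are only illustrative; with nsamples=100 and save_freq=10, every 10th sample is written, so 10 samples end up in the HDF5 file):

import smurff

ic50_train, ic50_test, ecfp = smurff.load_chembl()

trainSession = smurff.MacauSession(
    Ytrain     = ic50_train,
    Ytest      = ic50_test,
    side_info  = [ecfp, None],
    num_latent = 16,
    burnin     = 200,
    nsamples   = 100,               # Gibbs samples drawn after burn-in
    save_freq  = 10,                # keep every 10th sample => 10 samples in the HDF5 file
    save_name  = "ic50-macau.hdf5", # where the saved samples go
)
trainSession.run()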

The test RMSE calculated during training is the RMSE for all elements in the Ytest set. For each point in Ytest a prediction is made by averaging across all Gibbs samples. Hence the difference between rmse_avg and rmse_1sample.
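In code, you can compare the two from the status of the finished train session (a rough sketch; I am assuming the status item exposes both values under the field names mentioned above):

status = trainSession.getStatus()
print(status.rmse_avg)       # test RMSE of predictions averaged over all Gibbs samples
print(status.rmse_1sample)   # test RMSE of the predictions from a single (the last) sample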

You need the HDF5 file if you want to make predictions afterwards (see smurff.PredictSession). You will get predictions as a list of Gibbs samples, with as many samples as were saved in the HDF5 file.
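A minimal sketch of that workflow (assuming the PredictSession can be opened directly from the saved HDF5 file name; calling makePredictSession() on a still-open train session should be equivalent):

import numpy as np
import smurff

# open the samples that were saved during training
predictor = smurff.PredictSession("ic50-macau.hdf5")

# one dense prediction matrix per saved Gibbs sample
p_all = predictor.predict_all()
print(len(p_all))            # == number of samples in the HDF5 file

# per-element posterior mean and std, i.e. the spread across the saved samples
p_mean = np.mean(p_all, axis=0)
p_std  = np.std(p_all, axis=0)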

I hope this helps, Tom

FanwangM commented 3 years ago

Hi Tom! Your comments make a lot of sense to me. Thank you! @tvandera I enjoyed reading the materials you sent.

Based on what you just mentioned, the final result is a list of Gibbs samples. One question here is why the RMSE on the training data is not the same as the one reported by trainSession.getStatus() in the following example (modified from inference_with_smurff.ipynb):

import os
import logging

import smurff
from scipy.sparse import find
from sklearn.metrics import mean_squared_error

ic50_train, ic50_test, ecfp = smurff.load_chembl()

# limit to 100 rows and 100 features to make things go faster
ic50_train = ic50_train.tocsr()[:100,:]
ic50_test = ic50_test.tocsr()[:100,:]
ecfp = ecfp.tocsr()[:100,:].tocsc()[:,:100]

# indices for testing data
(i_test,j_test,v_test) = find(ic50_test)
# indices for training data
(i_train,j_train,v_train) = find(ic50_train)

# train the model
# here I set save_freq  = 1 to avoid precision loss
trainSession = smurff.MacauSession(
                       Ytrain     = ic50_train,
                       Ytest      = ic50_test,
                       side_info  = [ecfp, None],
                       num_latent = 16,
                       burnin     = 200,
                       nsamples   = 100,
                       save_freq  = 1,
                       save_name  = "ic50-macau.hdf5",
                       verbose    = 2,)
predictions = trainSession.run()

# predictions for all the matrix elements
predictor = trainSession.makePredictSession()
p_all = predictor.predict_all()
# averaged predictions for all the matrix elements
p_all_avg = sum(p_all) / len(p_all)

# test rmse
status = trainSession.getStatus()
test_rmse_smurff = predictor.statsYTest["rmse_avg"]
test_rmse_pred = mean_squared_error(v_test, p_all_avg[i_test, j_test], squared=False)
print("testing RMSE from smurff: ", test_rmse_smurff)
print("testing RMSE from my experiment: ", test_rmse_pred)

# train rmse
status = trainSession.getStatus()
train_rmse_smurff = status.train_rmse
train_rmse_pred = mean_squared_error(v_train, p_all_avg[i_train, j_train], squared=False)
print("training RMSE from smurff: ", train_rmse_smurff)
print("training RMSE from my experiment: ", train_rmse_pred)

The output is

# RMSE for testing data
testing RMSE from smurff:  0.8699444107584667
testing RMSE from my experiment:  0.8699444107584667
# RMSE for training data
training RMSE from smurff:  0.4158096745563251
training RMSE from my experiment:  0.18156834235475147

I got the same RMSE value for the test data, but not for the training data, so I assume there is something wrong with my code here. Is predictor.predict_all() the right way to get everything from the predictions, i.e. predictions for the training data (the fit, actually), the test data, and the unobserved entries? The reason I do it this way is that I need predictions on both the test and training data for further analysis (uncertainty characterization and hyperparameter optimization).

One more question: can I use p = predictor.predict_sparse(ic50_train) to get predictions on the training data?

Thank you for your patience and I do appreciate your help!

Fanwang

tvandera commented 3 years ago

Hi,

Indeed something is wrong, either with train_rmse or with test_rmse. I will investigate this further and let you know.

Both predict_all and predict_sparse should work the way you show in your code, and should give the same result.

T.

tvandera commented 3 years ago

It seems the training RMSE computed by SMURFF is the RMSE for the last Gibbs sample.

Changing these lines in your code:

# train rmse
status = trainSession.getStatus()
train_rmse_smurff = status.train_rmse
train_rmse_pred = mean_squared_error(v_train, p_all_avg[i_train, j_train], squared=False)
train_rmse_last_sample = mean_squared_error(v_train, p_all[-1][i_train, j_train], squared=False)
print("training RMSE from smurff: ", train_rmse_smurff)
print("training RMSE from my experiment (all samples): ", train_rmse_pred)
print("training RMSE from my experiment (last sample): ", train_rmse_last_sample)

Results in:

testing RMSE from smurff:  0.9051012624682462
testing RMSE from my experiment:  0.905101262468246
training RMSE from smurff:  0.40754707127661227
training RMSE from my experiment (all samples):  0.19372401698540637
training RMSE from my experiment (last sample):  0.40754707127661227

FanwangM commented 3 years ago

Then what should I use for the training error: the average over all the Gibbs samples, or just the last sample? I am confused about which should be interpreted as the final predictions.

The original paper reports the error of the last sample. But the average of all samples seems to make sense to me.

Thanks, Tom. @tvandera

With kind regards, Fanwang

tvandera commented 3 years ago

Hi,

I would say the error on the training data is largely irrelevant, except maybe to detect over-fitting. Also, I'm pretty sure Jaak reports the RMSE for the test set on predictions averaged across all samples. Code here

FanwangM commented 3 years ago

Thanks for the reply. I intended to use the training and testing errors for cross-validation and hyperparameter tuning to avoid overfitting, so I am going to use the errors averaged over all samples, for both training and testing, in my study.

Thanks again and I am going to close this issue for now.