ExaScience / smurff

Bayesian Factorization with Side Information in C++ with Python wrapper
MIT License

use binary matrix as input #123

Closed nbosc closed 5 years ago

nbosc commented 5 years ago

In your example you show how to factorise a binary matrix, but you actually binarise the matrix during the factorisation. Because I apply different thresholds depending on the data, as far as I know I cannot use smurff.ProbitNoise. I have therefore precomputed my binary matrix and would now like to run the factorisation, but I get an error when running:

session = smurff.TrainSession(
    priors = ['normal', 'normal'],
    num_latent = 16,
    burnin     = 40,
    nsamples   = 100,
    verbose    = 1,)

session.addTrainAndTest(train, test)

predictions = session.run()

Which is:

ValueError                                Traceback (most recent call last)
wrapper.pyx in smurff.wrapper.prepare_train_and_test()

ValueError: Train and test data must be the same shape: (172108, 572) != (43028, 572)

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-67-15d256907aaa> in <module>()
      6     verbose    = 1,)
      7 
----> 8 session.addTrainAndTest(train, test)
      9 
     10 predictions = session.run()

SystemError: <built-in method addTrainAndTest of smurff.wrapper.TrainSession object at 0x126047a50> returned a result with an error set

tvandera commented 5 years ago

Hi, train.shape and test.shape should be equal. Can you check?

nbosc commented 5 years ago

Hi, train and test had different shapes because I was doing an 80/20 split. Changing to 50/50 solves the problem, thanks. I don't get why the shapes have to be identical. Could you explain, or tell me where to look?

tvandera commented 5 years ago

In most cases train and test are sparse matrices (scipy.sparse). They have different non-zero elements, but their number of rows and columns should be equal.

You could have a dense train matrix, but then your test set is always going to overlap with your train set.
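A minimal sketch of the point above, using only NumPy and SciPy rather than any SMURFF helper: the stored entries of one sparse matrix are divided between a train and a test matrix, so an 80/20 split changes how many entries each holds but not the number of rows and columns. The function name `split_train_test`, its `test_ratio` and `seed` parameters, and the toy data are assumptions made for illustration and do not come from the thread.

```python
import numpy as np
import scipy.sparse as sp


def split_train_test(matrix, test_ratio=0.2, seed=0):
    """Split the stored entries of a sparse matrix into train and test.

    Hypothetical helper: both results keep the full (rows, cols) shape of
    the input; only the set of stored entries differs.
    """
    coo = sp.coo_matrix(matrix)
    rng = np.random.default_rng(seed)
    # Assign each stored entry to the test set with probability test_ratio.
    is_test = rng.random(coo.nnz) < test_ratio
    train = sp.coo_matrix(
        (coo.data[~is_test], (coo.row[~is_test], coo.col[~is_test])),
        shape=coo.shape,
    )
    test = sp.coo_matrix(
        (coo.data[is_test], (coo.row[is_test], coo.col[is_test])),
        shape=coo.shape,
    )
    return train.tocsr(), test.tocsr()


# Toy data: a 10 x 6 matrix of -1/+1 labels with roughly half the cells observed.
rng = np.random.default_rng(42)
values = rng.choice([-1.0, 1.0], size=(10, 6))
mask = rng.random((10, 6)) < 0.5
rows, cols = np.nonzero(mask)
observed = sp.coo_matrix((values[rows, cols], (rows, cols)), shape=(10, 6)).tocsr()

train, test = split_train_test(observed, test_ratio=0.2)
print(train.shape, test.shape)  # both shapes equal observed.shape
print(train.nnz, test.nnz)      # entry counts differ, shapes do not
```

Splitting by rows instead (for example, slicing the first 80% of rows into train) produces matrices with different row counts, which is the situation the `ValueError` above reports.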

tvandera commented 5 years ago

It's OK to binarize the matrix upfront, but do check what threshold SMURFF is using. This is printed at the beginning of sampling.
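As an illustration of binarizing upfront with data-dependent thresholds, here is a hedged sketch that maps each stored entry to -1 or +1 using a per-column threshold, so that a single classification threshold of 0 separates the two classes afterwards. The helper name `binarize_per_column`, the per-column layout of the thresholds, and the example values are assumptions; the thread does not say how the thresholds were actually chosen.

```python
import numpy as np
import scipy.sparse as sp


def binarize_per_column(matrix, thresholds):
    """Map stored entries to +1 (>= column threshold) or -1 (below it).

    Hypothetical helper: unobserved cells stay unstored, and the shape of
    the input is preserved.
    """
    coo = sp.coo_matrix(matrix)
    thresholds = np.asarray(thresholds, dtype=float)
    # Look up the threshold of the column each stored entry belongs to.
    labels = np.where(coo.data >= thresholds[coo.col], 1.0, -1.0)
    return sp.coo_matrix((labels, (coo.row, coo.col)), shape=coo.shape).tocsr()


# Four observed measurements in a 3 x 3 matrix (made-up numbers).
raw = sp.coo_matrix(
    ([5.2, 6.8, 7.1, 4.9], ([0, 1, 2, 2], [0, 0, 1, 2])),
    shape=(3, 3),
)
labels = binarize_per_column(raw, thresholds=[6.0, 6.5, 5.0])

# Fraction of positives among the stored entries, comparable in spirit to
# the "positives in test data" figure that SMURFF prints.
positive_fraction = float((labels.data > 0).mean())
print(labels.toarray())
print(positive_fraction)
```

Because the labels are -1 and +1 rather than 0 and 1, no stored entry is equal to zero, so the sparse structure still distinguishes negative labels from unobserved cells.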

nbosc commented 5 years ago

Ok, thanks for the clarification. It seems to work. My precomputed binary values are -1. and 1., so I use a threshold of 0. I get this:

Result: {
    Test data: 168002 [92557 x 572] (0.32%)
    Binary classification threshold: 0.00
      40.68% positives in test data

tvandera commented 5 years ago

Looks ok.