benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars 612 forks source link

leave_k_out_split produces incorrect values in test set #639

Open chrisjkuch opened 1 year ago

chrisjkuch commented 1 year ago

I noticed odd behavior in leave_k_out_split - specifically, under certain circumstances (many rows with one value?), the number returned for the withheld test set value is different from the actual value. This causes train + test to fail to reconstruct the original matrix.

Script to reproduce:

import implicit

implicit.__version__
#> '0.6.2'

from implicit.evaluation import leave_k_out_split
from scipy import sparse

ratings = sparse.csr_matrix(
    [[3, 2, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 1]]
)

train, test = leave_k_out_split(ratings, K=1, random_state=42)

diff = (train + test) - ratings

diff.nnz
#> 1

train.todense()
#> matrix([[3, 2, 0, 0],
#>         [1, 0, 0, 0],
#>         [0, 1, 0, 0],
#>         [0, 0, 1, 0],
#>         [0, 1, 0, 1]])

test.todense()
#> matrix([[0, 0, 4, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 1, 0]])

diff.todense()
#> matrix([[0, 0, 3, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0],
#>         [0, 0, 0, 0]])

Created at 2022-12-23 13:59:28 CST by reprexlite v0.5.0

I believe the issue is in _take_tails - the returned test_idx array has multiple copies of the first user index returned, so we end up with a test set value that is copied multiple times.

(As an aside, the call to _take_tails when shuffled=True does not pass on the rng, so the random state cannot be maintained.)

chrisjkuch commented 1 year ago

Existing test can also be made to fail occasionally if density is set below 0.1:

https://github.com/benfred/implicit/blob/871e0c7229b012108131b6211cd617e23a3b24bf/tests/evaluation_test.py#L18