benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars, 612 forks

Question about leave_k_out function #571

Open Deemjan opened 2 years ago

Deemjan commented 2 years ago

I noticed something odd when using this function to split my data into train and test sets. The distribution of users by number of ratings given looks like this:

| Number of ratings given | Number of users |
|---|---|
| 1 | 6000 |
| 2 | 3000 |
| 3 | 200 |
| 4 | 30 |

The documentation states that users with more than K ratings have one of their ratings put into the test set and the rest into the train set. So when I called the function with K = 1, I expected 3230 records in the test set (3000 + 200 + 30), but got only 230 (200 + 30).

So my question is: shouldn't this line https://github.com/benfred/implicit/blob/6491663bb0b63f0c5eac3843312701a5f38d1e79/implicit/evaluation.pyx#L189 look like this

candidate_mask = counts >= K + 1 

or this

candidate_mask = counts > K

instead?
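
To make the arithmetic concrete, here is a small NumPy sketch that counts how many users pass each candidate mask for the distribution in the table above. The observed 230 matches a threshold of `counts > K + 1`. That threshold is inferred from the numbers in this issue, not quoted from the library source. The helper `n_candidates` is hypothetical and only exists for this illustration.

```python
import numpy as np

# Distribution from the table above: ratings per user -> number of users.
ratings_per_user = np.array([1, 2, 3, 4])
num_users = np.array([6000, 3000, 200, 30])

# One entry per user, holding that user's rating count.
counts = np.repeat(ratings_per_user, num_users)

K = 1


def n_candidates(mask):
    """Hypothetical helper: number of users selected by a boolean mask."""
    return int(mask.sum())


# Threshold consistent with the observed result (assumption, see lead-in).
strict = n_candidates(counts > K + 1)

# The two alternatives proposed above; they select the same users.
relaxed = n_candidates(counts > K)
relaxed_alt = n_candidates(counts >= K + 1)

# With K = 1 every candidate user contributes exactly one test record.
print(strict, relaxed, relaxed_alt)
```

With K = 1 the test-set size equals the number of candidate users, so the stricter threshold gives 230 records and either proposed alternative gives 3230.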

My guess is that it was done this way to prevent a situation where a user with 2 ratings ends up with only 1 rating in the train set, because, if I understand correctly, users with 1 rating are useless for training. Please verify.
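
The trade-off guessed at above can be seen on a toy example. This is a minimal sketch of a leave-k-out split over a flat array of user ids, not the library's implementation. The function `leave_k_out`, its `min_count` parameter, and the helper `train_counts` are all hypothetical names introduced for illustration.

```python
import numpy as np


def leave_k_out(users, K=1, min_count=3, seed=0):
    """Toy split (not the library's code): return a boolean test mask.

    Users with at least `min_count` interactions get K of them held out.
    """
    rng = np.random.default_rng(seed)
    users = np.asarray(users)
    test_mask = np.zeros(len(users), dtype=bool)
    unique_users, counts = np.unique(users, return_counts=True)
    for u in unique_users[counts >= min_count]:
        idx = np.flatnonzero(users == u)
        # Hold out K randomly chosen interactions of this user.
        test_mask[rng.choice(idx, size=K, replace=False)] = True
    return test_mask


# User 0 has 1 rating, user 1 has 2 ratings, user 2 has 3 ratings.
users = np.array([0, 1, 1, 2, 2, 2])


def train_counts(test_mask):
    """Ratings left in the train set, per user."""
    return np.bincount(users[~test_mask], minlength=3)


# Equivalent to counts > K + 1 for K = 1: only user 2 is split.
strict_train = train_counts(leave_k_out(users, K=1, min_count=3))

# Equivalent to counts > K for K = 1: user 1 is split as well.
relaxed_train = train_counts(leave_k_out(users, K=1, min_count=2))

print(strict_train, relaxed_train)
```

Under the stricter threshold, user 1 keeps both ratings in the train set. Under the relaxed one, user 1 is left with a single training rating, which is the situation described above.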

ita9naiwa commented 2 years ago

Yes, it looks like a bug, and it must be fixed.

ita9naiwa commented 2 years ago

I'm sorry, it's intended.