Question about leave_k_out function

Deemjan commented 2 years ago

I noticed something odd when using this function to split my data into train and test sets. The distribution of users by number of ratings given looked like this:

Number of ratings given	Number of users
1	6000
2	3000
3	200
4	30

The documentation states that users with more than K ratings have one of their ratings put into the test set and the others into the train set. So when I called the function with K = 1, I expected 3230 records in the test set (3000 + 200 + 30, every user with at least 2 ratings), but got only 230 (200 + 30, only users with at least 3 ratings).

So my question is: shouldn't this line https://github.com/benfred/implicit/blob/6491663bb0b63f0c5eac3843312701a5f38d1e79/implicit/evaluation.pyx#L189 look like this

candidate_mask = counts >= K + 1

or this

candidate_mask = counts > K

instead?

My guess is that it was done this way to prevent a user with 2 ratings from ending up with only 1 rating in the train set, because, if I understand correctly, users with 1 rating are useless for training. Please verify.
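
For reference, here is a small NumPy sketch of the arithmetic in this question. It does not call the library. It rebuilds the per-user rating counts from the table above and compares two masks. The `counts > K + 1` form is an assumption about the linked line, chosen because it reproduces the 230 records observed. All names other than `candidate_mask`'s two candidate expressions are hypothetical.

```python
import numpy as np

# Ratings-per-user distribution from the table above: {ratings given: number of users}.
distribution = {1: 6000, 2: 3000, 3: 200, 4: 30}

# One entry per user, holding that user's number of ratings.
counts = np.repeat(list(distribution.keys()), list(distribution.values()))

K = 1

# Assumed form of the current mask (consistent with the 230 test records observed).
current_mask = counts > K + 1
# Form proposed in this question.
proposed_mask = counts > K

# With K = 1, each selected user contributes one rating to the test set.
n_current = int(current_mask.sum())
n_proposed = int(proposed_mask.sum())

# Fewest ratings any selected user keeps in the train set after K are held out.
min_train_current = int((counts[current_mask] - K).min())
min_train_proposed = int((counts[proposed_mask] - K).min())

print(n_current, n_proposed)                  # 230 3230
print(min_train_current, min_train_proposed)  # 2 1
```

Under these assumptions, the stricter mask leaves every selected user with at least 2 train ratings, while the proposed mask would leave the 3000 users who have 2 ratings with a single train rating each.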

ita9naiwa commented 2 years ago

Yes, it looks like a bug, and it must be fixed.

ita9naiwa commented 2 years ago

I'm sorry, it's intended.

benfred / implicit

Question about leave_k_out function #571