I noticed something odd when using this function to split my data into train and test sets.
The distribution of how many ratings each user has given looks like this:
| Number of ratings given | Number of users |
|---|---|
| 1 | 6000 |
| 2 | 3000 |
| 3 | 200 |
| 4 | 30 |
The documentation states that users with more than K ratings have one of their ratings put into the test set, and the others into the train set.
So when I called the function with K = 1, I expected 3230 records in the test set (3000 + 200 + 30), but got only 230.
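To make the arithmetic explicit, here is a small sanity check on the distribution above. It is only a sketch: the helper name `expected_test_size` is made up for illustration, and the `K + 1` threshold is my assumption about what would reproduce the observed 230, not a quote of the library code.

```python
# Distribution from above: {ratings given per user: number of users}
users_by_rating_count = {1: 6000, 2: 3000, 3: 200, 4: 30}


def expected_test_size(distribution, threshold):
    """Count users with strictly more than `threshold` ratings.

    With one held-out rating per eligible user, this is also the
    expected number of records in the test set.
    """
    return sum(
        n_users
        for n_ratings, n_users in distribution.items()
        if n_ratings > threshold
    )


K = 1
# Threshold as I read the documentation: users with > K ratings.
documented = expected_test_size(users_by_rating_count, K)
# Assumed stricter threshold: users with > K + 1 ratings.
stricter = expected_test_size(users_by_rating_count, K + 1)

print(documented, stricter)  # 3230 230
```

The second number matches what I actually got, which suggests the split only considers users with more than K + 1 ratings.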
So my question is: shouldn't this line, https://github.com/benfred/implicit/blob/6491663bb0b63f0c5eac3843312701a5f38d1e79/implicit/evaluation.pyx#L189, look like this
or this
instead?
My guess is that it was done this way to prevent a user with 2 ratings from being left with only 1 rating in the train set, because, if I understand correctly, users with 1 rating are useless for training. Could you please verify?