Correct way to create a (train, test) csr_matrix

benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets

https://benfred.github.io/implicit/

MIT License

Correct way to create a (train, test) csr_matrix #675

Closed Blo0dR0gue closed 1 year ago

Blo0dR0gue commented 1 year ago

Hi @benfred,

If I want to create a csr_matrix from a pandas dataframe with the columns user_id, item_id and count, where user_id and item_id are unique (numeric) identifiers for a customer and a product respectively, can I use user_id and item_id directly as row_ind and col_ind, even if they are not consecutive from 0 to X? As in csr_array((df['count'], (df['user_id'], df['item_id']))). The reason the IDs are not sequential is that I split the data based on groups within the transactions. This may result in user_ids or item_ids that are missing from the test or training dataset. The IDs are generated before the split so that the products and customers have the same IDs in the training and test datasets. Or do the IDs need to be sequential from 0 to X, so that I need to generate the IDs for the test dataset first, then map them to the training dataset, and then generate the remaining missing IDs for the training dataset? Thanks for your help.

admivsn commented 1 year ago

Hi @Blo0dR0gue,

A couple of things:

It may be better to use scipy.sparse.coo_matrix instead of scipy.sparse.coo_array; I ran into some issues further downstream when using the latter.
The user_id and item_id that you feed into scipy.sparse.coo_matrix are interpreted as the row and column indexes (see the code below and notice that the resulting matrix is a 3 x 5 matrix). Additionally, if you examine the code of train_test_split, you can see that the shape of the train and test matrices comes from the shape of the input matrix, which is determined by the IDs because they are used as the indexes.

import pandas as pd
from implicit.evaluation import train_test_split
from scipy import sparse

df = pd.DataFrame({
    "user_id": [1, 2],
    "game_id": [3, 4],
    "rating": [5, 6]
})

sparse_matrix = sparse.coo_matrix(
    (df.rating.astype(float), (df.user_id, df.game_id)),
)

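# The shape is inferred as (max row index + 1, max column index + 1).
print(sparse_matrix.shape)  # (3, 5)

# The split keeps the shape of the input matrix.
train, test = train_test_split(sparse_matrix)
print(train.shape, test.shape)  # (3, 5) (3, 5)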
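
For the case in the question, where the dataframe is split before the matrices are built, the inferred shapes of the two matrices can differ if the largest ID is missing from one split. A minimal sketch of one way around this is to pass an explicit shape taken from the full dataframe. The toy data, the positional split, and the names n_users, n_items and to_matrix are illustrative assumptions, not part of implicit:

```python
import pandas as pd
from scipy import sparse

# Toy interactions; the IDs were assigned before the split and have gaps.
df = pd.DataFrame({
    "user_id": [0, 2, 2, 5, 7],
    "item_id": [1, 1, 4, 9, 3],
    "count": [3, 1, 2, 5, 1],
})

# Hypothetical split: the first three rows are train, the rest are test.
train_df, test_df = df.iloc[:3], df.iloc[3:]

# Size both matrices from the full dataframe, so a user or item that is
# missing from one split still gets an (empty) row or column there.
n_users = int(df["user_id"].max()) + 1
n_items = int(df["item_id"].max()) + 1


def to_matrix(frame):
    """Build a CSR matrix with the shared shape (illustrative helper)."""
    coo = sparse.coo_matrix(
        (
            frame["count"].to_numpy(dtype=float),
            (frame["user_id"].to_numpy(), frame["item_id"].to_numpy()),
        ),
        shape=(n_users, n_items),
    )
    return coo.tocsr()


train = to_matrix(train_df)
test = to_matrix(test_df)
print(train.shape, test.shape)  # (8, 10) (8, 10)
```

Without the shape argument, the train matrix in this toy example would be inferred as 3 x 5 and the test matrix as 8 x 10, so the rows and columns of the two would no longer line up.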

Hope something in there helps!

Blo0dR0gue commented 1 year ago

Hi @admivsn,

Thank you for your reply. It definitely helped me and is in line with my research and tests from last week. The IDs do not actually have to be numbered consecutively from 0 to X. As a test, I had the IDs generated as in the question and got a score that differed by only 0.0x%. However, this has something to do with the order of the users or items: if I shuffle the user IDs randomly, I also get a different value each time. Due to time constraints, I have not yet been able to find the exact reason for this, but it is not really important because the difference is minimal. Using coo_matrix is also a good idea, if only because it can be constructed more quickly than a csr_matrix.
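
To illustrate the point about non-consecutive IDs: using the raw IDs works, but the matrix is sized by the largest ID, and every unused ID becomes an empty row or column. If that matters, a minimal sketch of one optional remapping uses pd.factorize on the full dataframe before splitting, so train and test share the same codes. The toy data and variable names are assumptions for illustration only:

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "user_id": [10, 42, 42, 1000],
    "item_id": [7, 7, 300, 5],
    "count": [1, 2, 3, 4],
})
counts = df["count"].to_numpy(dtype=float)

# Raw IDs used directly as indexes: valid, but mostly empty.
raw = sparse.coo_matrix(
    (counts, (df["user_id"].to_numpy(), df["item_id"].to_numpy()))
).tocsr()
print(raw.shape)  # (1001, 301)

# pd.factorize maps each distinct ID to a code 0..n-1 (order of first
# appearance) and returns the unique IDs for translating back.
user_codes, user_ids = pd.factorize(df["user_id"])
item_codes, item_ids = pd.factorize(df["item_id"])

compact = sparse.coo_matrix(
    (counts, (user_codes, item_codes)),
    shape=(len(user_ids), len(item_ids)),
).tocsr()
print(compact.shape)  # (3, 3)

# Row 1 / column 1 correspond to user 42 / item 300.
print(user_ids[1], item_ids[1], compact[1, 1])
```

Both matrices hold the same four interactions; only the indexing differs.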

Thanks again for your reply.