Closed by Blo0dR0gue 1 year ago
Hi @Blo0dR0gue,
A couple of things:
It may be better to use scipy.sparse.coo_matrix instead of scipy.sparse.coo_array; I ran into some issues further downstream when using the latter.
The user_id and item_id values that you feed into scipy.sparse.coo_matrix are interpreted as the row and column indexes (see the code below and notice that the resulting matrix is 3 x 5, that is, (max user_id + 1) x (max game_id + 1)). Additionally, if you examine the code here, you can see that the shape of the train and test matrices comes from the shape of the input matrix, which in turn is determined by the IDs, since they are used as the indexes.
import pandas as pd
from implicit.evaluation import train_test_split
from scipy import sparse
df = pd.DataFrame({
    "user_id": [1, 2],
    "game_id": [3, 4],
    "rating": [5, 6],
})
sparse_matrix = sparse.coo_matrix(
    (df.rating.astype(float), (df.user_id, df.game_id)),
)
sparse_matrix # a 3 x 5 matrix
train_test_split(sparse_matrix) # two 3 x 5 matrices
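If the empty rows and columns created by sparse IDs are a concern, one common option is to map the raw IDs to contiguous codes first. The following is only a minimal sketch under assumptions: the data is hypothetical, and pandas.factorize is one possible way to build the mapping, not something the library requires.

```python
import pandas as pd
from scipy import sparse

# Hypothetical data: IDs are large and non-consecutive.
df = pd.DataFrame({
    "user_id": [10, 500, 10],
    "game_id": [7, 7, 2000],
    "rating": [5.0, 6.0, 4.0],
})

# Raw IDs used as indexes: the shape is (max user_id + 1, max game_id + 1).
raw = sparse.coo_matrix(
    (df.rating.to_numpy(), (df.user_id.to_numpy(), df.game_id.to_numpy())),
)
print(raw.shape)  # (501, 2001)

# Map each distinct ID to a contiguous code, in order of first appearance.
user_codes, user_index = pd.factorize(df.user_id)
game_codes, game_index = pd.factorize(df.game_id)

compact = sparse.coo_matrix((df.rating.to_numpy(), (user_codes, game_codes)))
print(compact.shape)  # (2, 2)

# user_index[i] is the original user_id stored in row i of the compact matrix.
print(user_index.tolist())  # [10, 500]
```

Keeping user_index and game_index around makes it possible to translate matrix rows and columns back to the original IDs later.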
Hope something in there helps!
Hi @admivsn,
Thank you for your reply. It definitely helped and is in line with my research and tests from last week. The IDs do not actually have to be numbered consecutively from 0 to X. As a test, I generated the IDs as described in the question and got a score that differed by only 0.0x%. However, this difference seems to be related to the order of the users or items: if I randomly swap the user IDs, I also get a different value each time. Due to time constraints, I have not yet been able to find the exact reason, but it is not really important because the difference is minimal. Using coo_matrix is also a good idea, if only because it can be constructed more quickly than a csr_matrix.
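As a small sanity check on the relabeling observation, the sketch below (hypothetical data, not taken from the thread) shows that randomly permuting user IDs only reorders the rows of the matrix; the interaction data itself is unchanged. That suggests, as an assumption the thread does not confirm, that the small score differences come from something order-dependent, such as a random split or random initialization, rather than from the IDs themselves.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical interactions: 4 users x 3 items.
users = np.array([0, 0, 1, 2, 3, 3])
items = np.array([0, 2, 1, 1, 0, 2])
counts = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

original = sparse.csr_matrix((counts, (users, items)), shape=(4, 3))

# Randomly relabel users: old ID u becomes new ID perm[u].
perm = rng.permutation(4)
relabeled = sparse.csr_matrix((counts, (perm[users], items)), shape=(4, 3))

# Row perm[u] of the relabeled matrix holds the same data as row u of the original.
same_rows = np.array_equal(relabeled.toarray()[perm], original.toarray())
print(same_rows)  # True
```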
Thanks again for your reply.
Hi @benfred,
If I want to create a csr_matrix from a pandas DataFrame that has the columns user_id, item_id and count, where user_id and item_id are unique (numeric) identifiers for a corresponding customer or product, can I use user_id and item_id directly as row_ind and col_ind, even if they are not consecutive from 0 to X? As in csr_array((df['count'], (df['user_id'], df['item_id'])))?
The reason the IDs are not sequential is that I split the data based on groups within the transactions. This may result in some user_ids or item_ids being missing from the test or training dataset. The IDs are generated before the split so that the products and customers have the same IDs in the training and test datasets.
Or do the IDs need to be sequential from 0 to X, so that I need to generate the IDs for the test dataset first, then map them to the training dataset, and then generate the remaining missing IDs for the training dataset? Thanks for your help.
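One way to keep IDs consistent across the split, sketched here under assumptions, is to build the ID-to-index mapping once on the full data and pass an explicit shape when constructing each matrix. The data, the to_csr helper and the positional split below are hypothetical placeholders; a group-based split would take the place of the iloc slices.

```python
import pandas as pd
from scipy import sparse

# Hypothetical transactions; IDs are arbitrary, non-consecutive integers.
df = pd.DataFrame({
    "user_id": [101, 101, 205, 999, 205],
    "item_id": [40, 77, 40, 13, 13],
    "count": [1, 3, 2, 5, 1],
})

# Build the ID -> index mapping once, on the full data, before splitting.
df["user_idx"], user_ids = pd.factorize(df["user_id"])
df["item_idx"], item_ids = pd.factorize(df["item_id"])
shape = (len(user_ids), len(item_ids))


def to_csr(frame, shape):
    """Build a user x item matrix with a fixed shape.

    IDs missing from this subset simply become empty rows or columns.
    """
    return sparse.csr_matrix(
        (
            frame["count"].to_numpy(dtype=float),
            (frame["user_idx"].to_numpy(), frame["item_idx"].to_numpy()),
        ),
        shape=shape,
    )


# Placeholder split (first three rows vs. the rest).
train = to_csr(df.iloc[:3], shape)
test = to_csr(df.iloc[3:], shape)

print(train.shape, test.shape)  # (3, 3) (3, 3)
```

Because both matrices share the same shape and the same mapping, row i and column j refer to the same customer and product in the training and test data, even when one of them has no interactions in a given subset.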