benfred / implicit

Fast Python Collaborative Filtering for Implicit Feedback Datasets
https://benfred.github.io/implicit/
MIT License
3.57k stars 612 forks source link

Getting poor performance with partial_fit_items #682

Closed franzoni315 closed 1 year ago

franzoni315 commented 1 year ago

Hi! I am using this library in order to get user and item embeddings, which will be used in a clustering algorithm in order to find clusters of users according the their preferences. I already have some good results with these clusters, and now I am trying to make this work with new items and users.

From what I researched, the partial_fit_items (or partial_fit_users) are functions designed to update the factors only for a subset of items or users. This is great, because I want to keep the factors frozen for all items, except for the new ones, since this will guarantee the stability of my embeddings during time.

The problem that I am dealing with is that the partial_fit_items function is not positioning the new item in a meaningful location inside the embedding space. I did the following test to verify this:

  1. I trained a first model with all data that I have -> (model_1)
  2. Select one item that will be removed from the training data.
  3. I trained a second model with data from step 2 -> (model_2)
  4. I updated model_2 with partial_fit_items with the item that was previously removed -> (updated_model_2)
  5. Compare the embeddings produced by model_1 and updated_model_2 for the selected item. I did this in two ways: find the nearest neighbors of the test item, and also reduce the embeddings to 2d in order to have a visual inspection.

I developed a helper function that can update the factors of new items and new users, which is used in step 4:

def partial_fit(model, X, user_ids, item_ids, iterations):
    """
    Updates the item and user factors for selected user and item ids.

    model: previously trained model
    X: sparse matrix of user-item interactions. This contain all interactions, including the ones used in training and also the new/changed items and new/changed users.
    user_ids: list with user_ids whose factors will be updated
    item_ids: list with item_ids whose factors will be updated
    iterations: integer; run the partial_fit for multiple times (maybe this helps convergence?)
    """
    for i in range(iterations):
        model.partial_fit_items(item_ids, X[:, item_ids].transpose())
        model.partial_fit_users(user_ids, X[user_ids])
    return updated_model

I expected that the factors produced by model_1 and updated_model_2 for the selected item would be close to the same items. For instance, imagine that I have data from movies, and there is a subset with Rocky Balboa's movies. With model_1 all Rocky's movies are close to each other, but with updated_model_2, if Rocky I was the selected movie for the test, it falls on a weird position on the embedding space, and the other Rocky movies are no longer closer to it.

Can someone help me to understand if my test procedure makes sense? It would be amazing to make this to work and put this in production! Thanks!

franzoni315 commented 1 year ago

I found out the problem. The X[:, item_ids].transpose() was returning a CSC matrix instead of a CSR (maybe this package could throw an error or a warning if something other than a CSR matrix is passed). Casting it to CSR solved the problem:

X[:, [test_item_id]].transpose().tocsr()

I also created an example notebook with Google Colab showing that the test that I proposed works as expected with the Movie Lens Dataset. It might be of help to anyone trying to use partial_fit_items function.

benfred commented 1 year ago

Passing a CSC would explain the poor results! I'm glad you figured it out. I've added some checks in #683 that will detect if a non-csr matrix is passed to the partial_fit methods - so hopefully other people won't hit this same issue.

I also created an example notebook with Google Colab showing that the test that I proposed works as expected with the Movie Lens Dataset. It might be of help to anyone trying to use partial_fit_items function.

Thanks for creating that notebook! I do have a couple questions about this though:

The most similar items from the updated model are very close to the original model, although the order (by distance) may vary a little bit. Also the distance values are incredibly large for some reason.

Having the distance values be incredibly large seems like a bug =(. I wonder if this is because of issues with the item norms, which are cached between similar_items calls. Can you try with calling new_model._item_norms = new_model._user_norms = None right after calling partial_fit_items to see if this resolves? If this is the root cause, I'll put together a proper fix.

CUDA extension is built, but disabling GPU support because of 'Cuda Error: CUDA driver version is insufficient for CUDA runtime version

Do you know what version of the CUDA driver/runtime is on this machine? (I work at nvidia, so feel like this really is something that I should get working =) )

franzoni315 commented 1 year ago

I tried to remove the cache as you proposed and it worked wonderfully! The distances (or similarities) are very close among the tests employed on the notebook.

Regarding to the CUDA error, did you get this when you tried to use the GPU on the Google Colab instance? I did a quick test here, just by changing the runtime to use a T4 GPU, and used the ALS version for GPU, and the code ran a lot faster!

Also the CUDA version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
benfred commented 1 year ago

Thanks @franzoni315 - I've added a fix here to automatically clear the norms https://github.com/benfred/implicit/pull/685, and will be in the next release, which I will push out later today.

For the CUDA issue, I just tried it out on colab - and it looks like the warning message disabling GPU support because of 'Cuda Error: CUDA driver version is insufficient for CUDA runtime version is only shown on instances without a GPU, and if you select a runtime w/ a GPU it works. It seems like the issue is the CPU runtime still has CUDA installed, but doesn't have a nvidia driver - which means we get this weird error message. It's not ideal , but I can live with it.