I found the problem. `X[:, item_ids].transpose()` was returning a CSC matrix instead of a CSR one (maybe this package could raise an error or a warning if something other than a CSR matrix is passed). Casting it to CSR solved the problem:

```python
X[:, [test_item_id]].transpose().tocsr()
```
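To see what is going on: in SciPy, the transpose of a CSR matrix comes back in CSC format. A quick illustration with a stand-in matrix (`X` here is just a small identity matrix, not the real data):

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.eye(4))  # stand-in user-item matrix

sliced = X[:, [2]].transpose()
print(sliced.format)          # 'csc' -- transposing the CSR slice yields CSC
print(sliced.tocsr().format)  # 'csr' -- the format partial_fit_items expects
```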
I also created an example notebook with Google Colab showing that the test that I proposed works as expected with the MovieLens dataset. It might be of help to anyone trying to use the `partial_fit_items` function.
Passing a CSC would explain the poor results! I'm glad you figured it out. I've added some checks in #683 that will detect if a non-CSR matrix is passed to the partial_fit methods - so hopefully other people won't hit this same issue.
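The check is essentially a format validation along these lines (a rough sketch with an illustrative name - see #683 for the actual code):

```python
from scipy.sparse import csr_matrix

def check_csr(matrix, name):
    # Reject anything that isn't already CSR rather than silently
    # misinterpreting its memory layout.
    if not isinstance(matrix, csr_matrix):
        raise ValueError(f"{name} must be a scipy.sparse.csr_matrix, got {type(matrix)}")
```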
> I also created an example notebook with Google Colab showing that the test that I proposed works as expected with the MovieLens dataset. It might be of help to anyone trying to use the `partial_fit_items` function.
Thanks for creating that notebook! I do have a couple questions about this though:
> The most similar items from the updated model are very close to those of the original model, although the order (by distance) may vary a little bit. Also, the distance values are incredibly large for some reason.
Having the distance values be incredibly large seems like a bug =(. I wonder if this is because of issues with the item norms, which are cached between `similar_items` calls. Can you try calling

```python
new_model._item_norms = new_model._user_norms = None
```

right after calling `partial_fit_items` to see if this resolves it? If this is the root cause, I'll put together a proper fix.
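In context, the workaround looks like this (a sketch - `new_item_id` and `item_users` are placeholders for however you build the new item's interaction rows):

```python
# item_users must be CSR, per the fix above
new_model.partial_fit_items([new_item_id], item_users)

# workaround: drop the cached norms so similar_items recomputes them
# from the updated factors
new_model._item_norms = new_model._user_norms = None

ids, scores = new_model.similar_items(new_item_id, N=10)
```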
> CUDA extension is built, but disabling GPU support because of 'Cuda Error: CUDA driver version is insufficient for CUDA runtime version'
Do you know what version of the CUDA driver/runtime is on this machine? (I work at nvidia, so I feel like this really is something that I should get working =) )
I tried removing the cache as you proposed and it worked wonderfully! The distances (or similarities) are very close across the tests in the notebook.

Regarding the CUDA error, did you get this when you tried to use the GPU on the Google Colab instance? I did a quick test here, just by changing the runtime to a T4 GPU and using the GPU version of ALS, and the code ran a lot faster!
Also, the CUDA version is:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```
Thanks @franzoni315 - I've added a fix to automatically clear the norms in https://github.com/benfred/implicit/pull/685; it will be in the next release, which I will push out later today.
For the CUDA issue, I just tried it out on Colab - and it looks like the warning message `disabling GPU support because of 'Cuda Error: CUDA driver version is insufficient for CUDA runtime version'` is only shown on instances without a GPU; if you select a runtime with a GPU, it works. It seems like the issue is that the CPU runtime still has CUDA installed but doesn't have an NVIDIA driver, which means we get this weird error message. It's not ideal, but I can live with it.
Hi! I am using this library to get user and item embeddings, which will be used in a clustering algorithm to find clusters of users according to their preferences. I already have some good results with these clusters, and now I am trying to make this work with new items and users.
From what I researched, `partial_fit_items` (or `partial_fit_users`) are functions designed to update the factors only for a subset of items or users. This is great, because I want to keep the factors frozen for all items except the new ones, since this guarantees the stability of my embeddings over time.

The problem I am dealing with is that the `partial_fit_items` function is not positioning the new item in a meaningful location inside the embedding space. I did the following test to verify this:

1. Train a model on the full dataset -> (`model_1`)
2. Remove one item from the dataset
3. Train a new model on the reduced dataset -> (`model_2`)
4. Call `partial_fit_items` with the item that was previously removed -> (`updated_model_2`)

I developed a helper function that can update the factors of new items and new users, which is used in step 4 (a sketch is shown below).
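A simplified version of that helper (`update_item_factors` is an illustrative name, and it folds in the CSR cast and norm-clearing fixes from earlier in the thread):

```python
def update_item_factors(model, X, item_ids):
    """Recompute factors for the given items on a trained implicit ALS model.

    X is the full user-item interaction matrix (CSR, users x items);
    item_ids is the list of item ids to (re)fit.
    """
    # Slice out the columns for these items and flip to item-user orientation.
    # The transpose comes back as CSC, so cast to CSR before passing it on.
    item_users = X[:, item_ids].transpose().tocsr()
    model.partial_fit_items(item_ids, item_users)

    # Clear the cached norms so similar_items recomputes them with the
    # updated factors.
    model._item_norms = model._user_norms = None
    return model
```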
I expected that the factors produced by `model_1` and `updated_model_2` for the selected item would be close to the same items. For instance, imagine that I have data from movies, and there is a subset with the Rocky Balboa movies. With `model_1` all the Rocky movies are close to each other, but with `updated_model_2`, if Rocky I was the selected movie for the test, it falls in a weird position in the embedding space and the other Rocky movies are no longer close to it.

Can someone help me understand whether my test procedure makes sense? It would be amazing to make this work and put it in production! Thanks!
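For concreteness, the comparison at the end of the test looks roughly like this (a sketch with placeholder names; `X` is the full user-item CSR matrix, `test_item_id` is the removed item, and `update_item_factors` is the helper sketched above):

```python
from implicit.als import AlternatingLeastSquares

# Train on the full data.
model_1 = AlternatingLeastSquares(factors=64)
model_1.fit(X)

# Train again with the test item's interactions zeroed out.
X_reduced = X.copy()
X_reduced[:, test_item_id] = 0
X_reduced.eliminate_zeros()
model_2 = AlternatingLeastSquares(factors=64)
model_2.fit(X_reduced)

# Re-introduce the removed item via partial_fit_items.
updated_model_2 = update_item_factors(model_2, X, [test_item_id])

# The nearest neighbours should largely agree between the two models.
ids_1, _ = model_1.similar_items(test_item_id, N=10)
ids_2, _ = updated_model_2.similar_items(test_item_id, N=10)
print(set(ids_1) & set(ids_2))
```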