Closed by shamanez 1 year ago
You need to create an index file for each downstream dataset, as mentioned in [link]:
python pq.py --dataset your_customized_dataset --gpu_id 0
Thanks, it worked.
One more question. When I read the paper, I saw that we mainly use product titles to represent items and their sequences, so the main argument is that we can learn universal/transferable item representations.
But when we fine-tune, we create an index for each unique item from the BERT embeddings. My question is: can this kind of system address the cold-start problem?
Let's say I get a new item with a different title. Can I still use the same trained model?
I went through the paper but couldn't find the answer. Sorry about that.
Hi, glad to hear that it worked!
I think the model can handle cold-start items. After fine-tuning, we actually have an indexing system (i.e., the PQ centroids), not just a fixed set of indices. When a new item arrives, we first get its BERT feature, then derive its item code (also called the index) using the indexing system of the downstream dataset. The new item's code follows a distribution similar to that of existing items, so it fits the fine-tuned model. You can also refer to Figure 4 in our paper for some analyses of recommendations on cold-start items.
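The step described above — take the new item's BERT feature and assign it a code using the learned PQ centroids — can be sketched roughly as follows. This is a minimal NumPy sketch, not the repo's actual API; the function name, array layouts, and shapes are assumptions for illustration:

```python
import numpy as np

def pq_encode(embedding, centroids):
    """Assign a PQ code to one new item embedding (hypothetical helper).

    centroids: (n_subspaces, n_centroids, sub_dim) array learned during
               fine-tuning on the downstream dataset.
    embedding: (n_subspaces * sub_dim,) BERT feature of the new item.
    Returns one centroid index per subspace, i.e. the item code.
    """
    n_sub, n_cent, sub_dim = centroids.shape
    # Split the full embedding into one sub-vector per PQ subspace.
    subvecs = embedding.reshape(n_sub, sub_dim)
    # In each subspace, pick the nearest centroid by Euclidean distance.
    code = np.array([
        np.argmin(np.linalg.norm(centroids[i] - subvecs[i], axis=1))
        for i in range(n_sub)
    ])
    return code
```

Because the code is computed from the centroids rather than looked up in a fixed item table, a brand-new item only needs its text embedding to be usable with the fine-tuned model.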
In this repo, a detailed tutorial on "adding brand-new items to a fine-tuned model" has not been included yet. I'll consider adding one (maybe a Jupyter notebook) in the future when I have more spare time. Thanks for the great suggestion!
Yeah, that would be very useful; please share a notebook when you can. I guess the main advantage of VQ-Rec is its ability to use titles rather than item-embedding look-up tables.
I am applying this to a custom dataset of e-commerce items. My test set consists of 100K examples and 45K unique items, and I got around 16% Recall@10, similar to the numbers reported in the paper. So kudos for the great repository.
At the same time, I compared the performance with the CORE algorithm and found that the CORE method outperforms this method by 6%.
But again, CORE uses an item embedding table. Is this behavior expected with transferable recsys?
Thanks, will add it later :)
Yes, CORE can perform very well, especially when sessions contain many repeated interactions. The performance of transferable recsys models can also vary depending on the quality of the item text, among other factors, so it's hard to predict without empirical experiments. Besides, how to combine the benefits of both approaches is also being actively studied.
Yeah, I get it. I would love to test VQ-Rec on news and media recommendation, where titles carry a lot more context.
The repository comes with an index file for each dataset, as mentioned here.
What if I have a new dataset? How can I fine-tune? Can I use the same index?