NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation that works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

Converting item-list prediction to original item-id in the ETL work. #705

Closed: swapnilpanda closed this 1 year ago

swapnilpanda commented 1 year ago

Discussed in https://github.com/NVIDIA-Merlin/Transformers4Rec/discussions/704

Originally posted by **swapnilpanda** May 18, 2023

I have been going through the end-to-end prediction example in Transformers4Rec. In the ETL notebook, the item-ids are shown as 6-digit values:

![image](https://github.com/NVIDIA-Merlin/Transformers4Rec/assets/56393206/94bbfed7-53c0-4f57-9a66-8ffa7d46b46c)

**However, in the inference part, the item-id list we get back contains item-ids of only 1 to 4 digits.**

![image](https://github.com/NVIDIA-Merlin/Transformers4Rec/assets/56393206/68f4581c-a891-41cf-bff7-8d48aac4a9d1)

How do I track back to the original item-ids so that I can actually run inference and check my results? In other words, how is the product id in the original data linked to the item-id in the prediction; how are they mapped? Any help will be appreciated, thanks.
rnyak commented 1 year ago

You can use the `unique.item_id.parquet` file to map back to the original item_ids, but there are a couple of things to keep in mind:

Say you got `[501, 64]` from `model.heads[0].body[0].item_embedding_table.weight.cpu().detach().numpy().shape`. That means you have 500 unique items plus one extra row, and that +1 corresponds to OOV/nulls. You won't have OOVs in your train set, but you might have them in your valid set, and they are mapped to 0 when we apply the `Categorify()` op on the validation set. Note that we also have a padding value in the TF4Rec model, and it is 0 by default. So what you get is the embedding weights for the encoded item_ids in the order `[0, 1, 2, ..., 500]`. Converting these back to the original item_ids is straightforward: just use the `categories/unique.item_id.parquet` file and do the mapping.
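A minimal sketch of that lookup, assuming the default `start_index=0` (so row `i` of the mapping file corresponds to encoded id `i`); the file path and the `encoded_preds` array are illustrative, not from this thread:

```python
import numpy as np
import pandas as pd

# Categorify() writes one row per category, ordered by encoded id, so with
# the default start_index=0 the row position equals the encoded item_id.
unique_items = pd.read_parquet("categories/unique.item_id.parquet")["item_id"]

# Hypothetical encoded ids as they come out of the model.
encoded_preds = np.array([[12, 7, 341], [5, 99, 2]])

# Positional lookup: encoded id -> original item_id.
original_preds = unique_items.to_numpy()[encoded_preds]
print(original_preds)
```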

If we use `start_index=1` in the `Categorify()` op, then the item_id encoding starts from 2, not 1, because we reserve 0 for padding and 1 for OOV + nulls, and the actual item categories start from 2. In that case, you will get `2 + n_unique_items` rows from `model.heads[0].body[0].item_embedding_table.weight.cpu().detach().numpy()`, meaning you get embedding values for `2 + n_unique_items` ids, where the first two are for padding and OOV + nulls. One caveat: our `unique.item_id.parquet` (mapping) file does not reflect the shifted indices when we set `start_index > 0` in `Categorify`, so you need to shift the index accordingly. If you set `start_index=1`, you only need to shift the index of the parquet file by +1.
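A companion sketch for the `start_index=1` case, with the same illustrative names; note the `e - 1` offset that undoes the shift described above:

```python
import numpy as np
import pandas as pd

unique_items = pd.read_parquet("categories/unique.item_id.parquet")["item_id"]

# Hypothetical encoded ids; with start_index=1, real items start at 2.
encoded_preds = np.array([[13, 8, 342], [6, 100, 3]])

# Encoded ids are shifted by +1 relative to the mapping file's row
# positions, so subtract 1 before the positional lookup.
original_preds = np.array(
    [[unique_items.iloc[e - 1] for e in session] for session in encoded_preds]
)
print(original_preds)
```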

Please refer to these tickets for further discussions and code examples:

https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/511

https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/612

swapnilpanda commented 1 year ago

Thanks for the quick response. So if I understand you correctly, this code should give me the correct product_ids:

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("/kaggle/input/etl-try1/categories/unique.prev_items.parquet")
df1 = df["prev_items"]

decoded = []
for next_item in top_preds:  # top_preds: encoded top-k predictions from the model
    decoded.append([df1.iloc[e - 1] for e in next_item])

decoded = np.array(decoded)
```

since the encoded item_id is shifted by +1 relative to the row index of the original parquet file.

NamartaVij commented 1 year ago

Hi swapnil, have you completed this implementation? I have some doubts. Could you please help?

swapnilpanda commented 1 year ago

Yes, I have done this implementation. The function I am using is:

```python
def _decoder(self, recommendation):
    '''Decode encoded sequences back to ASIN ids.'''
    decoded = []
    for next_item in recommendation:
        decoded.append([self.items.iloc[e - 1] for e in next_item])
    decoded = np.array(decoded)
    return decoded
```

where `self.items` is the column loaded from `unique.prev_items.parquet`.

NamartaVij commented 1 year ago

Hi @swapnilpanda

May I know how to do feature aggregation? My dataset is MovieLens, my inputs are userid, movieid, and genre, and the label is rating. How do I do element-wise concat? It would be great if you could help me with this. I followed this approach to create the schema, but when I applied XLNet I got an error, so maybe that is the issue: https://nvidia-merlin.github.io/Merlin/stable/examples/getting-started-movielens/01-Download-Convert.html
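For reference, feature aggregation in Transformers4Rec is typically configured on the input block. A minimal sketch, assuming a Merlin schema exported by the ETL workflow (the schema path and parameter values are illustrative, not from this thread):

```python
import transformers4rec.torch as tr
from merlin_standard_lib import Schema

# Load the schema produced by the NVTabular workflow (path is hypothetical).
schema = Schema().from_proto_text("./workflow/schema.pbtxt")

# aggregation="concat" concatenates all feature embeddings along the last
# dimension; "element-wise-sum" is another supported aggregation mode.
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    continuous_projection=64,
    aggregation="concat",
    d_output=100,
    masking="mlm",
)
```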

vivpra89 commented 1 year ago

@NamartaVij

For decoding both input and output items:

import pandas as pd

```python
import pandas as pd

items = pd.read_parquet("./categories/unique.productId.parquet")["productId"]

# Decode the model's predicted item ids.
decoded_output = []
for next_item in prediction.predictions[0]:
    decoded_output.append([items.iloc[e - 1] for e in next_item])

# Decode the input sessions the same way.
test = pd.read_parquet("./data/sessions_by_day/23/test.parquet")
decoded_input = []
for next_item in list(test["productId-list"]):
    decoded_input.append([items.iloc[e - 1] for e in next_item])

print(len(decoded_output), len(decoded_input))

predictions = pd.DataFrame({"input": decoded_input, "prediction": decoded_output})
```