Closed swapnilpanda closed 1 year ago
You can use the unique.item_id.parquet file to map back to the original item_ids, but there are a couple of things to keep in mind:
Say you got [501, 64] from model.heads[0].body[0].item_embedding_table.weight.cpu().detach().numpy().shape. That means you have 500 unique items plus one extra row, and this extra row corresponds to OOV/nulls. You won’t have OOVs in your train set, but you might in your valid set, and they are mapped to 0 when we apply the Categorify() op on the validation set. Note that the TF4Rec model also has a padding value, which is 0 by default. So what you get is basically the embedding weights for the encoded item_ids in the order [0, 1, 2, ..., 500]. Converting these to the original item_ids is straightforward: just use the categories/unique.item_id.parquet file and do the mapping.
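As a rough illustration of the layout described above, the sketch below pairs each embedding row with an original id and skips the reserved row 0. The helper name `embeddings_by_original_id`, the toy weights, and the toy `mapping` list are all made up for the example. It assumes that position i of the mapping corresponds to encoded id i, which may not hold for every NVTabular version, so check it against your own mapping file.

```python
def embeddings_by_original_id(weights, mapping, n_reserved=1):
    """Pair embedding rows with original ids, skipping reserved rows.

    Assumption (hypothetical layout): mapping[i] is the original id for
    encoded id i, and the first `n_reserved` rows are padding/OOV/nulls.
    `weights` can be a NumPy array or a list of rows.
    """
    if len(weights) != len(mapping):
        raise ValueError("embedding table and mapping differ in length")
    return {mapping[i]: weights[i] for i in range(n_reserved, len(mapping))}


# Toy stand-ins: 3 items plus 1 reserved row, embedding dimension 2.
weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mapping = [None, "item_a", "item_b", "item_c"]

item_vectors = embeddings_by_original_id(weights, mapping)
print(sorted(item_vectors))
```

With real data, `weights` would be the array returned by the `.numpy()` call above and `mapping` the id column read from the parquet file.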
If we use start_index=1 in the Categorify() op, then the item_id encoding starts from 2, not 1, because we reserve 0 for padding and 1 for OOV+nulls, and the other item categories start from 2. In that case, model.heads[0].body[0].item_embedding_table.weight.cpu().detach().numpy() has 2 + n_unique_items rows, where the first two are for padding and OOV+nulls. One caveat here is that our unique.item_id parquet file (the mapping file) does not reflect the shifted indices when we set start_index > 0 in Categorify, so you need to shift the index accordingly. You only need to shift the index of the parquet file by +1 if you set start_index=1.
Please refer to these tickets for further discussions and code examples:
https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/511
https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/612
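The index shift described above could be sketched roughly as follows. The helper `decode_ids` and the toy `items` list are hypothetical, and the example assumes that row 0 of the mapping file holds the OOV/null entry, so that with start_index=1 an encoded id e points at mapping row e - 1. Ids that fall outside the mapping (such as padding) decode to None, because a negative position would otherwise wrap around to the end of the list.

```python
def decode_ids(encoded_rows, items, start_index=0):
    """Map encoded item ids back to original ids.

    `items` is the mapping column in file order (a pandas Series or a list);
    `start_index` is the value that was passed to Categorify().
    Assumption: row 0 of the mapping is the OOV/null entry.
    """
    lookup = list(items)
    decoded = []
    for row in encoded_rows:
        out = []
        for e in row:
            # Undo the shift that the mapping file does not reflect.
            pos = int(e) - start_index
            out.append(lookup[pos] if 0 <= pos < len(lookup) else None)
        decoded.append(out)
    return decoded


# Toy mapping: row 0 stands in for the OOV/null entry.
items = [None, "B001", "B002", "B003"]
result = decode_ids([[2, 3], [4, 0]], items, start_index=1)
print(result)  # [['B001', 'B002'], ['B003', None]]
```

With real data, `items` would be the id column of the categories parquet file, for example one read with `pd.read_parquet`.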
Thanks for the quick response. So, if I understand you correctly, this code should give me the correct product_ids:
    import numpy as np
    import pandas as pd

    df = pd.read_parquet("/kaggle/input/etl-try1/categories/unique.prev_items.parquet")
    df1 = df['prev_items']
    decoded = []
    for next_item in top_preds:
        decoded.append([df1.iloc[e-1] for e in next_item])
    decoded = np.array(decoded)
since the given item_id is +1 relative to the original parquet file.
Hi swapnil, have you completed this implementation? I have some doubts. Could you please help?
Yes, I have done this implementation. The function I am using is:
    def _decoder(self, recommendation):
        '''Decode sequences to ASIN ids.'''
        decoded = []
        for next_item in recommendation:
            decoded.append([self.items.iloc[e-1] for e in next_item])
        decoded = np.array(decoded)
        return decoded
where items is loaded from unique.prev_items.parquet.
Hi @swapnilpanda,
may I know how to do feature aggregation? My dataset is MovieLens; my inputs are userid, movieid, and genre, and the label is rating. How do I do the element-wise concat? It would be great if you could help me with this, as I am getting an error and this might be the cause. I followed this approach to create the schema, but when I applied XLNet I got an error: https://nvidia-merlin.github.io/Merlin/stable/examples/getting-started-movielens/01-Download-Convert.html
@NamartaVij
For decoding both input and output items:
    import pandas as pd

    items = pd.read_parquet("./categories/unique.productId.parquet")['productId']
    decoded_output = []
    for next_item in prediction.predictions[0]:
        decoded_output.append([items.iloc[e-1] for e in next_item])

    test = pd.read_parquet('./data/sessions_by_day/23/test.parquet')
    decoded_input = []
    for next_item in list(test['productId-list']):
        decoded_input.append([items.iloc[e-1] for e in next_item])

    print(len(decoded_output), len(decoded_input))
    predictions = pd.DataFrame({'input': decoded_input, 'prediction': decoded_output})
Discussed in https://github.com/NVIDIA-Merlin/Transformers4Rec/discussions/704