Data clarification - GitHub Issues

ananv21 commented 1 year ago

Hello, for the train.txt, valid.txt, and test.txt files of a dataset, does the 'target' column contain the ID of the most recent item purchased by a user?

Also, what is the item IDs JSON file supposed to contain? In the case of the beauty datasets this file is titled "item_name.jsonl". Is there any code to create this file from my own custom dataset?

mssssss123 commented 1 year ago

A1: The target column holds the item the user most recently interacted with, i.e. the ground-truth item at time t that we want to predict. The user_id column is not actually used. The seq column holds the items the user interacted with from time 1 to time t-1. The ID 0 is used only for padding and has no actual meaning.
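To make the seq/target relationship concrete, here is a minimal sketch of how one user's chronological interactions could be turned into rows of that shape. The function names, the `max_len` parameter, the left-side padding, and the choice to emit one row per prefix are all assumptions for illustration; the thread only confirms that seq covers the items before time t, target is the item at time t, and 0 is padding.

```python
PAD_ID = 0  # per the answer above: ID 0 is padding only and is not a real item


def strip_padding(seq):
    """Drop padding IDs so only real item IDs remain."""
    return [item for item in seq if item != PAD_ID]


def make_examples(interactions, max_len):
    """Turn one user's chronological item IDs into (seq, target) rows.

    Hypothetical helper: left padding and one row per prefix are assumptions.
    """
    rows = []
    for t in range(1, len(interactions)):
        history = interactions[:t][-max_len:]          # items before time t
        padded = [PAD_ID] * (max_len - len(history)) + history
        rows.append({"seq": padded, "target": interactions[t]})
    return rows


rows = make_examples([5, 9, 3, 7], max_len=3)
for row in rows:
    print(row["seq"], "->", row["target"])
```

For the interaction list `[5, 9, 3, 7]`, the last row has seq `[5, 9, 3]` and target `7`, and `strip_padding` recovers the real history from any padded row.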

A2: gen_all_items.py generates a JSON file containing the tokenized text of every item. build_train.py then reads this file when selecting positive and negative sample items, which avoids tokenizing the same item repeatedly. build_train.py itself builds the tokenized-text versions of the training and validation sets.

We do this because the OpenMatch framework we use only supports tokenized input text.
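The idea of tokenizing every item once and reusing the result can be sketched as follows. This is not the code from gen_all_items.py or build_train.py: `ToyTokenizer`, `build_item_cache`, `build_example`, and the output field names are hypothetical stand-ins, and a real pipeline would use the model's actual tokenizer and the file layout the repository expects.

```python
import json


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: maps each word to an integer ID."""

    def __init__(self):
        self.vocab = {}
        self.calls = 0  # counts how often encode() runs

    def encode(self, text):
        self.calls += 1
        return [self.vocab.setdefault(w, len(self.vocab) + 1)
                for w in text.lower().split()]


def build_item_cache(item_texts, tokenizer):
    """Tokenize every item text exactly once, keyed by item ID (as a string for JSON)."""
    return {str(item_id): tokenizer.encode(text)
            for item_id, text in item_texts.items()}


def build_example(seq, target, negatives, cache):
    """Assemble one training example from cached tokens; ID 0 is padding and is skipped."""
    return {
        "query": [cache[str(i)] for i in seq if i != 0],
        "positive": cache[str(target)],
        "negatives": [cache[str(n)] for n in negatives],
    }


items = {1: "rose face cream", 2: "mint lip balm", 3: "rose hand soap"}
tok = ToyTokenizer()

# Round-trip through JSON to mimic writing the cache to disk and reading it back.
cache = json.loads(json.dumps(build_item_cache(items, tok)))

example = build_example(seq=[0, 1, 2], target=3, negatives=[2], cache=cache)
print(example)
```

However many examples are built afterwards, the tokenizer runs only once per item, which is the saving the answer describes.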

Hopefully these will help you structure your data better. Contact us anytime if you have any questions :)

ananv21 commented 1 year ago

Thank you for your help!

OpenMatch / TASTE

Data clarification #6