Jyonn / Legommenders

A modular recommendation system that allows the selection of different components to be combined into a new recommender.
MIT License

Question about data processing code #1

Closed chiyuzhang94 closed 9 months ago

chiyuzhang94 commented 9 months ago

Hi @Jyonn

I wonder why you have the 5th and 6th steps in the processing code. They seem to randomly remove users and items. Could you explain them? https://github.com/Jyonn/Legommenders/tree/325c59cee6e3a658e0e707c3dbb12dbeb20544eb/process/goodreads

Best, Chiyu

Jyonn commented 9 months ago

Hi Chiyu. We filter out items and users with fewer than ten interactions. The core code is the trim function.

Jyonn commented 9 months ago

Sorry. The random removal is to get a smaller dataset, so that the OpenAI API cost is reduced and model training is accelerated.

chiyuzhang94 commented 9 months ago

Hi @Jyonn,

Thanks for the clarification.

I am trying to understand what you did in https://github.com/Jyonn/Legommenders/blob/master/process/goodreads/4_analyse_session.py

I wonder what vocabulary size you want to check. I didn't see that you loaded the book information file. What do you drop in this step?

Jyonn commented 9 months ago

Here we have two filterings.

First, we filter out items with too few interactions (<5) or too many interactions (>10,000). Next, we cut the user sequences to make them shorter; the final goal is to make every sequence shorter than 100. Overly long sequences also hurt training efficiency.

These two steps run iteratively until the dataset is stable.

The reason we set max_seq_len = max_seq_len * 0.2 is that, if we hard-coded max_seq_len = 100, some items in the sequences might be filtered out in the next iteration, making many sequences shorter than 100. So we decrease the length dynamically.
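For intuition, here is a minimal sketch of this two-step iterative filtering (not the repo's actual code; the data layout, names, and the fixed cutoff are illustrative assumptions):

from collections import Counter

MIN_ITEM_INTER, MAX_ITEM_INTER = 5, 10000

def filter_until_stable(sessions, max_seq_len=100):
    # sessions: dict mapping user id -> list of interacted item ids
    while True:
        changed = False

        # step 1: drop items with too few (<5) or too many (>10000) interactions
        counts = Counter(item for seq in sessions.values() for item in seq)
        valid = {i for i, c in counts.items() if MIN_ITEM_INTER <= c <= MAX_ITEM_INTER}
        for uid, seq in sessions.items():
            kept = [i for i in seq if i in valid]
            if len(kept) != len(seq):
                sessions[uid] = kept
                changed = True

        # step 2: cut over-long user sequences down toward the target length
        for uid, seq in sessions.items():
            if len(seq) > max_seq_len:
                sessions[uid] = seq[-max_seq_len:]
                changed = True

        # drop users whose sequences became empty
        for uid in [u for u, seq in sessions.items() if not seq]:
            del sessions[uid]
            changed = True

        if not changed:  # "stable": neither step modified the data any more
            return sessions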

Jyonn commented 9 months ago

The book information is not parsed or tokenized in this file. The first 6 scripts all process the interaction file. Every book id in the interactions is added to the book_vocab, which counts the number of interactions for each book.

Jyonn commented 9 months ago

I also noticed that the third script (3_truncate_session) is not actually used. I was not aware of this before, since the code was written several months ago.

chiyuzhang94 commented 9 months ago

Here we have two filterings.

First, we filter out items with too few interactions (<5) or too many interactions (>10,000). Next, we cut the user sequences to make them shorter; the final goal is to make every sequence shorter than 100. Overly long sequences also hurt training efficiency.

These two steps run iteratively until the dataset is stable.

The reason we set max_seq_len = max_seq_len * 0.2 is that, if we hard-coded max_seq_len = 100, some items in the sequences might be filtered out in the next iteration, making many sequences shorter than 100. So we decrease the length dynamically.

Thanks. So, in this step, you filter out both books and users, right? I also wonder what you mean by "stable".

chiyuzhang94 commented 9 months ago

I also noticed that the third script (3_truncate_session) is not actually used. I was not aware of this before, since the code was written several months ago.

Do you mean you did not use this script in your experiments? It seems to truncate the history and negative length to 50.

chiyuzhang94 commented 9 months ago

One more question:

I found that the 5th and 6th steps use the input function (for example) to take some numbers during the iteration. I don't know what this is for or what numbers I should give.

Jyonn commented 9 months ago

Here we have two filterings. First, we filter out items with too few interactions (<5) or too many interactions (>10,000). Next, we cut the user sequences to make them shorter; the final goal is to make every sequence shorter than 100. Overly long sequences also hurt training efficiency. These two steps run iteratively until the dataset is stable. The reason we set max_seq_len = max_seq_len * 0.2 is that, if we hard-coded max_seq_len = 100, some items in the sequences might be filtered out in the next iteration, making many sequences shorter than 100. So we decrease the length dynamically.

Thanks. So, in this step, you filter out both books and users, right? I also wonder what you mean by "stable".

Yes. By stable, I mean that after the first step's processing, the second step makes no further changes to the data, and vice versa. For example, when cutting off a sequence (step 2), some items (books) may end up with fewer than 5 interactions, so step 1 needs to be run again.

Jyonn commented 9 months ago

I also noticed that the third script (3_truncate_session) is not actually used. I was not aware of this before, since the code was written several months ago.

Do you mean you did not use this script in your experiments? It seems to truncate the history and negative length to 50.

Yes. All the truncation is performed in the 4th to 6th scripts.

Jyonn commented 9 months ago

One more question:

I found that the 5th and 6th steps use the input function (for example) to take some numbers during the iteration. I don't know what this is for or what numbers I should give.

The given numbers are the numbers of items/users to be randomly removed.

Right now I am not sure about the specific numbers we gave. Since these two scripts contain random procedures, you may not obtain exactly the same dataset used in the paper, but you can get a Goodreads dataset of similar scale with these scripts.

For example, we have 16K items and 23K users in our generated dataset. Suppose the original numbers of items and users are 30K and 60K. Each time, you can remove 2K users/items. A smaller number per iteration makes the two-step filtering more stable.

The entire procedure would be: 30K → remove 2K (1st-round input) → remove 1K (1st-round self-filtering) → 27K remaining → remove 2K (2nd-round input) → remove 2K (2nd-round self-filtering) → 23K remaining → ... → 16K remaining after the final round.
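As a rough illustration of how the prompted numbers are used (a hypothetical sketch, not the scripts' actual code), each round removes the requested number of random users and then re-runs the stability filtering:

import random

def shrink_dataset(sessions, refilter):
    # sessions: dict of user id -> item id list; refilter: a two-step stability filter like the one above
    while True:
        n = int(input('How many users to randomly remove this round (0 to stop)? '))
        if n <= 0:
            return sessions
        for uid in random.sample(list(sessions), min(n, len(sessions))):
            del sessions[uid]
        sessions = refilter(sessions)  # may remove further users/items (self-filtering)
        print(f'{len(sessions)} users remaining')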

Another option is to download our generated dataset via this link.

chiyuzhang94 commented 9 months ago

Got it. I would like to ask if you can share the datasets that were generated at the 7th step, such as **_inter.csv, user.csv, neg.csv, book.csv.

Jyonn commented 9 months ago

Got it. I would like to ask if you can share the datasets that were generated at the 7th step, such as **_inter.csv, user.csv, neg.csv, book.csv.

I'd love to. I copied the tokenized data to different GPU machines to run experiments, but the original text-based files were only processed and stored on one machine, which has been broken since early November, so they are currently inaccessible. I can release the data once the machine is repaired.

chiyuzhang94 commented 9 months ago

OK, I see. I hope it will be fixed soon. For the processed data, I don't know how to load the .dat and .npy files. Can you explain how to read them?

Jyonn commented 9 months ago

You should first install the unitok package:

pip install unitok

Next, you can read the data by:

from UniTok import UniDep

depot = UniDep('/path/to/folder')
print(depot[0])

You can convert those IDs to tokens by:

title_vocab = depot.cols['title'].voc.vocab
title = depot[0]['title']
print(' '.join(list(map(title_vocab.i2o, title))))

Jyonn commented 9 months ago

We will release the documentation of the UniTok package soon...

Here are some useful functions:

  1. You can use len(depot) to get the sample size.
  2. All columns (attributes) are listed in depot.cols.
  3. All vocabs are listed in depot.vocs.
  4. Union method. Here the training data is constructed in (userID, bookID, click) format, and the book data is constructed in (bookID, title, ...) format. You can use the union method, train_depot.union(book_depot), so that train_depot[0] will be in (userID, bookID, click, title, ...) format; see the sketch below.
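Putting these calls together, a minimal usage sketch might look like the following (paths are placeholders, and the column layout follows the formats described above):

from UniTok import UniDep

train_depot = UniDep('/path/to/train')   # samples in (userID, bookID, click) format
book_depot = UniDep('/path/to/book')     # samples in (bookID, title, ...) format

print(len(train_depot))    # sample size
print(train_depot.cols)    # all columns (attributes)
print(train_depot.vocs)    # all vocabs

# attach book attributes to each training sample
train_depot.union(book_depot)
print(train_depot[0])      # now in (userID, bookID, click, title, ...) format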
chiyuzhang94 commented 9 months ago

You should first install the unitok package:

pip install unitok

Next, you can read the data by:

from UniTok import UniDep

depot = UniDep('/path/to/folder')
print(depot[0])

You can convert those IDs to tokens by:

title_vocab = depot.cols['title'].voc.vocab
title = depot[0]['title']
print(' '.join(list(map(title_vocab.i2o, title))))

Can I convert the book and user indices back to the original IDs in the Goodreads dataset?

chiyuzhang94 commented 9 months ago

I tried to load your processed dataset. I cannot find the title field.

depot.cols gave me this:

{'index': <UniTok.meta.Col at 0x111779590>,
 'uid': <UniTok.meta.Col at 0x111a28490>,
 'bid': <UniTok.meta.Col at 0x1117861d0>,
 'click': <UniTok.meta.Col at 0x111786610>}

I also wonder which field contains the user history sequence and which contains the prediction impression set. Could you give more guidance? Thanks.

Jyonn commented 9 months ago

Convert back:


depot = UniDep('/path/to/train-set')
sample = depot[0]

uid, bid = sample['uid'], sample['bid']

uid_vocab = depot.cols['uid'].voc.vocab
bid_vocab = depot.cols['bid'].voc.vocab

# original user id
print(uid_vocab[uid])
# original book id
print(bid_vocab[bid])

Jyonn commented 9 months ago

I tried to load your processed dataset. I cannot find the title field.

depot.cols gave me this:

{'index': <UniTok.meta.Col at 0x111779590>,
 'uid': <UniTok.meta.Col at 0x111a28490>,
 'bid': <UniTok.meta.Col at 0x1117861d0>,
 'click': <UniTok.meta.Col at 0x111786610>}

I also wonder which field contains the user history sequence and which contains the prediction impression set. Could you give more guidance? Thanks.

User history is not stored in the training set, but in the user folder.

depot = UniDep('/path/to/user')
user = depot[0]
history = user['history']

# the history vocab is actually bid_vocab
vocab = depot.cols['history'].voc.vocab

# get the original history (book IDs) of the current user
print([vocab[i] for i in history])

Jyonn commented 9 months ago

In the Goodreads dataset, we use the uid column as the impression column.

chiyuzhang94 commented 9 months ago

Thanks. I can reconstruct the dataset to the original text now.

chiyuzhang94 commented 9 months ago

I have a question about the user data.

{'uid': 4, 'history': [2968, 6566, 1539, 3332, 6053, 13605], 'neg': [5520, 12919, 5420, 13363, 2647, 5925, 13918, 11065, 3438, 11687, 16762, 2938, 7007, 1068, 2431, 6881, 901, 1994, 16373]}

My understanding is that history is all of a user's positive behavior history and neg is all negative history. Did you use the negatives in your ONCE paper? I think the user side only uses positives to encode users. Do I understand correctly?

Jyonn commented 9 months ago

Partially correct. history is used for user modeling, and neg is used for matching tasks.

Models are trained on tuples of the form (user_id, pos_item_id, neg_item_id_1, ..., neg_item_id_K). After item and user encoding, we obtain K+1 scores by dot product. Cross-entropy loss is used to optimize the model (the positive score should be larger than the negative ones).
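As a rough sketch of this objective (not the repository's actual training code; PyTorch and random embeddings are assumed here, with the positive item placed at index 0 of each candidate list):

import torch
import torch.nn.functional as F

B, K, D = 32, 4, 64                   # batch size, negatives per positive, embedding dim

user_emb = torch.randn(B, D)          # output of the user encoder
item_emb = torch.randn(B, K + 1, D)   # index 0 = positive item, 1..K = sampled negatives

# K+1 dot-product scores per user
scores = torch.einsum('bd,bkd->bk', user_emb, item_emb)

# cross-entropy with the positive always at index 0
target = torch.zeros(B, dtype=torch.long)
loss = F.cross_entropy(scores, target)
print(loss.item())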

chiyuzhang94 commented 9 months ago

Partially correct. history is used for user modeling, and neg is used for matching tasks.

Models are trained on tuples of the form (user_id, pos_item_id, neg_item_id_1, ..., neg_item_id_K). After item and user encoding, we obtain K+1 scores by dot product. Cross-entropy loss is used to optimize the model (the positive score should be larger than the negative ones).

I see. Are these negatives in the user file the same as the negatives (with click=0) in the train/dev/test files?

Jyonn commented 9 months ago

I see. Are these negatives in the user file the same as the negatives (with click=0) in the train/dev/test files?

Yes. You should filter out all the negative interactions when using matching-based models.

|               | Matching Models         | Ranking Models |
|---------------|-------------------------|----------------|
| Examples      | NAML, NRMS              | DCN, DeepFM    |
| Training Data | Select positive samples | Full dataset   |
| Neg Data      | Used                    | Not used       |

chiyuzhang94 commented 9 months ago

Hi @Jyonn,

I still have questions about the negatives. I found that the negatives in the user file are not the same as the negatives (with click=0) in the train/dev/test files. For example, for the following user, the user neg set has 4 more negatives than the training set: '346952', '22522293', '25564446', '18089930'. Overall, there are 78,502 negatives not in the training file. So, do you sample the K negatives from the training file or from the user neg set in the user file?

User ID: 9d95d0956601ed05785c48fdada5f8af
Negative interactions in Train set: ['823411', '12079', '20701984', '312881', '10560331', '831367', '66658', '16062210', '19469', '25467698', '85413', '93981', '336249', '28815', '286957', '716696', '104191', '13932']
Negative interactions in User Neg set: ['823411', '12079', '20701984', '312881', '10560331', '831367', '66658', '16062210', '19469', '25467698', '85413', '93981', '336249', '28815', '286957', '716696', '104191', '13932', '346952', '22522293', '25564446', '18089930']

I also wonder how you split the positives into user history and prediction impressions. Any details?

Thanks.

Jyonn commented 9 months ago

Please refer to process/goodreads/7_build_dataset.py. We also perform negative sampling on Line 80 to ensure each user has at least 10 negative samples. For all the baselines in the main paper, i.e., the matching-based methods, we did not use the negative samples provided by the interaction file, but used the neg set, as illustrated in the table above.
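For intuition, padding the neg set to a minimum size might look roughly like this (a hypothetical sketch, not the code in 7_build_dataset.py):

import random

def pad_negatives(neg_items, pos_items, all_items, min_neg=10):
    # ensure a user has at least `min_neg` negatives by sampling items they never interacted with
    neg = list(neg_items)
    if len(neg) < min_neg:
        pool = list(set(all_items) - set(pos_items) - set(neg))
        neg += random.sample(pool, min(min_neg - len(neg), len(pool)))
    return neg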

Jyonn commented 9 months ago

It is also important to note that negative sampling is common practice for matching-based methods, but not for ranking-based methods. You can also refer to the loader/resampler.py file, line 86, where we perform dynamic negative sampling if the current real-negative count is less than required.

chiyuzhang94 commented 9 months ago

Please refer to process/goodreads/7_build_dataset.py. We also perform negative sampling on Line 80 to ensure each user has at least 10 negative samples. For all the baselines in the main paper, i.e., the matching-based methods, we did not use the negative samples provided by the interaction file, but used the neg set, as illustrated in the table above.

OK, I see. You also split the negatives into train, dev, and test. Could you explain how you split the positives into user history and prediction impressions?

Jyonn commented 9 months ago

For the core code, please refer to the split_inter method in process/goodreads/7_build_dataset.py. We randomize the history length and the number of interaction samples for each set.
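The idea, roughly, is something like the following (a hypothetical sketch; the actual logic lives in split_inter, and the cut points here are only illustrative):

import random

def split_user(pos_items):
    # split one user's positive items into history vs. impression sets with a random history length
    items = list(pos_items)
    random.shuffle(items)

    hist_len = random.randint(1, max(1, len(items) - 1))   # randomized history length
    history, impressions = items[:hist_len], items[hist_len:]

    # randomize how many impression samples go to each of train/dev/test
    random.shuffle(impressions)
    cut1 = random.randint(0, len(impressions))
    cut2 = random.randint(cut1, len(impressions))
    train, dev, test = impressions[:cut1], impressions[cut1:cut2], impressions[cut2:]
    return history, train, dev, test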

Jyonn commented 9 months ago

I will close this issue, as it already contains many questions. Feel free to open new issues if you run into other problems.