NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[QST] Custom dataset with interactions only #741

Closed dcy0577 closed 8 months ago

dcy0577 commented 11 months ago

❓ Questions & Help

Hi, my dataset contains only user-item interactions and timestamps. I noticed that the data used in the session-based examples all contain additional information as features, such as category. Can I use the same logic as in the example code to preprocess my data without adding any additional feature columns? Can the model accept such a data format?

```
user_id:token  item_id:token  timestamp:float
0  0  1681314649
0  0  1681314664
0  0  1681314674
0  0  1681314688
0  1  1681322022
0  1  1681322023
0  1  1681322024
0  1  1681322026
0  1  1681322027
0  1  1681322029
0  1  1681322030
0  1  1681322032
0  1  1681322033
0  1  1681322034
...
```

rnyak commented 11 months ago

@dcy0577 you don't need extra features. You can group your data by user_id, create sequential features, and use the item_id-list column as the only input to the model.

Another option is to create some temporal features, since you already have timestamp data. We showcase some ways of creating temporal features, but those are just examples; you can be creative and engineer your own.
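For illustration, here is a minimal pandas sketch of the groupby step described above, assuming the `item_id-list` / `item_id-count` column naming used in the Merlin examples; the real pipelines use NVTabular's `Groupby` op instead, and the hour-of-day column is just one made-up example of a temporal feature:

```python
import pandas as pd

def build_sequences(df: pd.DataFrame, max_len: int = 20) -> pd.DataFrame:
    """Group interactions by user, ordered by timestamp, into list columns."""
    df = df.sort_values(["user_id", "timestamp"])
    # one illustrative temporal feature derived from the raw unix timestamp
    df["hour"] = pd.to_datetime(df["timestamp"], unit="s").dt.hour
    grouped = (
        df.groupby("user_id")
        .agg(**{
            "item_id-list": ("item_id", list),
            "hour-list": ("hour", list),
            "item_id-count": ("item_id", "size"),
        })
        .reset_index()
    )
    # keep at most the last max_len interactions per user
    for col in ("item_id-list", "hour-list"):
        grouped[col] = grouped[col].apply(lambda s: s[-max_len:])
    return grouped

interactions = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1],
    "item_id": [0, 0, 1, 2, 3],
    "timestamp": [1681314649, 1681314664, 1681322022, 1681314700, 1681314800],
})
sequences = build_sequences(interactions)
```

In the actual examples this transformation is done with `nvt.ops.Groupby` so it scales beyond pandas, but the resulting schema is the same idea: one row per user with list-valued sequence columns.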

NamartaVij commented 11 months ago

@rnyak If I want to extract both the long-term and short-term interests of a user from their interactions, can I still follow a similar approach, using two time windows to define "long-term" and "short-term"?

  1. Group your data by user_id to create sequences of user-item interactions for each user.

  2. Order these sequences by the timestamp to maintain chronological order.

  3. Define a time threshold that separates long-term and short-term interactions, for example 7 days for long-term and 5 days for short-term.

  4. Split the sequences into two parts: one for long-term interactions and one for short-term interactions based on the time threshold. I followed this procedure for my above-mentioned dataset where we don't have sessions: https://nvidia-merlin.github.io/Merlin/v0.7.1/examples/getting-started-movielens/01-Download-Convert.html

My question is: where do I specify this threshold value for the long-term and short-term windows when trying to extract users' long-term and short-term interests separately using XLNet?
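As far as I can tell, the threshold is not a model or trainer argument; it would live in your own preprocessing, producing two datasets you then feed to separate input blocks or models. A hypothetical pandas sketch of step 4, with an illustrative window value and made-up data:

```python
import pandas as pd

SHORT_TERM_WINDOW = 5 * 24 * 3600  # 5 days in seconds (example value only)

def split_by_recency(df: pd.DataFrame, window: int = SHORT_TERM_WINDOW):
    """Split each user's interactions at (their latest timestamp - window)."""
    df = df.sort_values(["user_id", "timestamp"])
    cutoff = df.groupby("user_id")["timestamp"].transform("max") - window
    short_term = df[df["timestamp"] > cutoff]   # recent interactions
    long_term = df[df["timestamp"] <= cutoff]   # older interactions
    return long_term, short_term

interactions = pd.DataFrame({
    "user_id": [0, 0, 0],
    "item_id": [10, 11, 12],
    "timestamp": [1680000000, 1681300000, 1681400000],
})
long_df, short_df = split_by_recency(interactions)
```

After the split, each part can go through the same groupby/sequence-building pipeline as before.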

dcy0577 commented 10 months ago

@rnyak thanks for the answer. Could you please elaborate a bit more on max_session_length? In the data preprocessing part I see:

```python
# Truncate sequence features to the last 20 interacted items
SESSIONS_MAX_LENGTH = 20
groupby_features_truncated = groupby_features_list >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
```

Is the slicing a must? If I understand correctly, shouldn't the max length be the maximum value that appears in item_id-count?
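For context, a negative start index in `ListSlice` keeps the tail of each sequence, so the op above behaves per-sequence like ordinary Python slicing (a plain-Python sketch with made-up data; verify against your NVTabular version):

```python
SESSIONS_MAX_LENGTH = 20

def truncate(seq, max_len=SESSIONS_MAX_LENGTH):
    # equivalent of ListSlice(-max_len): keep at most the last max_len items;
    # sequences shorter than max_len pass through unchanged
    return seq[-max_len:]

long_session = list(range(35))   # 35 interactions -> truncated to last 20
short_session = [1, 2, 3]        # already short enough -> unchanged
```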

Also, in the model configuration part there are several sequence-length parameters:

```python
max_sequence_length, d_model = 20, 320
# Define input module to process tabular input-features and to prepare masked inputs
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=20
)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=20,
    data_loader_engine='merlin',
    num_train_epochs=200,
    dataloader_drop_last=False,
    per_device_train_batch_size=BATCH_SIZE_TRAIN,
    per_device_eval_batch_size=BATCH_SIZE_VALID,
    learning_rate=0.0005,
    fp16=True,
    report_to=[],
    logging_steps=20,
)
```

Do the max_sequence_length and total_seq_length here need to be consistent with SESSIONS_MAX_LENGTH?

rnyak commented 10 months ago

Yes, we expect them to be consistent for the data loader and for the input block: SESSIONS_MAX_LENGTH from preprocessing, max_sequence_length, and total_seq_length should all be set to the same value.