NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[QST] Problem with defining input module, item embedding table. #773

Closed Fluitketel0 closed 3 months ago

Fluitketel0 commented 3 months ago

❓ Questions & Help

When attempting to configure my model with TabularSequenceFeatures.from_schema(), I encounter an error, which I suspect is related to the setup of the item embedding table. Could anyone point out what I might be doing wrong?

Details

I'm working in the PyTorch 23.12 Docker image, and most of my code follows the End-to-end session-based recommendation notebook and the Model Architectures page.

Here is my NVTabular (nvt) code:

# Load dataset
df = pq.read_table('/workspace/scriptie/data/processed/processedAndTruncated.parquet').to_pandas()
df['priceCategory'] = df['priceCategory'].astype(str)
df = df.rename(columns={'accommodationId': 'item_id'})

# Categorify categorical features
categ_feats = ['engagementType', 'periodId', 'country', 'item_id', 'aquaFun', 'adultOnly', 'forKids',
               'priceCategory']
categorify_op = categ_feats >> nvt.ops.Categorify()

userId = ['userId']
userId_op = userId >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
# Define Groupby Workflow
groupby_feats = userId_op + categ_feats + ['engagementCountLog', 'itemRecencyLog', 'dateHoursLog', 'dayOfYearSin', 'dayOfYearCos']

# Step 2: Define groupby operation to create list columns
groupby_features =  groupby_feats >> nvt.ops.Groupby(
    groupby_cols=['userId'],
    sort_cols=['dateHoursLog'],
    aggs={
        'item_id': ['list', 'count'],
        'engagementType': ['list'],
        'periodId': ['list'],
        'country': ['list'],
        'aquaFun': ['list'],
        'adultOnly': ['list'],
        'forKids': ['list'],
        'priceCategory': ['list'],
        'dateHoursLog': ['list'],
        'itemRecencyLog': ['list'],
        'engagementCountLog': ['list'],
        'dayOfYearSin': ['list'],
        'dayOfYearCos': ['list']
    },
    name_sep='-'
)

# Adding metadata ops
metadata_features = groupby_features >> nvt.ops.AddMetadata(tags=['LIST'])

tagged_item_id = groupby_features['item_id-list'] >> nvt.ops.TagAsItemID() >> nvt.ops.AddMetadata(tags=['ITEM_ID', 'ITEM' ,'CATEGORICAL'])

cont_op = groupby_features['dateHoursLog-list', 'itemRecencyLog-list', 'engagementCountLog-list', 'dayOfYearSin-list', 'dayOfYearCos-list'] >> nvt.ops.AddMetadata(tags=[Tags.CONTINUOUS])

categ_op = groupby_features['engagementType-list', 'periodId-list', 'country-list', 'item_id-list', 'aquaFun-list', 'adultOnly-list', 'forKids-list', 'priceCategory-list', 'item_id-count'] >> nvt.ops.AddMetadata(tags=['CATEGORICAL'])

# add any other workflows
renamedUserId = groupby_features['userId'] >> nvt.ops.Rename(name='user_id')

selected_features =  metadata_features + cont_op + categ_op + tagged_item_id 

# Filter out sessions with length 1
MINIMUM_SESSION_LENGTH = 2
final_workflow_ops = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)

# Create and apply the workflow
workflow = nvt.Workflow(final_workflow_ops)

# Apply the combined workflow in a single fit_transform call
dataset = nvt.Dataset(df)
workflow.fit(dataset)
transformed_dataset = workflow.transform(dataset) 

# Save the transformed dataset with metadata to parquet
transformed_dataset.to_parquet("/workspace/scriptie/data/processed/processed_with_metadata_nvt")
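As a side note, the Filter predicate at the end can be sanity-checked in isolation with plain pandas (a toy frame with hypothetical counts, not the real transformed data):

```python
import pandas as pd

MINIMUM_SESSION_LENGTH = 2

# Toy stand-in for the grouped output: one row per user,
# with "item_id-count" holding the session length (hypothetical values).
df = pd.DataFrame({
    "userId": [1, 2, 3],
    "item_id-count": [1, 3, 2],
})

# Same predicate as the nvt.ops.Filter lambda above.
kept = df[df["item_id-count"] >= MINIMUM_SESSION_LENGTH]
print(sorted(kept["userId"]))  # the length-1 session is dropped
```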

And here is my current model:

from transformers4rec.torch.ranking_metric import NDCGAt, RecallAt

dataset_schema = tr.Schema().from_proto_text("/workspace/scriptie/data/processed/processed_with_metadata_nvt/schema.pbtxt")

max_sequence_length, d_model = 20, 64

inputs = tr.TabularSequenceFeatures.from_schema(
        schema = dataset_schema,
        max_sequence_length= max_sequence_length,
        masking = 'causal',
        continuous_projection=64,
        aggregation="concat",
    )

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length
)

body = tr.SequentialBlock(
    inputs,
    tr.TransformerBlock(
        transformer_config, masking = inputs.masking
    )

)
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True,
                                # metrics=[RecallAt(top_ks=[1, 5, 10], labels_onehot=True),  
                                #         NDCGAt(top_ks=[5, 10], labels_onehot=True)]
                             ),
)
model = tr.Model(head)
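For context on the line the traceback points at: with weight_tying=True, NextItemPredictionTask reuses the item embedding table as the output projection, which is why it must be able to locate item_embedding_table at build time. A rough NumPy sketch of the tying idea (illustrative shapes only, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_model = 100, 64

# Item embedding table: one d_model-dim vector per item id.
item_embeddings = rng.normal(size=(n_items, d_model))

# Weight tying: the prediction head reuses the same matrix, so each
# item's score is the dot product of its embedding with the
# transformer's hidden state for the current position.
hidden = rng.normal(size=(d_model,))
logits = item_embeddings @ hidden

print(logits.shape)  # one logit per candidate item
```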

Running my model code results in KeyError: 'item_id-list'. Here is the full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[151], line 27
     16 transformer_config = tr.XLNetConfig.build(
     17     d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length
     18 )
     20 body = tr.SequentialBlock(
     21     inputs,
     22     tr.TransformerBlock(
   (...)
     25 
     26 )
---> 27 head = tr.Head(
     28     body,
     29     tr.NextItemPredictionTask(weight_tying=True,
     30                                 # metrics=[RecallAt(top_ks=[1, 5, 10], labels_onehot=True),  
     31                                 #         NDCGAt(top_ks=[5, 10], labels_onehot=True)]
     32                              ),
     33 )
     34 model = tr.Model(head)

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py:273, in Head.__init__(self, body, prediction_tasks, task_blocks, task_weights, loss_reduction, inputs)
    270     for task, val in zip(cast(List[PredictionTask], prediction_tasks), task_weights):
    271         self._task_weights[task.task_name] = val
--> 273 self.build(inputs=inputs, task_blocks=task_blocks)

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py:299, in Head.build(self, inputs, device, task_blocks)
    297     if task_blocks and isinstance(task_blocks, dict) and name in task_blocks:
    298         task_block = task_blocks[name]
--> 299     task.build(self.body, input_size, inputs=inputs, device=device, task_block=task_block)
    300 self.input_size = input_size

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/prediction_task.py:386, in NextItemPredictionTask.build(self, body, input_size, device, inputs, task_block, pre)
    384 self.embeddings = inputs.categorical_module
    385 if not self.target_dim:
--> 386     self.target_dim = self.embeddings.item_embedding_table.num_embeddings
    387 if self.weight_tying:
    388     self.item_embedding_table = self.embeddings.item_embedding_table

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/features/embedding.py:94, in EmbeddingFeatures.item_embedding_table(self)
     90 @property
     91 def item_embedding_table(self):
     92     assert self.item_id is not None
---> 94     return self.embedding_tables[self.item_id]

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py:461, in ModuleDict.__getitem__(self, key)
    459 @_copy_to_script_wrapper
    460 def __getitem__(self, key: str) -> Module:
--> 461     return self._modules[key]

KeyError: 'item_id-list'
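One way to read this traceback (a plain-Python sketch with hypothetical data, not the library's actual code): NextItemPredictionTask resolves the item-id column name from the schema's ITEM_ID tag and then looks that name up in the dict of embedding tables, but an embedding table is only built for columns whose schema says they are categorical with a known integer domain, which is what Categorify provides. A tagged but never-Categorified item_id-list would therefore yield a key with no table:

```python
# Hypothetical schema: column name -> (tags, whether Categorify gave it
# an integer domain / cardinality). Not the real Merlin Schema object.
schema = {
    "item_id-list": ({"ITEM_ID", "CATEGORICAL"}, False),  # tagged, not Categorified
    "country-list": ({"CATEGORICAL"}, True),
}

# Embedding tables only get built for columns with a known cardinality.
embedding_tables = {
    name: f"Embedding({name})"
    for name, (tags, has_int_domain) in schema.items()
    if has_int_domain
}

# The prediction task resolves the item-id column from the ITEM_ID tag...
item_id = next(name for name, (tags, _) in schema.items() if "ITEM_ID" in tags)

# ...and the lookup fails, mirroring the ModuleDict lookup above.
try:
    embedding_tables[item_id]
except KeyError as err:
    print(f"KeyError: {err}")
```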

Calling inputs.item_embedding_table on its own raises a KeyError identical to the one above.

rnyak commented 3 months ago

I'm working in the PyTorch 23.12 Docker image

Is that the Merlin image? If not, can you please use the merlin-pytorch:23.08 image? It comes with everything installed, so you don't need to install anything yourself.

Fluitketel0 commented 3 months ago

Is that the Merlin image?

Thanks for the reply. Yes, that is the one I meant; I will try my code with merlin-pytorch:23.08 first thing Monday morning.

Fluitketel0 commented 3 months ago

Thank you for your suggestion @rnyak, but I still receive the same KeyError: 'item_id-list'. Do you have any other ideas?

Fluitketel0 commented 3 months ago

I found the mistake in my code: I did not actually apply categorify_op when defining the groupby features:

# Define Groupby Workflow (wrong: the raw column list, so Categorify is never applied)
groupby_feats = userId_op + categ_feats + ...

# Fixed: use the op node instead of the column list
groupby_feats = userId_op + categorify_op + ...

I had added the raw columns instead of the op. Thanks for your help @rnyak.
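The mix-up is easy to make because NVTabular lets both a plain list of column names and an operator-applied node appear on either side of +. A minimal toy sketch of those semantics (a hypothetical Node class, not NVTabular's implementation) shows why the raw list silently skips the op:

```python
class Node:
    """Hypothetical stand-in for an NVTabular workflow node."""
    def __init__(self, columns, ops=()):
        self.columns = list(columns)
        self.ops = list(ops)

    def __rshift__(self, op):
        # `selector >> op`: apply an operator, producing a new node.
        return Node(self.columns, self.ops + [op])

    def __add__(self, other):
        # `node + node` (or `node + [cols]`) merges branches; a raw
        # list is accepted, but it carries no operators with it.
        if isinstance(other, list):
            other = Node(other)
        return Node(self.columns + other.columns, self.ops + other.ops)

categ_feats = ["item_id", "country"]
categorify_op = Node(categ_feats) >> "Categorify"
userId_op = Node(["userId"]) >> "Categorify"

buggy = userId_op + categ_feats      # raw columns: Categorify never applied
fixed = userId_op + categorify_op    # op node: Categorify carried along

print(buggy.ops, fixed.ops)
```

Both expressions type-check and merge the same columns, which is why nothing fails until the model looks for the item embedding table at build time.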