NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[QST] How to construct schema based on a single "items" list feature? #703

Closed · ardulat closed this 1 year ago

ardulat commented 1 year ago

❓ Questions & Help

Details

Hello! First, a bit of context: I am using NVTabular to preprocess data for Transformers4Rec, so I am working on session-based recommendation. Currently, I only have one feature: an "items" list of product IDs (strings). How do I construct the Schema required by transformers4rec.torch.TabularSequenceFeatures?

More context: I went through some of the example notebooks in the Transformers4Rec documentation, but my main issue is the NVTabular preprocessing. I have tried using nvt.Workflow to create a schema from a pandas data frame with an "items" list feature (as in the example), but I get the following:

[screenshot attached: 2023-05-16, 10:00:14 PM]

Instead, I am trying to get something like this:

[screenshot attached: 2023-05-16, 10:00:00 PM]

The item_id-list column has tags marking it as a categorical feature (which TabularSequenceFeatures needs later). How do I get the same tags if I already have an "items" list in my data frame?

rnyak commented 1 year ago

@ardulat can you please share your NVT workflow script, starting from how you read the pandas df?

Secondly, you wrote "if I already have an 'items' list in my data frame": can you please share a screenshot of your raw data frame? Is your items column already a list in your raw df? If yes, then you don't need the nvt.ops.Groupby() step, since that step only exists to create sequential (list) columns. We need more info from your side to help you.
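
Just for reference, Groupby only matters when the raw data still has one row per interaction; a rough sketch of that case is below (the column names are placeholders, not taken from your data):

import nvtabular as nvt

# Only needed for event-level data (one row per click):
# Groupby collapses each session's rows into list columns.
groupby_features = ['session_id', 'item_id', 'timestamp'] >> nvt.ops.Groupby(
    groupby_cols=['session_id'],
    sort_cols=['timestamp'],
    aggs={'item_id': ['list'], 'timestamp': ['first']},
    name_sep='-',
)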

thanks.

ardulat commented 1 year ago

@rnyak sure! Here is the screenshot of my data frame:

[screenshot attached: 2023-05-17, 9:53:18 AM]

I am using "items" column only (it's the concatenation of columns "prev_items" and "next_item").

And here is the script I am using:

import nvtabular as nvt
from nvtabular.ops import TagAsItemID

# train_sessions is the raw pandas DataFrame shown above

# Categorify categorical features
categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1)
sequence_features = categ_feats['items'] >> TagAsItemID()

workflow = nvt.Workflow(sequence_features)
dataset = nvt.Dataset(train_sessions)

workflow.fit_transform(dataset).to_parquet('train_sessions.parquet')

rnyak commented 1 year ago

@ardulat can you try this please? You should give an output directory path to the .to_parquet() method.

# Categorify categorical features
categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1)
sequence_features = categ_feats >> TagAsItemID()

workflow = nvt.Workflow(sequence_features)
dataset = nvt.Dataset(train_sessions)

sessions_gdf = workflow.fit_transform(dataset)

sessions_gdf.to_parquet('./train_sessions')

sessions_gdf.schema

you can also check your processed parquet file and its schema by reading it as a Dataset object:

from merlin.io import Dataset
train = Dataset('./train_sessions/part_0.parquet')
train.schema
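
To double-check the tags (a quick sanity check, not required), you can also filter the schema by tag and make sure the item column shows up under both:

from merlin.io import Dataset
from merlin.schema import Tags

train = Dataset('./train_sessions/part_0.parquet')
# TabularSequenceFeatures selects categorical columns and relies on the item-id tag
print(train.schema.select_by_tag(Tags.ITEM_ID).column_names)
print(train.schema.select_by_tag(Tags.CATEGORICAL).column_names)
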
ardulat commented 1 year ago

@rnyak, thanks for the quick reply! I guess you meant sessions_gdf.output_schema?

Here is the result I get:

[screenshot attached: 2023-05-17, 12:10:43 PM]

So, everything is fine and as expected. However, when I feed it into tr.TabularSequenceFeatures, it fails with KeyError: "attribute 'items' already exists".

Here is the full traceback of the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input> in <cell line: 3>()
      1 max_sequence_length, d_model = 20, 320
      2 # Define input module to process tabular input-features and to prepare masked inputs
----> 3 input_module = tr.TabularSequenceFeatures.from_schema(
      4     schema,
      5     max_sequence_length=max_sequence_length,

8 frames
transformers4rec/torch/features/sequence.py in from_schema(cls, schema, ...)
--> 193         output: TabularSequenceFeatures = super().from_schema(  # type: ignore

transformers4rec/torch/features/tabular.py in from_schema(cls, schema, ...)
--> 172             maybe_categorical_module = cls.EMBEDDING_MODULE_CLASS.from_schema(

transformers4rec/torch/features/embedding.py in from_schema(cls, schema, ...)
--> 207         output = cls(feature_config, item_id=item_id, pre=pre, post=post, aggregation=aggregation)

transformers4rec/torch/features/sequence.py in __init__(self, feature_config, item_id, padding_idx, ...)
---> 66         super(SequenceEmbeddingFeatures, self).__init__(

transformers4rec/torch/features/embedding.py in __init__(self, feature_config, item_id, ...)
---> 86         self.embedding_tables = torch.nn.ModuleDict(embedding_tables)

torch/nn/modules/container.py in __init__(self, modules)
--> 457             self.update(modules)

torch/nn/modules/container.py in update(self, modules)
--> 533                 self[key] = module

torch/nn/modules/container.py in __setitem__(self, key, module)
--> 464         self.add_module(key, module)

torch/nn/modules/module.py in add_module(self, name, module)
--> 602             raise KeyError("attribute '{}' already exists".format(name))

KeyError: "attribute 'items' already exists"
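
Looking at the last frame, it seems the column name itself might be the problem: torch.nn.ModuleDict refuses any key that shadows an existing attribute, and every ModuleDict already has an items() method. A minimal snippet outside of Transformers4Rec (just my guess at the cause) reproduces the same error:

import torch

try:
    # ModuleDict keys become module attributes; 'items' already exists as the
    # dict-style items() method, so add_module() rejects it with this KeyError.
    torch.nn.ModuleDict({'items': torch.nn.Embedding(10, 8)})
except KeyError as e:
    print(e)  # "attribute 'items' already exists"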
rnyak commented 1 year ago

@ardulat please share your model script as well. Thanks.

Also, which Merlin version are you using, and how did you install the Merlin libraries? Thanks.

ardulat commented 1 year ago

@rnyak, I haven't set up the model yet, but here is the script I am using to create the input module for the transformer model:

max_sequence_length, d_model = 20, 320
# Define input module to process tabular input-features and to prepare masked inputs
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

Both NVTabular and Transformers4Rec are version 23.04.00. I installed them in the Google Colab environment with pip: pip install transformers4rec[pytorch,nvtabular]

rnyak commented 1 year ago

@ardulat you can also check out these instructions for installing Merlin on Colab: https://medium.com/nvidia-merlin/how-to-run-merlin-on-google-colab-83b5805c63e0

Renaming your column will solve your issue. Something like below:

SESSIONS_MAX_LENGTH = 5      # change this number whatever number you want to set

categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1) >> nvt.ops.Rename(name='item_id-list')
sequence_features = categ_feats >> nvt.ops.TagAsItemID()

sequence_features_truncated = (sequence_features
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH) 
)

workflow = nvt.Workflow(sequence_features_truncated)
dataset = nvt.Dataset(train_sessions)
sessions_gdf = workflow.fit_transform(dataset)
sessions_gdf.to_parquet('./processed_nvt')

In addition, since you only have one column and no continuous columns in your dataset, you don't need continuous_projection=64 in tr.TabularSequenceFeatures.from_schema(...); you can remove it.
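
Putting it together, the input module call would then look roughly like this (a sketch, assuming the schema is read back from the processed output above and contains only the renamed item_id-list column):

from merlin.io import Dataset
from transformers4rec import torch as tr

# schema produced by the workflow above; only the item-id-tagged 'item_id-list' column
schema = Dataset('./processed_nvt/part_0.parquet').schema

max_sequence_length, d_model = 20, 320
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)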

rnyak commented 1 year ago

@ardulat is this still an issue for you? Did you see my message above?

ardulat commented 1 year ago

@rnyak, thank you! The instructions you sent and the code block above helped. I was able to create TabularSequenceFeatures. I am closing the issue.