Closed: ardulat closed this issue 1 year ago
@ardulat can you please share your NVT workflow script, starting from how you read the pandas df?
Secondly, you wrote "if I already have an "items" list in my data frame?". Can you please share a screenshot of your raw data frame? Is your items column already a list in your raw df? If yes, then you don't need the nvt.ops.Groupby() step, since that step is only there to create sequential (list) columns. We need more info from your side to help you.
Thanks.
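To illustrate the point about list columns: a grouping step is only needed when the raw data has one row per interaction. The sketch below uses plain Python rather than NVTabular, and the `rows` data and both helper functions are hypothetical, made up for illustration.

```python
from collections import defaultdict

# Hypothetical interaction log: one row per (session, item) event.
rows = [
    {"session_id": 1, "item_id": "B01"},
    {"session_id": 1, "item_id": "B07"},
    {"session_id": 2, "item_id": "B03"},
    {"session_id": 1, "item_id": "B02"},
]


def group_into_sessions(rows):
    """Collect item IDs per session, preserving the order of appearance."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row["item_id"])
    return dict(sessions)


def is_list_column(values):
    """Return True if every value in the column is already a list."""
    return all(isinstance(v, list) for v in values)


sessions = group_into_sessions(rows)

# The flat column holds single strings, so it would need grouping first.
flat_is_list = is_list_column([r["item_id"] for r in rows])
# The grouped column already holds one list per session.
grouped_is_list = is_list_column(list(sessions.values()))
```

If a check like `is_list_column` already passes on the raw "items" column, the data is in the per-session shape and the grouping step can be skipped.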
@rnyak sure! Here is the screenshot of my data frame:
I am using the "items" column only (it is the concatenation of the "prev_items" and "next_item" columns).
And here is the script I am using:
# Categorify categorical features
categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1)
sequence_features = categ_feats['items'] >> TagAsItemID()
workflow = nvt.Workflow(sequence_features)
dataset = nvt.Dataset(train_sessions)
workflow.fit_transform(dataset).to_parquet('train_sessions.parquet')
@ardulat can you try this please? You should pass an output directory path to the .to_parquet() method.
# Categorify categorical features
categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1)
sequence_features = categ_feats >> nvt.ops.TagAsItemID()
workflow = nvt.Workflow(sequence_features)
dataset = nvt.Dataset(train_sessions)
sessions_gdf = workflow.fit_transform(dataset)
sessions_gdf.to_parquet('./train_sessions')
sessions_gdf.schema
you can also check your processed parquet file and its schema by reading it as a Dataset object:
from merlin.io import Dataset
train = Dataset('./train_sessions/part_0.parquet')
train.schema
@rnyak, thanks for the quick reply! I guess you meant sessions_gdf.output_schema?
Here is the result I get:
So, everything is fine and as expected. However, when I feed it to tr.TabularSequenceFeatures, it gives me an error: KeyError: "attribute 'items' already exists".
@ardulat please share your model script as well.. thanks.
also what Merlin version are you using? how did you install Merlin libraries? thanks.
@rnyak, I didn't set up the model yet, but here is the script I am using to create the input module for the transformers model:
max_sequence_length, d_model = 20, 320
# Define input module to process tabular input-features and to prepare masked inputs
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)
The version of both NVTabular and Transformers4Rec is 23.04.00. I installed it in the Google Colab environment with the pip command: pip install transformers4rec[pytorch,nvtabular]
@ardulat you can also check out these instructions: https://medium.com/nvidia-merlin/how-to-run-merlin-on-google-colab-83b5805c63e0 to install merlin on colab.
Renaming your column should solve your issue, something like below:
SESSIONS_MAX_LENGTH = 5  # set this to whatever maximum session length you want
# Encode the item strings as integer IDs and rename the column to 'item_id-list'
categ_feats = ['items'] >> nvt.ops.Categorify(start_index=1) >> nvt.ops.Rename(name='item_id-list')
sequence_features = categ_feats >> nvt.ops.TagAsItemID()
# Keep only the last SESSIONS_MAX_LENGTH items of each session
sequence_features_truncated = sequence_features >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
workflow = nvt.Workflow(sequence_features_truncated)
dataset = nvt.Dataset(train_sessions)
sessions_gdf = workflow.fit_transform(dataset)
sessions_gdf.to_parquet('./processed_nvt')
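For intuition about what the encode, rename, and truncate steps do to the data, here is a rough plain-Python analogue. It is only a sketch, not NVTabular's implementation: `encode_sessions` and `keep_last` are hypothetical helpers, the sample sessions are made up, and IDs are assigned in first-seen order here, which may differ from the order Categorify uses.

```python
SESSIONS_MAX_LENGTH = 5  # same illustrative value as in the workflow above


def encode_sessions(sessions, start_index=1):
    """Map each distinct item string to an integer ID, beginning at start_index."""
    vocab = {}
    encoded = []
    for session in sessions:
        ids = []
        for item in session:
            if item not in vocab:
                vocab[item] = len(vocab) + start_index
            ids.append(vocab[item])
        encoded.append(ids)
    return encoded, vocab


def keep_last(sessions, max_length):
    """Keep only the last max_length items of each session (max_length > 0)."""
    return [session[-max_length:] for session in sessions]


# Hypothetical raw "items" column: one list of product-ID strings per session.
raw_items = [["a", "b", "c", "d", "e", "f", "g"], ["b", "a"]]

encoded, vocab = encode_sessions(raw_items, start_index=1)
# Store the result under the renamed column, as the workflow does.
processed = {"item_id-list": keep_last(encoded, SESSIONS_MAX_LENGTH)}
```

A negative slice start keeps the most recent interactions, which is usually what a next-item model should see; sessions shorter than the limit pass through unchanged.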
In addition, since you only have one column and no continuous column in your dataset, you don't need continuous_projection=64 in tr.TabularSequenceFeatures.from_schema(..). You can remove it.
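A possible explanation for why the rename helps (an assumption; the thread does not confirm the cause): PyTorch's nn.Module.add_module raises a KeyError of the form "attribute '<name>' already exists" when a submodule name collides with an attribute the module already has, and dict-like containers such as nn.ModuleDict already define an items() method. If the embedding table for a column is registered under the column's own name, a column called items would trip that check. The plain-Python sketch below imitates the check; `TinyModuleDict` and `try_register` are hypothetical names, not PyTorch or Transformers4Rec APIs.

```python
class TinyModuleDict:
    """Hypothetical stand-in for a dict-like module container (not PyTorch's class)."""

    def __init__(self):
        self._modules = {}

    def items(self):
        """Dict-style accessor; its existence is what causes the name clash."""
        return self._modules.items()

    def add_module(self, name, module):
        # Refuse names that collide with an attribute the container already has.
        if hasattr(self, name) and name not in self._modules:
            raise KeyError(f"attribute '{name}' already exists")
        self._modules[name] = module


def try_register(name):
    """Register a dummy module under `name` and report what happened."""
    container = TinyModuleDict()
    try:
        container.add_module(name, object())
        return "ok"
    except KeyError as err:
        return err.args[0]


result_items = try_register("items")           # clashes with the items() method
result_renamed = try_register("item_id-list")  # no attribute has this name
```

Under this reading, any column name that matches an existing attribute of the container (for example keys or values) would fail the same way, while item_id-list is safe.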
@ardulat is this still an issue for you? did you see my msg above?
@rnyak, thank you! The instructions you sent and the code block above helped! I was able to create TabularSequenceFeatures. I am closing the issue.
❓ Questions & Help
Details
Hello! First, a bit of context: I am using NVTabular to preprocess data for Transformers4Rec, since I am working on session-based recommendations. Currently, I only have one feature, an "items" list of product IDs (strings). So, how do I construct the Schema necessary for transformers4rec.torch.TabularSequenceFeatures?
More context: I went through some of the example notebooks in the Transformers4Rec documentation, but the main issue is related to NVTabular preprocessing. I have tried using nvt.Workflow to create a schema from a pandas data frame with an "items" list feature (as in the example), but I get the following:
In contrast, I am trying to get something like this:
The item_id-list column has tags saying it is a categorical feature (further necessary for TabularSequenceFeatures). How do I get the same tags if I already have an "items" list in my data frame?