NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation that works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

Fixes TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType') #729

Open Rajathbharadwaj opened 1 year ago

Rajathbharadwaj commented 1 year ago

Fixes the TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType') error.

When using Transformers4Rec, creating tabular_inputs via tr.TabularSequenceFeatures.from_schema throws a TypeError. After a bit of inspection, the following changes solved the issue.

Fixes #728

Goals :soccer:

Implementation Details :construction:

Testing Details :mag:

rapids-bot[bot] commented 1 year ago

Pull requests from external contributors require approval from an NVIDIA-Merlin organization member with write permissions or greater before CI can begin.

rnyak commented 1 year ago

@Rajathbharadwaj hello. Thanks for the PR. Can you please first provide a reproducible example of your error with a toy dataset?

Rajathbharadwaj commented 1 year ago

Hey @rnyak, definitely.

Following the Advanced NVTabular Workflow:

import os
from merlin.datasets.entertainment import get_movielens

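# Download and convert the MovieLens 1M dataset under the input path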
input_path = os.environ.get("INPUT_DATA_DIR", os.path.expanduser("~/merlin-framework/movielens/"))
get_movielens(variant="ml-1m", path=input_path); #noqa

from merlin.core.dispatch import get_lib

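# get_lib() dispatches to cuDF when a GPU is available, otherwise pandas;
# sample(frac=1) shuffles the rows before splitting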
data = get_lib().read_parquet(f'{input_path}ml-1m/train.parquet').sample(frac=1)

train = data.iloc[:600_000]
valid = data.iloc[600_000:]

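# Load the movies table so the genres column can be joined in below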
movies = get_lib().read_parquet(f'{input_path}ml-1m/movies_converted.parquet')

import nvtabular as nvt
from merlin.schema.tags import Tags

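# Wrap the dataframes in NVTabular Datasets (npartitions controls the Dask partitioning)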
train_ds = nvt.Dataset(train, npartitions=2)
valid_ds = nvt.Dataset(valid)

# shuffle_by_keys returns a new Dataset rather than shuffling in place,
# so the result must be assigned back
train_ds = train_ds.shuffle_by_keys('userId')
valid_ds = valid_ds.shuffle_by_keys('userId')

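# Join the genres column onto each interaction by movieId; Categorify with
# freq_threshold=10 maps genres seen fewer than 10 times to a shared index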
genres = ['movieId'] >> nvt.ops.JoinExternal(movies, on='movieId', columns_ext=['movieId', 'genres'])

genres = genres >> nvt.ops.Categorify(freq_threshold=10)

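# Binarize the 1-5 star rating: anything above 3 becomes the positive class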
def rating_to_binary(col):
    return col > 3

binary_rating = ['rating'] >> nvt.ops.LambdaOp(rating_to_binary) >> nvt.ops.Rename(name='binary_rating')

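# Encode the ID columns and tag columns by role (user id, item id, binary target)
# so that downstream libraries can locate them in the schema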
userId = ['userId'] >> nvt.ops.Categorify() >> nvt.ops.AddTags(tags=[Tags.USER_ID, Tags.CATEGORICAL, Tags.USER])
movieId = ['movieId'] >> nvt.ops.Categorify() >> nvt.ops.AddTags(tags=[Tags.ITEM_ID, Tags.CATEGORICAL, Tags.ITEM])
binary_rating = binary_rating >> nvt.ops.AddTags(tags=[Tags.TARGET, Tags.BINARY_CLASSIFICATION])

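# Combine all column groups into a single workflow: fit on train only, then transform both splits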
workflow = nvt.Workflow(userId + movieId + genres + binary_rating)

train_transformed = workflow.fit_transform(train_ds)
valid_transformed = workflow.transform(valid_ds)
valid_transformed.compute().head()
train_transformed.schema

# Issue after running this code

from transformers4rec.torch import TabularSequenceFeatures
tabular_inputs = TabularSequenceFeatures.from_schema(
    train_transformed.schema,
    embedding_dim_default=128,
    max_sequence_length=20,
    d_output=100,
    aggregation="concat",
    masking="clm",
)

It throws the following error: TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType').

After a bit of inspection, I found that the max_sequence_length parameter isn't passed through to the tabular.py file, so max_sequence_length ends up as None; torch.Size() then receives None at index 1, which raises the NoneType error.
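To make the failure mode concrete, here is a minimal sketch (illustrative only, not the library's actual internals): if max_sequence_length is never forwarded, a downstream shape computation builds a torch.Size from a None dimension, which reproduces the exact error.

import torch

# Hypothetical stand-in for the value that from_schema should have forwarded;
# inside the library it ends up as None instead of e.g. 20
max_sequence_length = None

try:
    # e.g. (batch size, sequence length, embedding dim)
    torch.Size([64, max_sequence_length, 128])
except TypeError as e:
    print(e)  # torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')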