Rajathbharadwaj opened 1 year ago
Pull requests from external contributors require approval from an NVIDIA-Merlin organization member with write permissions or greater before CI can begin.
@Rajathbharadwaj hello, thanks for the PR. Can you please first provide a reproducible example of your error with a toy dataset?
Hey @rnyak, definitely.
Following the Advanced NVTabular Workflow:

```python
import os

from merlin.datasets.entertainment import get_movielens

input_path = os.environ.get("INPUT_DATA_DIR", os.path.expanduser("~/merlin-framework/movielens/"))
get_movielens(variant="ml-1m", path=input_path)  # noqa

from merlin.core.dispatch import get_lib

data = get_lib().read_parquet(f'{input_path}ml-1m/train.parquet').sample(frac=1)
train = data.iloc[:600_000]
valid = data.iloc[600_000:]
movies = get_lib().read_parquet(f'{input_path}ml-1m/movies_converted.parquet')

import nvtabular as nvt
from merlin.schema.tags import Tags

train_ds = nvt.Dataset(train, npartitions=2)
valid_ds = nvt.Dataset(valid)

# shuffle_by_keys returns a new Dataset, so the result needs to be assigned
train_ds = train_ds.shuffle_by_keys('userId')
valid_ds = valid_ds.shuffle_by_keys('userId')

genres = ['movieId'] >> nvt.ops.JoinExternal(movies, on='movieId', columns_ext=['movieId', 'genres'])
genres = genres >> nvt.ops.Categorify(freq_threshold=10)

def rating_to_binary(col):
    return col > 3

binary_rating = ['rating'] >> nvt.ops.LambdaOp(rating_to_binary) >> nvt.ops.Rename(name='binary_rating')

userId = ['userId'] >> nvt.ops.Categorify() >> nvt.ops.AddTags(tags=[Tags.USER_ID, Tags.CATEGORICAL, Tags.USER])
movieId = ['movieId'] >> nvt.ops.Categorify() >> nvt.ops.AddTags(tags=[Tags.ITEM_ID, Tags.CATEGORICAL, Tags.ITEM])
binary_rating = binary_rating >> nvt.ops.AddTags(tags=[Tags.TARGET, Tags.BINARY_CLASSIFICATION])

workflow = nvt.Workflow(userId + movieId + genres + binary_rating)

train_transformed = workflow.fit_transform(train_ds)
valid_transformed = workflow.transform(valid_ds)
valid_transformed.compute().head()
train_transformed.schema
```
The issue occurs after running this code:

```python
from transformers4rec.torch import TabularSequenceFeatures

tabular_inputs = TabularSequenceFeatures.from_schema(
    train_transformed.schema,
    embedding_dim_default=128,
    max_sequence_length=20,
    d_output=100,
    aggregation="concat",
    masking="clm",
)
```
It throws the following error:

```
TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')
```
After a bit of inspection, I found that the `max_sequence_length` parameter isn't passed through to `tabular.py`, so its value ends up as `None`. `torch.Size()` then fails because that `None` appears at index 1 of the iterable it receives, which is exactly the `NoneType` reported in the error.
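The failure mode can be reproduced in isolation. This is a minimal sketch (the variable names below are hypothetical, not taken from the library) of how a `None` sequence length produces exactly this `TypeError`:

```python
import torch

# If max_sequence_length is never forwarded, the shape tuple built for the
# sequential features contains None instead of an int.
batch_size, max_sequence_length, embedding_dim = 32, None, 128

try:
    torch.Size([batch_size, max_sequence_length, embedding_dim])
except TypeError as e:
    # prints: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')
    print(e)
```

Passing an actual integer (e.g. `max_sequence_length=20`) makes `torch.Size()` succeed, which is why forwarding the parameter fixes the error.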
Fixes the `TypeError: torch.Size() takes an iterable of 'int' (item 1 is 'NoneType')` error.

When using Transformers4Rec, creating the `tabular_inputs` via `tr.TabularSequenceFeatures.from_schema` throws a `TypeError`. After a bit of inspection, the following changes solved the issue.

Fixes #728
Goals :soccer:
Implementation Details :construction:
Testing Details :mag: