NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation that works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

set schema for movielens dataset #735

Closed NamartaVij closed 10 months ago

NamartaVij commented 1 year ago

Could you please help me resolve this error? I am trying to train a model with Transformers4Rec, and this is the error I get.

Part of the code:

trainer = tr.Trainer(
    model=model,
    args=training_args,
    schema=schema,
    compute_metrics=True,
)

Using amp fp16 backend

%%time
start_time_window_index = 1
final_time_window_index = 4
for time_index in range(start_time_window_index, final_time_window_index):
    # Set the data for this time window
    time_index_train = time_index
    time_index_eval = time_index + 1
    # train_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_train}/train.parquet"))
    # eval_paths = glob.glob(os.path.join(OUTPUT_DIR, f"{time_index_eval}/valid.parquet"))
    # Train on the day related to time_index
    print('*' * 20)
    print("Launch training for day %s are:" % time_index)
    print('*' * 20 + '\n')
    trainer.train_dataset_or_path = train_transformed
    trainer.reset_lr_scheduler()
    trainer.train()
    trainer.state.global_step += 1
    # Evaluate on the following day
    trainer.eval_dataset_or_path = valid_transformed
    train_metrics = trainer.evaluate(metric_key_prefix='eval')
    print('*' * 20)
    print("Eval results for day %s are:\t" % time_index_eval)
    print('\n' + '*' * 20 + '\n')
    for key in sorted(train_metrics.keys()):
        print(" %s = %s" % (key, str(train_metrics[key])))
    wipe_memory()

Output:
Running training
  Num examples = 600192
  Num Epochs = 10
  Instantaneous batch size per device = 384
  Total train batch size (w. parallel, distributed & accumulation) = 384
  Gradient Accumulation steps = 1
  Total optimization steps = 15630


Launch training for day 1 are:


Output Schema ->
[{'name': 'userId', 'tags': {<Tags.USER: 'user'>, <Tags.ID: 'id'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0, 'max_size': 0, 'cat_path': './/categories/unique.userId.parquet', 'domain': {'min': 0, 'max': 6042, 'name': 'userId'}, 'embedding_sizes': {'cardinality': 6043, 'dimension': 210}}, 'dtype': DType(name='int64', element_type=<ElementType.Int: 'int'>, element_size=64, element_unit=None, signed=True, shape=Shape(dims=(Dimension(min=0, max=None),))), 'is_list': False, 'is_ragged': False},
 {'name': 'movieId', 'tags': {<Tags.ITEM: 'item'>, <Tags.LIST: 'list'>, <Tags.ID: 'id'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 10, 'max_size': 0, 'cat_path': './/categories/unique.movieId.parquet', 'domain': {'min': 0, 'max': 3103, 'name': 'movieId'}, 'embedding_sizes': {'cardinality': 3104, 'dimension': 144}}, 'dtype': DType(name='int64', element_type=<ElementType.Int: 'int'>, element_size=64, element_unit=None, signed=True, shape=Shape(dims=(Dimension(min=0, max=None),))), 'is_list': False, 'is_ragged': False},
 {'name': 'genres', 'tags': {<Tags.LIST: 'list'>, <Tags.CATEGORICAL: 'categorical'>}, 'properties': {'num_buckets': None, 'freq_threshold': 10, 'max_size': 0, 'cat_path': './/categories/unique.genres.parquet', 'domain': {'min': 0, 'max': 20, 'name': 'genres'}, 'embedding_sizes': {'cardinality': 21, 'dimension': 16}, 'value_count': {'min': 0, 'max': None}}, 'dtype': DType(name='int64', element_type=<ElementType.Int: 'int'>, element_size=64, element_unit=None, signed=True, shape=Shape(dims=(Dimension(min=0, max=None), Dimension(min=0, max=None)))), 'is_list': True, 'is_ragged': True},
 {'name': 'binary_rating', 'tags': {<Tags.BINARY_CLASSIFICATION: 'binary_classification'>, <Tags.TARGET: 'target'>}, 'properties': {}, 'dtype': DType(name='bool', element_type=<ElementType.Bool: 'bool'>, element_size=None, element_unit=None, signed=None, shape=Shape(dims=(Dimension(min=0, max=None),))), 'is_list': False, 'is_ragged': False}]
Sparse Feats -> ['movieId', 'genres']
Padding Lengths {'movieId': 20, 'genres': 20}
Item IDS -> torch.Size([384])

AssertionError                            Traceback (most recent call last)
File :15

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1316, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
--> 1316    tr_loss_step = self.training_step(model, inputs)

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1847, in Trainer.training_step(self, model, inputs)
--> 1847    loss = self.compute_loss(model, inputs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/trainer.py:323, in Trainer.compute_loss(self, model, inputs, return_outputs)
--> 323     outputs = model(inputs, targets=targets, training=True)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1533, in Module._call_impl(self, *args, **kwargs)
--> 1533    return forward_call(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:560, in Model.forward(self, inputs, targets, training, testing, **kwargs)
--> 560     head_output = head(inputs, call_body=True, targets=targets, training=training, testing=testing, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1533, in Module._call_impl(self, *args, **kwargs)
--> 1533    return forward_call(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:382, in Head.forward(self, body_outputs, training, testing, targets, call_body, top_k, **kwargs)
--> 382     body_outputs = self.body(body_outputs, training=training, testing=testing, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1533, in Module._call_impl(self, *args, **kwargs)
--> 1533    return forward_call(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:256, in SequentialBlock.forward(self, input, training, testing, **kwargs)
--> 256     input = module(input, training=training, testing=testing)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/tabular/base.py:392, in TabularModule.__call__(self, inputs, pre, post, merge_with, aggregation, *args, **kwargs)
--> 392     outputs = super().__call__(inputs, *args, **kwargs)  # noqa

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1533, in Module._call_impl(self, *args, **kwargs)
--> 1533    return forward_call(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py:262, in TabularSequenceFeatures.forward(self, inputs, training, testing, **kwargs)
--> 262     outputs = self.masking(outputs, item_ids=self.to_merge["categorical_module"].item_seq, training=training, testing=testing)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1533, in Module._call_impl(self, *args, **kwargs)
--> 1533    return forward_call(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/masking.py:223, in MaskSequence.forward(self, inputs, item_ids, training, testing)
--> 223     _ = self.compute_masked_targets(item_ids=item_ids, training=training, testing=testing)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/masking.py:149, in MaskSequence.compute_masked_targets(self, item_ids, training, testing)
    148     print(f'Item IDS -> {item_ids.shape}')
--> 149     assert item_ids.ndim == 2, "item_ids must have 2 dimensions."

AssertionError: item_ids must have 2 dimensions.

vivpra89 commented 1 year ago

@NamartaVij do you mind posting the input and the architecture as well? I can spend some time looking at the code with the MovieLens data.

NamartaVij commented 1 year ago

Here is the link, please go through it: https://github.com/Rajathbharadwaj/NVTabular-Merlin-T4C-ML/tree/main

It would be great if you could give your feedback soon; I actually have a deadline for this.

rnyak commented 1 year ago

@NamartaVij please share your MovieLens NVTabular script. How do you generate sequential data from MovieLens, and how do you tag the columns? We need to know that first to reproduce your issue.

You cannot use Transformers4Rec without sequential data: you need to generate user sessions with sequential item ids. Please keep that in mind. For a rough idea of what that preprocessing step can look like, see the sketch below.
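
A minimal sketch of that preprocessing in NVTabular (the column names, the per-user grouping on `userId`, and the file paths are assumptions about a MovieLens-style setup, not the poster's actual notebook):

```python
import nvtabular as nvt

# Encode the item id and tag it so Transformers4Rec can find it in the schema.
item_id = ["movieId"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()

# Group the ratings per user, ordered by timestamp, so that movieId becomes a
# list column holding each user's interaction sequence.
groupby_features = (
    item_id + ["userId", "timestamp"]
    >> nvt.ops.Groupby(
        groupby_cols=["userId"],
        sort_cols=["timestamp"],
        aggs={"movieId": ["list", "count"]},
        name_sep="-",
    )
)

# Truncate each sequence to at most MAX_LEN items.
MAX_LEN = 20
features = (
    groupby_features["movieId-list"] >> nvt.ops.ListSlice(0, MAX_LEN)
) + groupby_features["userId", "movieId-count"]

workflow = nvt.Workflow(features)
workflow.fit_transform(nvt.Dataset("ratings.parquet")).to_parquet("processed/")
```

Other per-interaction features can be grouped the same way. The important part is the `Groupby` step: it turns one-row-per-rating data into one-row-per-user (or per-session) data with a list-valued item-id column, which is the shape the masking module asserts on.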

NamartaVij commented 1 year ago

@rnyak yes, here it is:

rnyak commented 1 year ago

@NamartaVij this won't work: your movieId column is not a list column, and you cannot train a sequential model without sequential data, so you need to generate it first. If you are predicting the next movie to watch, movieId is your item id and it has to be a sequence, like this:

session_id    movie_id
1             [1, 2, 3]
2             [2, 5, 7, 8]
3             [1, 5]
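
Once the data looks like that, the model side picks the list columns up from the schema. A rough sketch (assuming the processed output and the movieId-list column name from a workflow like the one above, not the poster's actual code):

```python
from merlin.io import Dataset
from transformers4rec import torch as tr

# Hypothetical path to the NVTabular output.
train = Dataset("processed/part_0.parquet")
schema = train.schema.select_by_name(["movieId-list"])

# Sequential input block; the tagged item-id list column drives the masking.
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    aggregation="concat",
    masking="mlm",
    d_output=64,
)

prediction_task = tr.NextItemPredictionTask(weight_tying=True)

transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
model = transformer_config.to_torch_model(input_module, prediction_task)
```

With a list-valued item-id column, the `item_ids` tensor passed to the masking block is (batch, sequence_length), which is exactly what the failing assertion expects.
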
NamartaVij commented 1 year ago

@rnyak yes, thank you, I got it.

Apart from this, may I ask why we always calculate NDCG and not diversity?

rnyak commented 10 months ago

@NamartaVij we implemented the commonly reported evaluation metrics. You can create your own custom metrics as well.
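
For reference, metrics are plugged into the prediction task and custom ones can follow the same torchmetrics-style interface. A rough sketch (ItemCoverageAt is a hypothetical diversity-style metric, and overriding RankingMetric's `_metric` hook is assumed to be the extension point in the installed version):

```python
import torch
from transformers4rec import torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, RecallAt, RankingMetric


class ItemCoverageAt(RankingMetric):
    """Hypothetical diversity-style metric: the fraction of the catalog that
    shows up in the top-k recommendations of a batch."""

    def __init__(self, top_ks=None, num_items=1000, labels_onehot=False):
        super().__init__(top_ks=top_ks, labels_onehot=labels_onehot)
        self.num_items = num_items

    def _metric(self, ks, scores, labels):
        # scores: (batch, num_items) prediction logits; labels are unused here.
        coverages = []
        for k in ks:
            topk_items = torch.topk(scores, k=int(k), dim=-1).indices
            coverage = topk_items.unique().numel() / self.num_items
            coverages.append(
                torch.full((scores.size(0),), coverage, device=scores.device)
            )
        # One value per example and per cut-off, matching the built-in metrics.
        return torch.stack(coverages, dim=-1)


prediction_task = tr.NextItemPredictionTask(
    weight_tying=True,
    metrics=[
        NDCGAt(top_ks=[10, 20], labels_onehot=True),
        RecallAt(top_ks=[10, 20], labels_onehot=True),
        ItemCoverageAt(top_ks=[10, 20], num_items=3104),
    ],
)
```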