NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] Bugs in examples/tutorial #589

Closed · lendle closed this issue 1 year ago

lendle commented 1 year ago

Bug description

Bug 1

In `examples/tutorial/03-Session-based-recsys.ipynb`, section "3.2.4 Train XLNET with Side Information for Next Item Prediction", the cell that runs training fails.

Log with stack trace:

```
***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 1314
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :15

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1316, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1314 tr_loss_step = self.training_step(model, inputs)
   1315 else:
-> 1316 tr_loss_step = self.training_step(model, inputs)
   1318 if (
   1319     args.logging_nan_inf_filter
   1320     and not is_torch_tpu_available()
   1321     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1322 ):
   1323     # if loss is nan or inf simply add the average of previous logged losses
   1324     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1849, in Trainer.training_step(self, model, inputs)
   1847 loss = self.compute_loss(model, inputs)
   1848 else:
-> 1849 loss = self.compute_loss(model, inputs)
   1851 if self.args.n_gpu > 1:
   1852     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1881, in Trainer.compute_loss(self, model, inputs, return_outputs)
   1879 else:
   1880     labels = None
-> 1881 outputs = model(**inputs)
   1882 # Save past state if it exists
   1883 # TODO: this needs to be fixed and made cleaner later.
   1884 if self.args.past_index >= 0:

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/trainer.py:830, in HFWrapper.forward(self, *args, **kwargs)
    828 def forward(self, *args, **kwargs):
    829     inputs = kwargs
--> 830     return self.wrapper_module(inputs, *args)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:553, in Model.forward(self, inputs, training, **kwargs)
    550 outputs = {}
    551 for head in self.heads:
    552     outputs.update(
--> 553         head(inputs, call_body=True, training=training, always_output_dict=True, **kwargs)
    554     )
    556 if len(outputs) == 1:
    557     outputs = outputs[list(outputs.keys())[0]]

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:398, in Head.forward(self, body_outputs, training, call_body, always_output_dict, ignore_masking, **kwargs)
    395 outputs = {}
    397 if call_body:
--> 398     body_outputs = self.body(body_outputs, training=training, ignore_masking=ignore_masking)
    400 for name, task in self.prediction_task_dict.items():
    401     outputs[name] = task(
    402         body_outputs, ignore_masking=ignore_masking, training=training, **kwargs
    403     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50 return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:152, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:
--> 152         input = module(input, training=training, ignore_masking=ignore_masking)
    153     else:
    154         input = module(input, training=training)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50 return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/tabular/base.py:390, in TabularModule.__call__(self, inputs, pre, post, merge_with, aggregation, *args, **kwargs)
    387 inputs = self.pre_forward(inputs, transformations=pre)
    389 # This will call the `forward` method implemented by the super class.
--> 390 outputs = super().__call__(inputs, *args, **kwargs)  # noqa
    392 if isinstance(outputs, dict):
    393     outputs = self.post_forward(
    394         outputs, transformations=post, merge_with=merge_with, aggregation=aggregation
    395     )

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py:257, in TabularSequenceFeatures.forward(self, inputs, training, ignore_masking, **kwargs)
    254 outputs = self.aggregation(outputs)
    256 if self.projection_module:
--> 257     outputs = self.projection_module(outputs)
    259 if self.masking and (not ignore_masking or training):
    260     outputs = self.masking(
    261         outputs, item_ids=self.to_merge["categorical_module"].item_seq, training=training
    262     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50 return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:148, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    146 if i == len(self) - 1:
    147     filtered_kwargs = filter_kwargs(kwargs, module, filter_positional_or_keyword=False)
--> 148     input = module(input, **filtered_kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50 return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:156, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    154     input = module(input, training=training)
    155 else:
--> 156     input = module(input)
    158 return input

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: expected scalar type Float but found Double
```
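For context, the failing `F.linear` call is the standard PyTorch float32/float64 mismatch. A minimal standalone sketch (not code from the notebook) that reproduces the same class of error:

```python
import torch

linear = torch.nn.Linear(8, 4)                 # nn.Linear parameters default to float32
x = torch.rand(2, 8, dtype=torch.float64)      # a float64 ("Double") input, like the ETL output column

try:
    linear(x)                                  # raises a dtype-mismatch RuntimeError, as in the trace above
except RuntimeError as err:
    print(err)                                 # exact wording varies by torch version

out = linear(x.float())                        # casting the input to float32 makes the call succeed
```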

I believe this is because the `product_recency_days_log_norm-list_seq` feature created in the prior notebook (`02-ETL-with-NVTabular`) is float64 rather than float32. I was able to get training to run by appending `>> nvt.ops.ReduceDtypeSize()` to the cell where that feature is defined in that notebook (section 5.3), but I'm not sure this is the correct fix.
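For illustration, a sketch of that workaround; the column name and the ops before `ReduceDtypeSize` are placeholders standing in for what section 5.3 of `02-ETL-with-NVTabular` actually does:

```python
import nvtabular as nvt

# Sketch only: "product_recency_days" and the preceding ops are assumptions, not the notebook's exact code.
recency_log_norm = (
    ["product_recency_days"]
    >> nvt.ops.LogOp()
    >> nvt.ops.Normalize()
    >> nvt.ops.Rename(name="product_recency_days_log_norm")
    >> nvt.ops.ReduceDtypeSize()   # intended to shrink the resulting float64 column down to float32
)
```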

Bug 2

The XLNet-MLM with side information accuracy results that get written to `results.txt` in 03-Session-based-recsys should have metric names and values separated by `:` rather than a space. Metrics from the other two models trained in the notebook are written correctly. The mismatch causes the call to `create_bar_chart('results.txt')` to fail.

Easy fix. The notebook currently writes these metrics with:

```python
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in model.compute_metrics().items():
        f.write('%s %s\n' % (key, value.item()))
```

The last line should instead be `f.write('%s:%s\n' % (key, value.item()))`.
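Put together, the corrected cell would be (where `model` is the trained XLNet-MLM model from the notebook):

```python
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in model.compute_metrics().items():
        # ':' separator matches the format that create_bar_chart('results.txt') parses
        f.write('%s:%s\n' % (key, value.item()))
```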

Steps/Code to reproduce bug

Run the tutorial notebooks.

Expected behavior

Environment details

Google Cloud Workbench managed notebook with image version nvcr.io/nvidia/merlin/merlin-pytorch:22.11

Machine info: a2-highgpu-1g (Accelerator Optimized: 1 NVIDIA Tesla A100 GPU, 12 vCPUs, 85GB RAM)

I'm using the version of the example notebooks that is available in the image.

Additional context

rnyak commented 1 year ago

@lendle thanks for reporting this. We'll take a look shortly.

rnyak commented 1 year ago

@lendle I cannot reproduce the first error message you shared from `examples/tutorial/03-Session-based-recsys.ipynb`, section "3.2.4 Train XLNET with Side Information for Next Item Prediction". Please note that we already fixed the dtype of the `product_recency_days_log_norm-list_seq` column created in the prior 02-ETL notebook so that it is float32. We do it this way in the notebook: `price_log = ['price'] >> nvt.ops.LogOp() >> nvt.ops.Normalize(out_dtype=np.float32) >> nvt.ops.Rename(name='price_log_norm')`

You might want to use the merlin-pytorch:22.12 Docker image to pick up the recent changes, or just fix that line in your 02-ETL notebook.
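Applied to the recency feature, the same fix would look roughly like this (a sketch only; the exact column name and op chain in the 02-ETL notebook may differ):

```python
import numpy as np
import nvtabular as nvt

# Assumed shape of the recency feature definition (column name is a placeholder);
# the relevant part is out_dtype=np.float32, which keeps the output column from being float64.
recency_log_norm = (
    ["product_recency_days"]
    >> nvt.ops.LogOp()
    >> nvt.ops.Normalize(out_dtype=np.float32)
    >> nvt.ops.Rename(name="product_recency_days_log_norm")
)
```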

For the second bug, we'll fix that. Thanks.

rnyak commented 1 year ago

Closing due to lack of activity.