NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] Bugs in examples/tutorial #1738

Closed: lendle closed this issue 1 year ago

lendle commented 1 year ago

Describe the bug

Bug 1

In `examples/tutorial/03-Session-based-recsys.ipynb`, section "3.2.4 Train XLNET with Side Information for Next Item Prediction", the cell that runs training fails.

Log with stack trace:

```
***** Running training *****
  Num examples = 112128
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 1314
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File :15

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1316, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1314     tr_loss_step = self.training_step(model, inputs)
   1315 else:
-> 1316     tr_loss_step = self.training_step(model, inputs)
   1318 if (
   1319     args.logging_nan_inf_filter
   1320     and not is_torch_tpu_available()
   1321     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1322 ):
   1323     # if loss is nan or inf simply add the average of previous logged losses
   1324     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1849, in Trainer.training_step(self, model, inputs)
   1847     loss = self.compute_loss(model, inputs)
   1848 else:
-> 1849     loss = self.compute_loss(model, inputs)
   1851 if self.args.n_gpu > 1:
   1852     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:1881, in Trainer.compute_loss(self, model, inputs, return_outputs)
   1879 else:
   1880     labels = None
-> 1881 outputs = model(**inputs)
   1882 # Save past state if it exists
   1883 # TODO: this needs to be fixed and made cleaner later.
   1884 if self.args.past_index >= 0:

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/trainer.py:830, in HFWrapper.forward(self, *args, **kwargs)
    828 def forward(self, *args, **kwargs):
    829     inputs = kwargs
--> 830     return self.wrapper_module(inputs, *args)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:553, in Model.forward(self, inputs, training, **kwargs)
    550 outputs = {}
    551 for head in self.heads:
    552     outputs.update(
--> 553         head(inputs, call_body=True, training=training, always_output_dict=True, **kwargs)
    554     )
    556 if len(outputs) == 1:
    557     outputs = outputs[list(outputs.keys())[0]]

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/model/base.py:398, in Head.forward(self, body_outputs, training, call_body, always_output_dict, ignore_masking, **kwargs)
    395 outputs = {}
    397 if call_body:
--> 398     body_outputs = self.body(body_outputs, training=training, ignore_masking=ignore_masking)
    400 for name, task in self.prediction_task_dict.items():
    401     outputs[name] = task(
    402         body_outputs, ignore_masking=ignore_masking, training=training, **kwargs
    403     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:152, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:
--> 152         input = module(input, training=training, ignore_masking=ignore_masking)
    153     else:
    154         input = module(input, training=training)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/tabular/base.py:390, in TabularModule.__call__(self, inputs, pre, post, merge_with, aggregation, *args, **kwargs)
    387 inputs = self.pre_forward(inputs, transformations=pre)
    389 # This will call the `forward` method implemented by the super class.
--> 390 outputs = super().__call__(inputs, *args, **kwargs)  # noqa
    392 if isinstance(outputs, dict):
    393     outputs = self.post_forward(
    394         outputs, transformations=post, merge_with=merge_with, aggregation=aggregation
    395     )

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py:257, in TabularSequenceFeatures.forward(self, inputs, training, ignore_masking, **kwargs)
    254     outputs = self.aggregation(outputs)
    256 if self.projection_module:
--> 257     outputs = self.projection_module(outputs)
    259 if self.masking and (not ignore_masking or training):
    260     outputs = self.masking(
    261         outputs, item_ids=self.to_merge["categorical_module"].item_seq, training=training
    262     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:148, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    146 if i == len(self) - 1:
    147     filtered_kwargs = filter_kwargs(kwargs, module, filter_positional_or_keyword=False)
--> 148     input = module(input, **filtered_kwargs)
    150 elif "training" in inspect.signature(module.forward).parameters:
    151     if "ignore_masking" in inspect.signature(module.forward).parameters:

File /usr/local/lib/python3.8/dist-packages/transformers4rec/config/schema.py:50, in SchemaMixin.__call__(self, *args, **kwargs)
     47 def __call__(self, *args, **kwargs):
     48     self.check_schema()
---> 50     return super().__call__(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/block/base.py:156, in SequentialBlock.forward(self, input, training, ignore_masking, **kwargs)
    154     input = module(input, training=training)
    155 else:
--> 156     input = module(input)
    158 return input

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1186, in Module._call_impl(self, *input, **kwargs)
   1182 # If we don't have any hooks, we want to skip the rest of the logic in
   1183 # this function, and just call forward.
   1184 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1185         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1186     return forward_call(*input, **kwargs)
   1187 # Do not call functions when jit is used
   1188 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: expected scalar type Float but found Double
```
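For context, the error at the very bottom of the trace is PyTorch's usual dtype check in `F.linear`. A minimal standalone repro of that failure mode (not the notebook's model, just the same mismatch):

```python
import torch

layer = torch.nn.Linear(4, 2)               # nn.Linear weights default to float32
x = torch.rand(3, 4, dtype=torch.float64)   # a float64 input, like the column discussed below
layer(x)  # raises a RuntimeError about mismatched Float/Double scalar types
```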

I believe this is because `product_recency_days_log_norm-list_seq`, created in the prior notebook (02-ETL-with-NVTabular), is float64 rather than float32. I was able to get things to run by adding `>> nvt.ops.ReduceDtypeSize()` to the cell where that feature is defined in section 5.3 of the prior notebook. I'm not sure whether that is the correct fix, though.
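Roughly, the change looks like this (a sketch only, not the notebook's exact op chain; everything other than the appended `ReduceDtypeSize()` stands in for what section 5.3 already defines):

```python
import nvtabular as nvt

# Sketch: the ops ahead of ReduceDtypeSize stand in for the existing
# recency/log-norm feature definition in section 5.3 of 02-ETL-with-NVTabular.
recency_log_norm = (
    ["product_recency_days"]            # illustrative input column name
    >> nvt.ops.Normalize()
    >> nvt.ops.Rename(postfix="_log_norm")
    >> nvt.ops.ReduceDtypeSize()        # downcast the float64 output to float32
)
```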

Bug 2

The XLNet-MLM with side information accuracy results that get written to `results.txt` in 03-Session-based-recsys should have the metric name and value separated by `:` rather than by a space. Metrics from the other two models trained in the notebook are written correctly. This causes the call to `create_bar_chart('results.txt')` to fail.

Easy fix: the cell

```python
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in model.compute_metrics().items():
        f.write('%s %s\n' % (key, value.item()))
```

should have `f.write('%s:%s\n' % (key, value.item()))` in the last line.
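With that change applied, the cell becomes:

```python
with open("results.txt", 'a') as f:
    f.write('\n')
    f.write('XLNet-MLM with side information accuracy results:')
    f.write('\n')
    for key, value in model.compute_metrics().items():
        # ':' separator matches the other models' output, so create_bar_chart('results.txt') can parse it
        f.write('%s:%s\n' % (key, value.item()))
```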

Steps/Code to reproduce bug

Run the tutorial notebooks.

Expected behavior

The tutorial notebooks run end to end without errors.

Environment details:

Google Cloud Workbench managed notebook with image version nvcr.io/nvidia/merlin/merlin-pytorch:22.11

Machine info: a2-highgpu-1g (Accelerator Optimized: 1 NVIDIA Tesla A100 GPU, 12 vCPUs, 85GB RAM)

I'm using the versions of the example notebooks that ship in that image.


lendle commented 1 year ago

I just realized I opened this issue in the wrong repo; it belongs in Transformers4Rec. Moved: https://github.com/NVIDIA-Merlin/Transformers4Rec/issues/589