NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] 'Trainer' object has no attribute '_pad_across_processes' #743

Closed · SPP3000 closed this 10 months ago

SPP3000 commented 11 months ago

Bug description

Executing the example code of 02-End-to-end-session-based-with-Yoochoose-PyT.ipynb leads to an error in fit_and_evaluate(...), as the trainer cannot find the method '_pad_across_processes'. Is this a bug, and if not, where does this method come from and why is it not found? Training runs without problems, but as soon as the evaluation starts, this error is raised.

Steps/Code to reproduce bug

  1. Installing all necessary libraries via dpkg, apt, and pip
  2. Downloading 01-ETL-with-NVTabular.ipynb, 02-End-to-end-session-based-with-Yoochoose-PyT.ipynb, and the dataset
  3. Executing 01-ETL-with-NVTabular.ipynb
  4. Executing 02-End-to-end-session-based-with-Yoochoose-PyT.ipynb, which leads to the reported behavior (a sketch of the failing cell follows)
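
For reference, a sketch of the failing notebook cell, reconstructed from the traceback below (recsys_trainer and OUTPUT_DIR are defined earlier in the notebook):

    import os
    from transformers4rec.torch.utils.examples_utils import fit_and_evaluate

    # Time-window indices for the notebook's incremental train/eval loop
    start_time_idx = int(os.environ.get("START_TIME_INDEX", "178"))
    end_time_idx = int(os.environ.get("END_TIME_INDEX", "180"))

    # Training completes, but the evaluation phase inside fit_and_evaluate()
    # raises: AttributeError: 'Trainer' object has no attribute '_pad_across_processes'
    OT_results = fit_and_evaluate(
        recsys_trainer,
        start_time_index=start_time_idx,
        end_time_index=end_time_idx,
        input_dir=OUTPUT_DIR,
    )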

Expected behavior

Output similar to the Jupyter notebook as shown in the Git repository

Environment details

Additional context

AttributeError                            Traceback (most recent call last)
Cell In[10], line 4
      2 start_time_idx = int(os.environ.get("START_TIME_INDEX", "178"))
      3 end_time_idx = int(os.environ.get("END_TIME_INDEX", "180"))
----> 4 OT_results = fit_and_evaluate(recsys_trainer, start_time_index=start_time_idx, end_time_index=end_time_idx, input_dir=OUTPUT_DIR)

File ~/PycharmProjects/transformer/venv/lib/python3.10/site-packages/transformers4rec/torch/utils/examples_utils.py:81, in fit_and_evaluate(trainer, start_time_index, end_time_index, input_dir)
     79 # 3. Evaluate on valid data of time_index+1
     80 trainer.eval_dataset_or_path = eval_paths
---> 81 eval_metrics = trainer.evaluate(metric_key_prefix="eval")
     82 print("\n***** Evaluation results for day %s:*****\n" % time_index_eval)
     83 for key in sorted(eval_metrics.keys()):

File ~/PycharmProjects/transformer/venv/lib/python3.10/site-packages/transformers/trainer.py:2972, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   2969 start_time = time.time()
   2971 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 2972 output = eval_loop(
   2973     eval_dataloader,
   2974     description="Evaluation",
   2975     # No point gathering the predictions if there are no metrics, otherwise we defer to
   2976     # self.args.prediction_loss_only
   2977     prediction_loss_only=True if self.compute_metrics is None else None,
   2978     ignore_keys=ignore_keys,
   2979     metric_key_prefix=metric_key_prefix,
   2980 )
   2982 total_batch_size = self.args.eval_batch_size * self.args.world_size
   2983 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File ~/PycharmProjects/transformer/venv/lib/python3.10/site-packages/transformers4rec/torch/trainer.py:524, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
    520     losses_host = (
    521         losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
    522     )
    523 if labels is not None:
--> 524     labels = self._pad_across_processes(labels)
    525     labels = self._nested_gather(labels)
    526     labels_host = (
    527         labels
    528         if labels_host is None
    529         else nested_concat(labels_host, labels, padding_index=0)
    530     )

AttributeError: 'Trainer' object has no attribute '_pad_across_processes'
SPP3000 commented 11 months ago

I think this is related: https://github.com/huggingface/transformers/commit/f1732e1374a082bf8e43bd0e4aa8a2da21a32a21
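
If that commit is indeed where the helper was removed, the equivalent functionality now lives in Accelerate. A minimal compatibility shim, as a sketch (assuming accelerate is installed; attach the method to the Trainer class the same way as the full copy shown later in this thread):

    # Sketch of a compatibility shim: delegate to Accelerate's standalone
    # pad_across_processes, which pads tensors on all processes to a common
    # size so they can be gathered safely.
    from accelerate.utils import pad_across_processes

    def _pad_across_processes(self, tensor, pad_index=-100):
        # dim=1 mirrors the removed Trainer helper, which padded along dim 1
        return pad_across_processes(tensor, dim=1, pad_index=pad_index)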

rnyak commented 11 months ago

@SPP3000 can you please tell us how you installed TF4Rec and the other Merlin libraries? The recommended way is to use the merlin-pytorch:23.06 image; if you are doing a pip installation, please be sure you comply with the transformers version in the requirements here, for example, try 4.12.
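
For instance, a pinned pip installation (a sketch; 4.12 per the requirements mentioned above, though the exact pin may differ for your Transformers4Rec release) would look like:

    pip install transformers4rec "transformers==4.12.*"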

SPP3000 commented 11 months ago

I installed it in a Python virtual environment without the use of any images:

    pip install transformers4rec
    pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com

After executing the example code, I got some warnings about missing dependencies, such as tensorflow. I installed those via pip as well.

The problem I was facing is that my installation gave me the newest version of the Hugging Face transformers library, from which the private method _pad_across_processes() has been removed in the Trainer class.

You might want to account for this in your upcoming releases. For now, I copied and pasted the missing code into the Trainer class:

    # Copied from Accelerate.
    def _pad_across_processes(self, tensor, pad_index=-100):
        """
        Recursively pad the tensors in a nested list/tuple/dictionary of tensors from all devices to the same size so
        they can safely be gathered.
        """
        if isinstance(tensor, (list, tuple)):
            return type(tensor)(self._pad_across_processes(t, pad_index=pad_index) for t in tensor)
        elif isinstance(tensor, dict):
            return type(tensor)({k: self._pad_across_processes(v, pad_index=pad_index) for k, v in tensor.items()})
        elif not isinstance(tensor, torch.Tensor):
            raise TypeError(
                f"Can't pad the values of type {type(tensor)}, only of nested list/tuple/dicts of tensors."
            )

        if len(tensor.shape) < 2:
            return tensor
        # Gather all sizes
        size = torch.tensor(tensor.shape, device=tensor.device)[None]
        sizes = self._nested_gather(size).cpu()

        max_size = max(s[1] for s in sizes)
        # When extracting XLA graphs for compilation, max_size is 0,
        # so use inequality to avoid errors.
        if tensor.shape[1] >= max_size:
            return tensor

        # Then pad to the maximum size
        old_size = tensor.shape
        new_size = list(old_size)
        new_size[1] = max_size
        new_tensor = tensor.new_zeros(tuple(new_size)) + pad_index
        new_tensor[:, : old_size[1]] = tensor
        return new_tensor
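
If you would rather not edit the file in site-packages, a runtime monkey-patch is an alternative (a sketch; it assumes the snippet above is defined as a module-level function and that Trainer is importable from transformers4rec.torch):

    # Sketch: attach the restored helper to the Transformers4Rec Trainer class
    # at runtime instead of editing the installed package.
    from transformers4rec.torch import Trainer

    Trainer._pad_across_processes = _pad_across_processes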
rnyak commented 11 months ago

> After executing the example code, I got some warnings about missing dependencies, such as tensorflow. I installed those via pip as well.

You do not need to install tensorflow for a pytorch workflow. Thanks for the code.

SPP3000 commented 11 months ago

> You do not need to install tensorflow for a pytorch workflow. Thanks for the code.

Ok good to know!

From my side everything is clear, and we can close the issue anytime.

Victor055 commented 10 months ago

> I installed it in a Python virtual environment without the use of any images. [...] For now, I copied and pasted the missing code into the Trainer class. (SPP3000's full workaround, quoted from above)

Hi, sorry, I'm just starting to use Transformers4Rec. Based on your conclusion, my solution was to install the following. I didn't find a solution on any other site; thanks for sharing your knowledge.

    !pip install --upgrade accelerate
    !pip install transformers==4.28.0
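
This likely works because transformers 4.28.0 predates the release in which _pad_across_processes was removed from the Trainer in favor of the Accelerate integration, so the method is still present there.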

SPP3000 commented 10 months ago

Thank you for providing an alternative solution that does not require patching the library code by hand. I can confirm that this solution works as well.