NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0

[BUG] getting OutOfMemoryError: CUDA out of memory from trainer.evaluate step in merlin-pytorch 23.04 #693

Closed: rnyak closed this issue 1 year ago

rnyak commented 1 year ago

Bug description

I am getting the following error from the trainer.evaluate() step when using the merlin-pytorch 23.04 image. The same code with the same data trains and evaluates without any issues when I use the 23.02 image.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[21], line 2
      1 # eval_data_paths = val_data_paths
----> 2 eval_metrics = trainer.evaluate(metric_key_prefix='eval')

File /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:2113, in Trainer.evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   2110 start_time = time.time()
   2112 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 2113 output = eval_loop(
   2114     eval_dataloader,
   2115     description="Evaluation",
   2116     # No point gathering the predictions if there are no metrics, otherwise we defer to
   2117     # self.args.prediction_loss_only
   2118     prediction_loss_only=True if self.compute_metrics is None else None,
   2119     ignore_keys=ignore_keys,
   2120     metric_key_prefix=metric_key_prefix,
   2121 )
   2123 total_batch_size = self.args.eval_batch_size * self.args.world_size
   2124 output.metrics.update(
   2125     speed_metrics(
   2126         metric_key_prefix,
   (...)
   2130     )
   2131 )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/trainer.py:573, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
    564     else:
    565         preds = (
    566             preds_sorted_item_ids,
    567             preds_sorted_item_scores,
    568         )
    570 preds_host = (
    571     preds
    572     if preds_host is None
--> 573     else nested_concat(
    574         preds_host,
    575         preds,
    576     )
    577 )
    579 self.control = self.callback_handler.on_prediction_step(
    580     self.args, self.state, self.control
    581 )
    583 # Gather all tensors and put them back on the CPU
    584 # if we have done enough accumulation steps.

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/utils/torch_utils.py:254, in nested_concat(tensors, new_tensors, padding_index)
    250     return type(tensors)(
    251         nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors)
    252     )
    253 elif isinstance(tensors, torch.Tensor):
--> 254     return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
    255 elif isinstance(tensors, Mapping):
    256     return type(tensors)(
    257         {
    258             k: nested_concat(t, new_tensors[k], padding_index=padding_index)
    259             for k, t in tensors.items()
    260         }
    261     )

File /usr/local/lib/python3.8/dist-packages/transformers4rec/torch/utils/torch_utils.py:278, in torch_pad_and_concatenate(tensor1, tensor2, padding_index)
    275 tensor2 = atleast_1d(tensor2)
    277 if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]:
--> 278     return torch.cat((tensor1, tensor2), dim=0)
    280 # Let's figure out the new shape
    281 new_shape = (
    282     tensor1.shape[0] + tensor2.shape[0],
    283     max(tensor1.shape[1], tensor2.shape[1]),
    284 ) + tensor1.shape[2:]

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.76 GiB (GPU 0; 15.78 GiB total capacity; 7.11 GiB already allocated; 4.54 GiB free; 9.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
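
For reference, the max_split_size_mb allocator hint mentioned at the end of the error can be set via an environment variable before torch initializes CUDA. This is only a sketch of that hint (the value 128 is an arbitrary example); it mitigates fragmentation but does not address the prediction-accumulation issue discussed in the comments below.

import os

# Allocator hint from the error message above: cap the size of blocks the CUDA
# caching allocator keeps around, to reduce fragmentation. Must be set before
# the first CUDA allocation (i.e. before importing/initializing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"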

Steps/Code to reproduce bug

import os

import glob

import cudf
import numpy as np
import pandas as pd

import nvtabular as nvt
from nvtabular.ops import *

base_path = './'
saved_model_path = os.path.join(base_path, "saved_model")
model_output_path = os.path.join(base_path, "output")
NUM_ROWS = os.environ.get("NUM_ROWS", 10000000)
long_tailed_item_distribution = np.clip(np.random.lognormal(12., 1., int(NUM_ROWS)).astype(np.int32), 1, 300000)
# generate random item interaction features 
df = pd.DataFrame(np.random.randint(1, 100000, int(NUM_ROWS)), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
df.to_parquet('df.parquet')

SESSIONS_MAX_LENGTH = 16

# Categorify categorical features
categ_feats = ['item_id'] >> nvt.ops.Categorify()

# Define Groupby Workflow
groupby_feats = categ_feats + ['session_id']

# Group interaction features by session
groupby_features = groupby_feats >> nvt.ops.Groupby(
    groupby_cols=["session_id"], 
    aggs={
        "item_id": ["list", "count"],
        },
    name_sep="-")

sequence_features_truncated_item = (
    groupby_features['item_id-list']
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH, pad=True) 
    >> TagAsItemID()
)  

# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = (
    groupby_features['item_id-count'] + 
    sequence_features_truncated_item
)

filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)

seq_feats_list = filtered_sessions['item_id-list'] >>  nvt.ops.ValueCount()

workflow = nvt.Workflow(seq_feats_list)
dataset = nvt.Dataset(df)

workflow.fit_transform(dataset).to_parquet(os.path.join(base_path, "processed_nvt"))

from transformers4rec.torch.utils.data_utils import MerlinDataLoader
from transformers4rec import torch as tr
from transformers4rec.torch.ranking_metric import NDCGAt, AvgPrecisionAt, RecallAt, PrecisionAt, MeanReciprocalRankAt
from transformers4rec.torch import Trainer
from transformers4rec.config.trainer import T4RecTrainingArguments

from merlin.schema import Schema
from merlin.io import Dataset

train = Dataset(os.path.join(base_path, "processed_nvt/part_0.parquet"))
schema = train.schema
transformed_train_data_path = './processed_nvt/'

INPUTS_d_output = 100 
INPUTS_embedding_dim_default = 64
BODY_MLPBlock_units = 64 
MODEL_d_model = 64
MODEL_n_head = 4
MODEL_n_layer = 3

TRAINING_ARG_learning_rate = 0.0001
TRAINING_ARG_lr_scheduler_type = "cosine"
TRAINING_ARG_learning_rate_num_cosine_cycles_by_epoch = 0.25
TRAINING_ARG_eval_steps = 6_500
TRAINING_ARGS_report_to = "wandb"  # alternative: ["tensorboard"]
wandb_project = "t4rec-sweep"

training_batch_size = 1024
eval_batch_size = 128
use_sampled_softmax = True
SAMPLED_SOFTMAX_number_negatives_uniform = training_batch_size*5
at_top_ks = [1, 5, 10]

SESSIONS_MAX_LENGTH = 16
INPUTS_masking = "mlm"

inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=SESSIONS_MAX_LENGTH,
    # continuous_projection=64,
    masking=INPUTS_masking,
    d_output=INPUTS_d_output,
    embedding_dim_default=INPUTS_embedding_dim_default,
)
################################################################################
# Define XLNetConfig class and set default parameters for HF XLNet config  
transformer_config = tr.XLNetConfig.build(
    d_model=MODEL_d_model, n_head=MODEL_n_head, n_layer=MODEL_n_layer, total_seq_length=SESSIONS_MAX_LENGTH
)  
# Define the model block including: inputs, masking, projection and transformer block.
body = tr.SequentialBlock(
    inputs, 
    tr.MLPBlock([MODEL_d_model]), 
    tr.TransformerBlock(transformer_config, masking=inputs.masking))

# Define the evaluation top-N metrics and the cut-offs    
metrics = [PrecisionAt(top_ks=at_top_ks, labels_onehot=True),
           RecallAt(top_ks=at_top_ks, labels_onehot=True),
           NDCGAt(top_ks=at_top_ks, labels_onehot=True),  
           MeanReciprocalRankAt(top_ks=at_top_ks, labels_onehot=True)]

if use_sampled_softmax:
    prediction_task = tr.NextItemPredictionTask(
        sampled_softmax = True,
        max_n_samples=SAMPLED_SOFTMAX_number_negatives_uniform,
        weight_tying=True, 
        metrics=metrics)    
else:
    prediction_task = tr.NextItemPredictionTask(
        weight_tying=True, 
        metrics=metrics)       

head = tr.Head(
    body,
    prediction_task,
    inputs=inputs,
)

model = tr.Model(head)

train_args = T4RecTrainingArguments(
    # overwrite_output_dir=True,  # if True, overwrite the content of output_dir (use to continue training from a checkpoint)
    data_loader_engine="merlin",
    dataloader_drop_last=True,
    gradient_accumulation_steps=1,
    per_device_train_batch_size=training_batch_size,
    per_device_eval_batch_size=training_batch_size,
    output_dir=model_output_path,
    learning_rate=TRAINING_ARG_learning_rate,
    lr_scheduler_type=TRAINING_ARG_lr_scheduler_type,
    learning_rate_num_cosine_cycles_by_epoch=TRAINING_ARG_learning_rate_num_cosine_cycles_by_epoch,
    num_train_epochs=1,
    max_sequence_length=SESSIONS_MAX_LENGTH,
    no_cuda=False,
    do_eval=False,
    logging_steps=50,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=train_args,
    schema=schema,
    compute_metrics=True,
)

train_data_paths = glob.glob(os.path.join(base_path, 'processed_nvt/', "part_0.parquet"))

trainer.train_dataset_or_path = train_data_paths
trainer.reset_lr_scheduler()
trainer.train()
trainer.state.global_step += 1

eval_data_paths = glob.glob(os.path.join(base_path, 'processed_nvt/', "part_0.parquet"))
trainer.eval_dataset_or_path = eval_data_paths

eval_metrics = trainer.evaluate(metric_key_prefix='eval')

I installed the main branches on the merlin-pytorch:23.02 image, where the torch version is 1.13.1, and I am still getting the same error.

sararb commented 1 year ago

Based on our discussion, the problem might be caused by this line in the evaluation_loop of T4Rec.

In fact, there is an HF argument called eval_accumulation_steps that determines whether preds_host is moved to the CPU every eval_accumulation_steps steps:

  1. If eval_accumulation_steps == None (the default): the predictions are never copied to the CPU, and each batch's predictions keep accumulating in preds_host, which stays on the GPU ==> faster evaluation.
  2. If eval_accumulation_steps > 0: the predictions are moved to the CPU and GPU memory is freed by resetting preds_host to None every eval_accumulation_steps steps ==> slower evaluation.

So with a large item catalog (the use case of this bug ticket), one may need to experiment with different values of eval_accumulation_steps to find the best trade-off between GPU memory and evaluation time. A sketch of that workaround follows.
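
For example, a minimal sketch reusing the setup from the reproduction script above (the value 10 for eval_accumulation_steps is arbitrary and should be tuned; T4RecTrainingArguments inherits this argument from the HF TrainingArguments):

train_args = T4RecTrainingArguments(
    data_loader_engine="merlin",
    dataloader_drop_last=True,
    per_device_train_batch_size=training_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    output_dir=model_output_path,
    num_train_epochs=1,
    max_sequence_length=SESSIONS_MAX_LENGTH,
    # Move the accumulated predictions to the CPU every 10 evaluation steps so
    # preds_host does not grow without bound on the GPU (slower, but avoids OOM).
    eval_accumulation_steps=10,
    report_to=[],
)

trainer = Trainer(model=model, args=train_args, schema=schema, compute_metrics=True)
trainer.eval_dataset_or_path = eval_data_paths
eval_metrics = trainer.evaluate(metric_key_prefix="eval")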

Xuyike commented 1 year ago

I ran into the same issue. Following your approach does solve the "CUDA out of memory" error, but when eval_accumulation_steps is used, the program gets stuck for a long time after running some steps, then runs again, and is eventually killed automatically. What might be the reason for this?

Looking forward to your reply. Thanks!

rnyak commented 1 year ago

@Xuyike thanks. We observed the same issue and are still looking into it; we will hopefully come up with a fix soon.