gabrielspmoreira closed this pull request 1 year ago
https://nvidia-merlin.github.io/Transformers4Rec/review/pr-721
@gabrielspmoreira the `test_stochastic_swap_noise_with_tabular_features` test is failing. This test is flaky. Can we increase the absolute delta in the test's assert to avoid spurious failures?
Fixes #693
Goals :soccer:

- Avoid OOM (out-of-memory) errors when calling `Trainer.evaluate()` or `Trainer.predict()` on large datasets

Implementation Details :construction:
- The `Trainer.evaluate()` and `Trainer.predict()` methods both call the `Trainer.evaluation_loop()` method. Besides computing metrics, it also keeps accumulating the `predictions` (batch size, item cardinality) and `labels` (batch size) tensors as batches are processed. The `predictions` tensor becomes very large very quickly and leads to OOM on CUDA. If `T4RecTrainingArguments.eval_accumulation_steps` is set, every N steps the accumulated predictions/labels are moved from GPU memory to the larger CPU memory, but then you might hit OOM on the CPU too after some steps. For example, for a batch size of 1024 and an item cardinality of 300,000, each batch requires 1.14 GB of memory, and batches keep accumulating until OOM occurs.
- The `T4RecTrainingArguments.predict_top_k`
argument has existed since earlier versions of the library. It limits accumulation to only the top-k predictions, both in memory and in the resulting tensor when you use `trainer.predict()`. But it was set to `None` by default, which means the user had to be aware of it and set it manually to avoid these issues when evaluating or predicting on large datasets. Note: `predict_top_k` does not affect the metrics calculation, but it does affect the memory consumption of `trainer.evaluate()`.
- Sets `T4RecTrainingArguments.predict_top_k = 100` as a default value, which avoids OOM issues in most cases and still returns a reasonable number of top-k predictions from `model.predict()`.
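The memory figures above follow from simple arithmetic on float32 scores. A minimal sketch (the helper name `prediction_bytes` is illustrative, not part of the T4Rec API):

```python
# Back-of-the-envelope memory estimate for the accumulated `predictions`
# tensor, assuming float32 scores (4 bytes each).
def prediction_bytes(batch_size, num_scores, bytes_per_score=4):
    return batch_size * num_scores * bytes_per_score

GiB = 1024 ** 3

# Full score matrix: batch size 1024 x item cardinality 300,000
full = prediction_bytes(1024, 300_000)
# With predict_top_k=100, only the top-100 scores per row are kept
topk = prediction_bytes(1024, 100)

print(f"full scores per batch:    {full / GiB:.2f} GiB")   # ~1.14 GiB
print(f"top-100 scores per batch: {topk / 1024:.0f} KiB")  # 400 KiB
```

This is why accumulating full score matrices across evaluation steps exhausts GPU (and eventually CPU) memory, while the top-100 default keeps the footprint negligible.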
- IMPORTANT: This changes the default output of the `trainer.predict()` API, which returns a `PredictionOutput` object with a `predictions` property. Before this change, when the `predict_top_k` option was not set (the default), the `predictions` property was a 2D tensor (batch size, item cardinality) with the scores for all items. As we now set `T4RecTrainingArguments.predict_top_k` by default, the `predictions` property returns a tuple of (top-100 predicted item ids, top-100 prediction scores).
- Refactored the `Trainer.evaluation_loop()` method
to fix and clarify the interplay between the `T4RecTrainingArguments.predict_top_k` option and the `model.top_k` property. `model.top_k` was introduced recently to let the model return only the top-k prediction scores/item ids instead of scores for all items, in order to serve T4Rec models more efficiently in Triton. `model.top_k` only limits the returned items in inference mode (i.e., when not training or evaluating), which is the case both for Triton inference and for `trainer.predict()`. So setting `model.top_k` caps the number of predictions we can get from `trainer.predict()`. For that reason, we now raise an exception if `T4RecTrainingArguments.predict_top_k > model.top_k`.
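The new tuple-shaped `predictions` output described above can be sketched as follows. `top_k_from_scores` is a hypothetical stand-in for the internal top-k selection, shown with plain Python lists instead of tensors:

```python
# Sketch: reduce a (batch size, item cardinality) score matrix to the
# (top-k item ids, top-k scores) pair that `trainer.predict()` now
# returns by default (with k=100). Not the actual T4Rec implementation.
def top_k_from_scores(scores, k=100):
    """Return (top-k item ids, top-k scores) per row, scores descending."""
    ids, vals = [], []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        ids.append(ranked)
        vals.append([row[i] for i in ranked])
    return ids, vals

scores = [[0.1, 0.9, 0.3, 0.7],
          [0.5, 0.2, 0.8, 0.4]]
item_ids, item_scores = top_k_from_scores(scores, k=2)
print(item_ids)     # [[1, 3], [2, 0]]
print(item_scores)  # [[0.9, 0.7], [0.8, 0.5]]
```

Code that previously indexed `PredictionOutput.predictions` as a single 2D score tensor needs to unpack this tuple instead.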
Testing Details :mag:

- Added the `test_trainer_predict_top_k_x_top_k` test to check all possible combinations of `T4RecTrainingArguments.predict_top_k` and `model.top_k` values
- Updated the `test_trainer_predict_topk` test to check that the number of top-k predictions matches `T4RecTrainingArguments.predict_top_k`, and to ensure the test breaks if the default value changes from `100`
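The rule the combination test exercises can be sketched as below; `validate_top_k` is a hypothetical helper illustrating the check described in the implementation details, not the actual T4Rec code:

```python
# Illustrative guard: predict_top_k must not exceed model.top_k, because
# model.top_k caps how many predictions inference can return.
def validate_top_k(predict_top_k, model_top_k):
    if (model_top_k is not None and predict_top_k is not None
            and predict_top_k > model_top_k):
        raise ValueError(
            "T4RecTrainingArguments.predict_top_k must be <= model.top_k"
        )

# Valid combinations: model.top_k unset, equal, or larger
validate_top_k(100, None)
validate_top_k(100, 100)
validate_top_k(50, 100)

# Invalid: predict_top_k larger than model.top_k raises
try:
    validate_top_k(100, 20)
except ValueError as e:
    print(e)
```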