Based on our discussion, the problem is likely caused by this line in the `evaluation_loop` of T4Rec. There is an HF argument, `eval_accumulation_steps`, that we use to decide whether to move `preds_host` to the CPU after every `eval_accumulation_steps` steps:

- `eval_accumulation_steps=None` (default value): we do not copy the data to the CPU and keep adding batch predictions to `preds_host` (which lives on the GPU) ==> this ensures a faster evaluation, but GPU memory usage grows as predictions accumulate.
- `eval_accumulation_steps > 0`: we move the data to the CPU and free up GPU memory by setting `preds_host` to `None` after every `eval_accumulation_steps` steps ==> this results in a slower evaluation, but bounded GPU memory.

So with a large item catalog (the use case of this bug ticket), one may need to experiment with different values of `eval_accumulation_steps` to find the optimal trade-off between GPU memory and evaluation time; a minimal sketch of this configuration is shown below.
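For reference, here is a minimal sketch of how one might set this argument when evaluating with T4Rec. `T4RecTrainingArguments` inherits `eval_accumulation_steps` from Hugging Face's `TrainingArguments`; the `model`, `schema`, and dataset path below are placeholders, and the exact `Trainer` signature may differ between versions:

```python
# Minimal sketch (not the exact code from this repo): tuning
# eval_accumulation_steps to trade evaluation speed for GPU memory.
from transformers4rec.config.trainer import T4RecTrainingArguments
from transformers4rec.torch import Trainer

training_args = T4RecTrainingArguments(
    output_dir="./tmp",
    per_device_eval_batch_size=128,
    # None (default): predictions accumulate in preds_host on the GPU
    # -> fastest evaluation, but may OOM with a large item catalog.
    # N > 0: predictions are moved to the CPU and preds_host is freed
    # every N steps -> lower GPU memory, slower evaluation.
    eval_accumulation_steps=10,
)

trainer = Trainer(
    model=model,          # a T4Rec model, assumed already defined
    args=training_args,
    schema=schema,        # Merlin schema for the dataset, assumed defined
    compute_metrics=True,
)
trainer.eval_dataset_or_path = "data/valid.parquet"  # placeholder path
metrics = trainer.evaluate()
```

Starting with a small value (e.g. 10) and increasing it until evaluation time becomes acceptable, while watching GPU memory, is one reasonable way to search for the trade-off described above.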
I ran into the same issue. Following your approach does solve the "CUDA out of memory" problem, but when `eval_accumulation_steps` is used, the program hangs for a long time after running some steps, then resumes, and is eventually killed automatically. What might be the reason for this?
Looking forward to your reply. Thanks!
@Xuyike Thanks. We observed the same issue; we are still looking into it and will hopefully come up with a fix soon.
Bug description

I am getting the following error from the `trainer.evaluate()` step when using the merlin-pytorch 23.04 image. The same code with the same data trains and evaluates without any issues when I use the 23.02 image.

Steps/Code to reproduce bug
I installed the main branches on the merlin-pytorch:23.02 image, where the torch version is 1.13.1, and I am still getting the same error.