AtsunoriFujita opened this issue 1 month ago (status: Open)
Describe the bug
When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, and these samples have a long context. See the attached error.log.
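For reference, the crash can be reproduced without the full annotation script by sending a failing sample straight to the served model. The sketch below assumes serve_reward_model.py exposes a PyTriton endpoint; the port (1424), the model name ("reward_model"), and the input tensor name ("sentences") are assumptions and should be adjusted to match the actual serve command.

```python
# Minimal sketch: send one long-context sample to the reward model server.
# The port, model name, and input tensor name below are assumptions; change
# them to whatever serve_reward_model.py was started with.
import numpy as np
from pytriton.client import ModelClient

def to_str_tensor(texts):
    # PyTriton expects UTF-8 encoded byte arrays with an explicit batch dimension.
    arr = np.array(texts)[:, np.newaxis]
    return np.char.encode(arr, "utf-8")

# One of the attached long-context samples that triggers the crash.
long_sample = open("error_sample.txt", encoding="utf-8").read()

with ModelClient("localhost:1424", "reward_model") as client:
    result = client.infer_batch(sentences=to_str_tensor([long_sample]))

# With nvcr.io/nvidia/nemo:24.07 the server goes down before a result is returned.
print(result)
```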
What we did
- We also tried nvidia/Llama2-13B-SteerLM-RM, but ran into the same issue.
- No errors occur with nvcr.io/nvidia/nemo:24.05.01 (#219 is the main difference).
- Annotation also became slower, taking 7 hours with nvcr.io/nvidia/nemo:24.07 compared with nvcr.io/nvidia/nemo:24.05.01.

Steps/Code to reproduce bug
Before running attribute_annotate.py, you need to apply #350.

Expected behavior
The process completes without the server going down.
Environment overview (please complete the following information)
Docker image: nvcr.io/nvidia/nemo:24.07
Environment details
The NVIDIA docker image listed above is used, so no further details are provided.
Comments

> Could you enforce the micro batches at 2?

I tried several patterns for micro_batch_size, but they didn't solve the issue. I attached the sample (from oasst) causing the error: error_sample.txt. No errors occur with nvcr.io/nvidia/nemo:24.05.01.

These samples in the oasst dataset are causing errors: error_samples.txt
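Since the failures seem tied to long-context samples, a quick way to pull the longest records out of the annotation input for testing is sketched below. The file names and the cut-off of 20 records are placeholders, and serialized length is only a rough proxy for context length.

```python
import json

# Read the JSON-lines annotation input (placeholder file name).
with open("train.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Sort by serialized length as a rough proxy for context length.
records.sort(key=lambda r: len(json.dumps(r, ensure_ascii=False)), reverse=True)

# Keep the 20 longest records as a minimal reproduction set.
with open("long_samples.jsonl", "w", encoding="utf-8") as out:
    for rec in records[:20]:
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
```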