NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0

serve_reward_model goes down #351

Open AtsunoriFujita opened 2 weeks ago

AtsunoriFujita commented 2 weeks ago

Describe the bug

When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, all of which have a long context.

error.log

What we did

Steps/Code to reproduce bug

export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424

Before running attribute_annotate.py, you should apply #350.
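
For reference, one way to pin down exactly when the server goes down during annotation is to poll the serving port from a second shell. This is only a sketch: it assumes the server runs on localhost with the port from the commands above, and it checks TCP reachability only, not the inference API itself.

import socket
import time

HOST, PORT = "localhost", 1424  # assumed host; port matches inference.port above

def port_is_open(host, port, timeout=2.0):
    # Returns True if a TCP connection to host:port succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Poll every 5 seconds and print a timestamp once the server stops answering.
    while port_is_open(HOST, PORT):
        time.sleep(5)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), f"server on port {PORT} is down")

Cross-checking that timestamp against attribute_annotate.py's progress narrows down which sample triggers the crash.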

Expected behavior

The process completes without the server going down.


arthrod commented 4 days ago

Could you try forcing the micro batch size to 2?

AtsunoriFujita commented 1 day ago

I tried several values for micro_batch_size, but none of them solved the issue.

I've attached the sample (from oasst) that causes the error: error_sample.txt

No errors occur with nvcr.io/nvidia/nemo:24.05.01.

AtsunoriFujita commented 1 day ago

These are the samples in the oasst dataset that cause errors: error_samples.txt
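
Since the failing samples all have long contexts, a quick way to flag likely candidates in the annotation input is to rank the jsonl rows by size. This is only a sketch: it uses the raw JSON line length as a crude proxy for context length, whereas the real limit depends on the model's tokenizer and sequence length.

import json
import sys

# usage: python find_long_samples.py data/oasst/train.jsonl [top_n]
path = sys.argv[1]
top_n = int(sys.argv[2]) if len(sys.argv) > 2 else 20

lengths = []
with open(path, encoding="utf-8") as f:
    for idx, line in enumerate(f):
        line = line.strip()
        if not line:
            continue
        json.loads(line)  # fail fast on malformed rows
        lengths.append((len(line), idx))

# Longest rows first; these are the most likely to trip the server.
for length, idx in sorted(lengths, reverse=True)[:top_n]:
    print(f"line {idx}: {length} characters")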