NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

serve_reward_model goes down #351

Open AtsunoriFujita opened 1 month ago

AtsunoriFujita commented 1 month ago

Describe the bug

When we start serve_reward_model.py and run annotation, the server goes down during processing. It crashes on specific samples, all of which have long contexts.

error.log
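
Since the failures seem tied to context length, one quick check is to see which records in the annotation input (data/oasst/train.jsonl from the steps below) are unusually long. This is only a minimal sketch that uses raw JSONL line length as a rough proxy for context length, not a tokenizer-accurate count:

# Minimal sketch: list the longest records in the annotation input.
# Raw line length is a rough proxy for context length; swap in the
# model's tokenizer for an exact token count if needed.
lengths = []
with open("data/oasst/train.jsonl", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        lengths.append((len(line), idx))

# Show the ten longest records so they can be inspected or held out.
for length, idx in sorted(lengths, reverse=True)[:10]:
    print(f"record {idx}: {length} characters")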

What we did

Steps/Code to reproduce bug

export HYDRA_FULL_ERROR=1
export MODEL="/workspace/models/Llama3-70B-SteerLM-RM"

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst

python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
    rm_model_file=${MODEL} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=8 \
    ++model.pipeline_model_parallel_size=1 \
    inference.inference_micro_batch_size=2 \
    inference.port=1424

python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
      --input-file=data/oasst/train.jsonl \
      --output-file=data/oasst/train_labeled.jsonl \
      --port=1424

Before running attribute_annotate.py, you need to apply #350.
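
To narrow the crash down to a single request, a probe along the lines below could replay one suspect record against the running server. This is only a sketch: the PyTriton model name ("reward_model"), the input tensor name ("sentences"), and the use of inference.port as the client port are assumptions and should be checked against the client code in attribute_annotate.py.

# Sketch of a single-request probe against the server started by
# serve_reward_model.py. The model name, tensor name, and port usage
# below are assumptions; verify them against attribute_annotate.py.
import numpy as np
from pytriton.client import ModelClient

prompt = "..."  # paste one long-context sample that triggers the crash

# PyTriton passes string inputs as UTF-8 encoded byte arrays.
sentences = np.char.encode(np.array([[prompt]]), "utf-8")

with ModelClient("localhost:1424", "reward_model") as client:
    result = client.infer_batch(sentences=sentences)
    print(result)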

Expected behavior

The annotation process completes without the server going down.


arthrod commented 1 month ago

Could you try enforcing a micro batch size of 2?

AtsunoriFujita commented 3 weeks ago

I tried several settings for micro_batch_size, but none of them solved the issue.

I attached the sample (from oasst) that causes the error: error_sample.txt

No errors occur with nvcr.io/nvidia/nemo:24.05.01.

AtsunoriFujita commented 3 weeks ago

These are the samples causing errors in the oasst dataset: error_samples.txt
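
If the goal is just to finish labeling while the crash is investigated, a workaround sketch like the one below can drop the known-bad records before annotation. It assumes error_samples.txt contains the raw JSONL lines of the failing records copied verbatim from train.jsonl; adjust the matching if the attachment is formatted differently.

# Workaround sketch: write a copy of the annotation input without the
# records listed in error_samples.txt (assumed to be raw JSONL lines).
bad = set()
with open("error_samples.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            bad.add(line.strip())

kept = dropped = 0
with open("data/oasst/train.jsonl", encoding="utf-8") as src, \
     open("data/oasst/train_filtered.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip() in bad:
            dropped += 1  # skip the samples known to bring the server down
            continue
        dst.write(line)
        kept += 1

print(f"kept {kept} records, dropped {dropped}")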