allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Evaluate QRM reward models #195

Closed Nicolinho closed 1 month ago

Nicolinho commented 1 month ago

Hi, could you please evaluate the QRM reward model (https://huggingface.co/nicolinho/QRM-Llama3.1-8B)? I had to add an argument to the script so that no model kwargs are passed to model_builder, since otherwise they interfere with the datatypes. You can run the evaluation with the following command:

export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py  --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
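For context, a flag like that could be wired up roughly as follows; this is a minimal sketch rather than the actual run_rm.py, and the handling around model_builder is illustrative:

import argparse
import torch

# Sketch of a --no_model_kwargs flag: when set, skip the default torch_dtype /
# device_map kwargs so the model's remote code controls the datatypes itself.
# Argument names mirror the command above; the rest is illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--trust_remote_code", action="store_true")
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--attn_implementation", type=str, default=None)
parser.add_argument("--no_model_kwargs", action="store_true",
                    help="pass no extra kwargs to the model builder")
args = parser.parse_args()

model_kwargs = {} if args.no_model_kwargs else {"torch_dtype": torch.float16, "device_map": "auto"}
# model = model_builder(args.model, trust_remote_code=args.trust_remote_code, **model_kwargs)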

Thank you!

natolambert commented 1 month ago

Hey @Nicolinho, which specific arg is causing an issue with it? I was wondering if we can do this in a more general way, or by adding a model config to rewardbench/models/__init__.py?

Also, I'll have some comments on the Skywork dataset soon; it seems like there is some contamination.
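For reference, a per-model entry along those lines might look roughly like this; it assumes the config is a dict mapping model names to builder settings, and the dict and key names are illustrative rather than the actual rewardbench API:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative registration of a model-specific config; the layout and key names
# are assumptions, not the actual rewardbench/models/__init__.py contents.
MODEL_CONFIGS = {
    "nicolinho/QRM-Llama3.1-8B": {
        "model_builder": AutoModelForSequenceClassification.from_pretrained,
        "tokenizer_builder": AutoTokenizer.from_pretrained,
        "model_kwargs": {},  # skip torch_dtype / device_map overrides
        "trust_remote_code": True,
    },
}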

Nicolinho commented 1 month ago

@natolambert Both the torch_dtype and the device_map caused problems for me. I updated the PR to load the model manually via rewardbench/models/__init__.py. You should be able to run it with: `export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2`
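Loading the model manually, without the default torch_dtype / device_map overrides, could look roughly like this using standard transformers calls; it is a sketch, not necessarily the exact code added in __init__.py:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch: let the model's remote code decide the dtypes instead of passing
# torch_dtype / device_map from the runner.
model_name = "nicolinho/QRM-Llama3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
model.eval().cuda()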

natolambert commented 1 month ago

@Nicolinho have you tried other models too? I'm just trying to understand the device_map issue on your setup. I do know that handling multi-GPU setups better would help.

Second, if the other code is no longer needed, can you remove it?

Third, can you run `make style` and `make quality`?

Nicolinho commented 1 month ago

1. I did not try other models.
2. I removed the code that is no longer needed.
3. I ran `make style` and `make quality` to fix the style.

To evaluate the model trained with the Skywork dataset, you can run:

export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py  --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}

To evaluate the model trained without the Skywork dataset, using Llama 3 as the base model, you can run:

export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py  --model nicolinho/QRM-Llama3-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 
# {'Chat': 0.9581005586592178, 'Chat Hard': 0.8048245614035088, 'Safety': 0.8986486486486487, 'Reasoning': 0.9753028318873792}

natolambert commented 1 month ago

Thanks @Nicolinho! Looks good, should be able to merge this shortly :)

natolambert commented 1 month ago

@Nicolinho do I need `ACCELERATE_MIXED_PRECISION=bf16`? I don't like one-off ways to run models. I'll try setting the datatype to bfloat16.

Nicolinho commented 1 month ago

@natolambert The environment variable is needed, as the quantile regression head is trained in fp32; using bfloat16 for it degrades performance somewhat.
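For context, accelerate's bf16 mixed precision generally keeps the weights in fp32 (so the quantile regression head stays full precision) and autocasts the forward pass to bfloat16. A rough standalone approximation, outside the rewardbench pipeline and with an illustrative input, would be:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sketch of what the env var amounts to: fp32 weights (including the quantile
# regression head) with the forward pass autocast to bf16. The output format
# depends on the model's remote code.
model_name = "nicolinho/QRM-Llama3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True
).cuda().eval()  # weights stay fp32

inputs = tokenizer("How do I bake bread?", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model(**inputs)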