Hey @Nicolinho, which specific arg is causing an issue with it? I was wondering if we can handle this in a more general way, e.g. by adding a model config to `rewardbench/models/__init__.py` (sketched below)?
Also, I'll have some comments on the Skywork dataset soon; it seems like there is some contamination.
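For reference, a hedged sketch of what such a registry entry might look like; the dict name and keys are assumptions modeled on how other models are registered in `rewardbench/models/__init__.py`, and the builder choice is illustrative rather than what the PR actually uses:

```python
from transformers import AutoModelForSequenceClassification, pipeline

# Assumed registry structure in rewardbench/models/__init__.py;
# key names here are illustrative, not confirmed.
REWARD_MODEL_CONFIG = {
    "nicolinho/QRM-Llama3.1-8B": {
        "model_builder": AutoModelForSequenceClassification.from_pretrained,
        "pipeline_builder": pipeline,
        "quantized": False,
        "custom_dialogue": False,
        "model_type": "Seq. Classifier",
    },
}
```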
@natolambert Both `torch_dtype` and `device_map` caused problems for me. I updated the PR to load the model manually via `rewardbench/models/__init__.py`. You should be able to run it with: `export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2`
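A minimal sketch of what such a manual loader could look like; the function name is hypothetical and the dropped kwargs follow the problems described above, so this is an illustration rather than the PR's actual code:

```python
from transformers import AutoModelForSequenceClassification


def qrm_model_builder(model_name_or_path, **kwargs):
    # Drop the kwargs that clash with the model's remote code;
    # everything else is forwarded to from_pretrained as usual.
    kwargs.pop("torch_dtype", None)
    kwargs.pop("device_map", None)
    return AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path, trust_remote_code=True, **kwargs
    )
```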
@Nicolinho have you tried other models too? Just trying to understand the device-map issue on your setup; I know better multi-GPU handling would help here. Second, if the other code is no longer needed, can you remove it? Third, can you run `make style` and `make quality`?
To evaluate the model trained with the Skywork dataset, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs
# {'Chat': 0.946927374301676, 'Chat Hard': 0.8991228070175439, 'Safety': 0.922972972972973, 'Reasoning': 0.9578115621760245}
```

To evaluate the model trained without the Skywork dataset, using Llama 3 as the base, you can run:

```bash
export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2
# {'Chat': 0.9581005586592178, 'Chat Hard': 0.8048245614035088, 'Safety': 0.8986486486486487, 'Reasoning': 0.9753028318873792}
```
Thanks @Nicolinho! Looks good, should be able to merge this shortly :)
@Nicolinho do I need `ACCELERATE_MIXED_PRECISION=bf16`? I don't like one-off ways to run models. I'll try setting the datatype to bfloat16.
@natolambert The setting is needed, as the quantile regression head is trained in fp32; running it in bfloat16 degrades the performance somewhat.
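To illustrate the precision point, a toy sketch (not the QRM implementation; the layer sizes and quantile count are made up):

```python
import torch
import torch.nn as nn

backbone = nn.Linear(4096, 4096, dtype=torch.bfloat16)    # stands in for the bf16 LM trunk
quantile_head = nn.Linear(4096, 19, dtype=torch.float32)  # regression head kept in fp32

x = torch.randn(2, 4096, dtype=torch.bfloat16)
features = backbone(x)
# Upcast activations so the head's fp32 weights are used as trained;
# casting the whole model to bf16 (e.g. via torch_dtype) would instead
# downcast these weights and degrade accuracy.
rewards = quantile_head(features.float())
print(rewards.dtype)  # torch.float32
```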
Hi, could you please evaluate the QRM reward model (https://huggingface.co/nicolinho/QRM-Llama3.1-8B)? I had to add a `--no_model_kwargs` argument to the script so that no model kwargs are passed to the `model_builder`, since they otherwise mess with the datatypes. You can run the evaluation with the following command: `export ACCELERATE_MIXED_PRECISION=bf16; python run_rm.py --model nicolinho/QRM-Llama3.1-8B --trust_remote_code --batch_size 1 --attn_implementation flash_attention_2 --no_model_kwargs`

Thank you!