allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Best Practice for Including Reward-Bench with Local Modifications in Our Repo #198

Open hank0316 opened 1 week ago

hank0316 commented 1 week ago

Hi Nathan,

I’m currently preparing to release a new repository that contains the code used in my paper. As part of our experiments, we made some slight modifications to the reward-bench code (we're using the v0.1.0-dev version).

The changes include:

  1. Uploading the evaluation results to Weights & Biases (wandb).
  2. Loading our model in float16 rather than with load_in_8bit=True (I'm not sure what load_in_8bit actually does, and I just want to load our RM in fp16).
  3. Adding a system prompt to the messages passed to tokenizer.apply_chat_template (a rough sketch of changes 2 and 3 follows this list).
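
For concreteness, changes 2 and 3 look roughly like this (the model name, system prompt, and messages are placeholders, not our actual setup):

```python
# Change 2: load the RM in fp16 instead of passing load_in_8bit=True.
# Change 3: prepend a system prompt before applying the chat template.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "my-org/my-reward-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 weights, no 8-bit quantization
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # added system prompt
    {"role": "user", "content": "How do I sort a list in Python?"},
    {"role": "assistant", "content": "Use the built-in sorted() function."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
score = model(**inputs).logits  # reward score for the formatted conversation
```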

I'm reaching out to ask about the best practice for including reward-bench, along with our changes, in our repo. At the moment, we’ve removed the .git directory and committed the entire reward-bench codebase to our repository.

Is there a better approach to incorporate reward-bench while maintaining our local modifications? Any advice would be much appreciated!

Thank you for your time and help!

Best regards, Tzu-Han

sanderland commented 1 week ago

Having used the new setup recently, I'd say it's all somewhat in flux, but I'm personally trying to avoid copying or depending on a fork.

My first thought would be:

natolambert commented 1 week ago

Yeah I mostly agree with @sanderland, but I understand if it's a research project / lesson on using open source code.

Specifics:

  1. I'd happily integrate wandb logging into main (@vwxyzjn and I have discussed how I'm rebuilding the wandb integration with some recent features).
  2. Honestly, I'm not 100% sure either. Pretty crazy to admit, but I need to look; this is what happens when you copy inference code around a ton of times, and I'd bet a bunch of the args are deprecated. My understanding is that you want to load the model in the right datatype, and load_in_8bit then converts the weights to 8-bit, which saves memory, but some models don't quantize well (a rough sketch of the two paths follows this list). TLDR: I'm very happy to have help further improving the datatype handling.
  3. A system prompt is also great! Most models don't have one, but it's coming.
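
For reference, the two loading paths look roughly like this (not the exact reward-bench code; the model name is a placeholder):

```python
# Rough sketch of the two loading paths discussed above;
# the 8-bit path needs the bitsandbytes package installed.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

model_name = "my-org/my-reward-model"  # placeholder

# Plain fp16: weights are loaded directly in half precision.
rm_fp16 = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.float16
)

# 8-bit: weights are quantized to int8 on load, which saves memory,
# but some models lose accuracy when quantized.
rm_int8 = AutoModelForSequenceClassification.from_pretrained(
    model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
```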

Operationally, it's not that easy to extract changes from a flattened repo. Normally you want to make a new fork and re-apply them. I don't really have much time to do that, but I would love to see the additions. I'm sure claude/chatgpt can whip up some quick scripts for creating git diffs from a specific commit, something like the sketch below.
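
One possible shape of such a script, written here as a small Python wrapper around git (the paths, tag, and output file name are placeholders):

```python
# Sketch: clone upstream at the base version, overlay the flattened copy,
# and let git produce the patch. New files would additionally need
# `git add -N` to show up in the diff.
import shutil
import subprocess

UPSTREAM = "https://github.com/allenai/reward-bench.git"
BASE_TAG = "v0.1.0-dev"                    # version the local changes were made on
CLONE_DIR = "rb-upstream"
MODIFIED_TREE = "../my-repo/reward-bench"  # flattened copy with local edits, no .git

subprocess.run(
    ["git", "clone", "--depth", "1", "--branch", BASE_TAG, UPSTREAM, CLONE_DIR],
    check=True,
)
# Overlay the modified files onto the fresh clone (the clone's .git stays intact).
shutil.copytree(MODIFIED_TREE, CLONE_DIR, dirs_exist_ok=True)
# The working-tree diff against the clean checkout is exactly the local changes.
patch = subprocess.run(
    ["git", "-C", CLONE_DIR, "diff"], check=True, capture_output=True, text=True
).stdout
with open("reward-bench-local-changes.patch", "w") as f:
    f.write(patch)
```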

Let me know what you think @hank0316

hank0316 commented 1 week ago

Thank you guys @natolambert and @sanderland for the replies.

My thoughts:

  1. wandb: Currently I use wandb mainly to upload the evaluation results for each subset (a rough sketch follows this list). Integrating wandb might not be essential for releasing my code, since users can get the evaluation results from stdout or the results folder, but I'm open to helping further if needed. Please let me know how I can contribute to this part.
  2. model loading args: In my implementation I specify torch_dtype=torch.float16 and set load_in_8bit=False when loading the RM, since I didn't use 8-bit quantization. After a quick look at the latest version of reward-bench, it seems we can keep the existing loading logic without any modifications, but I'm happy to discuss refinements to stay aligned with any new updates.
  3. System prompt: I'm willing to contribute to this great evaluation tool. Maybe I can open a PR to add the system prompt as an optional inference feature?
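
The wandb piece is roughly the following (the project name, run name, and scores are placeholders, not our actual results):

```python
# Rough sketch of pushing per-subset results to wandb.
import wandb

subset_scores = {"alpacaeval-easy": 0.97, "mt-bench-hard": 0.71}  # placeholder scores

run = wandb.init(project="reward-bench-eval", name="my-rm-fp16")
wandb.log({f"eval/{subset}": score for subset, score in subset_scores.items()})
run.finish()
```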

I appreciate your guidance on these modifications. Please let me know if there are specific procedures or additional insights required for integrating these changes.

Thanks again for your time and help!

natolambert commented 6 days ago

Yup @hank0316, opening PR(s) is best. I'll provide feedback from there.