allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

rewardbench.py results are different for different batch sizes for beaver-7b #137

Closed andrewsiah closed 4 weeks ago

andrewsiah commented 1 month ago

Thank you for the great work on rewardbench, as it's been super helpful in evaluating/researching reward models.

I've been wrapping your rewardbench.py code to run the reward models published on the leaderboard.

I noticed, however, that the reward scores differ when only the batch size changes. E.g., the only difference between the following runs is batch_size = 1, 2, 3:

rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=1

rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=2

rewardbench --model=PKU-Alignment/beaver-7b-v1.0-cost --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=3

These runs produce different scores:

  1. 0.429
  2. 0.476
  3. 0.480

Please help. I've been trying to debug this, and I think it has to do with the model pipeline itself: I checked that the texts going in are identical, but when the batch sizes differ, the output scores differ.

Does it have to do with padding or truncation? I made sure max_length is the same.

@natolambert

andrewsiah commented 1 month ago

This matters because sometimes when we're ranking two (prompt, response) pairs, the output difference is large enough that the ordinal preference changes.

Thank you.

natolambert commented 1 month ago

Hey @andrewsiah, can you check whether this happens for other models? We want to make sure the issue is isolated to this model.

When it comes to the beaver models, the only code I added is actually https://github.com/allenai/reward-bench/blob/e59cf242c316f18f73d77568653f56e99255658e/rewardbench/models/beaver.py#L482C1-L504C35

The rest is copied directly from the safe RLHF repo. https://github.com/PKU-Alignment/safe-rlhf

natolambert commented 1 month ago

To be clear, the beaver models aren't designed to be used at inference like this, so if it's isolated to this model it is interesting but not surprising to me.

They're designed to be used for a training signal, so maybe there is some uncertainty built in at inference. Does it return the same score on multiple runs if we change the seed?
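A quick way to check for seed-dependent randomness at inference, sketched here with a standard classifier-style RM (the beaver models go through a custom pipeline in rewardbench, so this is purely illustrative):

import torch
from transformers import pipeline

# Hypothetical check (not rewardbench code): if randomness were built into the
# model at inference, the two scores below would differ across seeds.
pipe = pipeline("text-classification",
                model="OpenAssistant/reward-model-deberta-v3-large-v2")
text = "Human: is the sky blue?\nAssistant: Yes, on a clear day it is."

scores = []
for seed in (0, 1234):
    torch.manual_seed(seed)
    scores.append(pipe(text, function_to_apply="none")[0]["score"])

print(scores)  # identical values rule out seed-dependent randomness at inference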

natolambert commented 1 month ago

Documenting my results:

rewardbench --model=Qwen/Qwen1.5-0.5B-Chat --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=1
2024-06-04 17:04:15 - INFO - rewardbench.rewardbench - Results: 0.49949238578680205, on 985 prompts

rewardbench --model=Qwen/Qwen1.5-0.5B-Chat --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=2
2024-06-04 17:10:13 - INFO - rewardbench.rewardbench - Results: 0.5553299492385787, on 985 prompts

rewardbench --model=Qwen/Qwen1.5-0.5B-Chat --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=2
2024-06-04 17:13:02 - INFO - rewardbench.rewardbench - Results: 0.5289340101522843, on 985 prompts

This may be an issue with accelerate.prepare(), see https://github.com/huggingface/accelerate/issues/2316. On bigger datasets the noise should go down; let me check. Either way, I can remove that line of code.

natolambert commented 1 month ago

Note this is a different model. I'm guessing this is related to padding, where different reward models handle pad tokens differently.

andrewsiah commented 1 month ago

This is for RLHFlow/RewardModel-Mistral-7B-for-DPA-v1

Running

  1. rewardbench --model=RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=1

  2. rewardbench --model=RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=2

  3. rewardbench --model=RLHFlow/RewardModel-Mistral-7B-for-DPA-v1 --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=raw --save_all --batch_size=3

The results are:

  1. 0.556
  2. 0.6233
  3. 0.6223 (different)

andrewsiah commented 1 month ago

This is for OpenAssistant/reward-model-deberta-v3-large-v2

Batch_size = [1,2,3]

Results are different as well.

@natolambert

andrewsiah commented 1 month ago

I think it might not be due to accelerate, because I wrapped your pipeline in our code without using accelerate (I wrote a custom multiprocess pipeline), and the reward difference is still there.

andrewsiah commented 1 month ago

Unless the transformers pipeline uses accelerate internally? I'm unsure.

natolambert commented 1 month ago

I think it's padding. I'm testing with

rewardbench --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --dataset=allenai/ultrafeedback_binarized_cleaned --split=test_gen --chat_template=oasst_pythia --save_all --batch_size=5

The changes (pushing to #138 )

        # padding experiments for determinism
        tokenizer.padding_side = "left"
        tokenizer.truncation_side = "left"

Which makes sense, as we already needed this to make models score correctly.

        # if using fastchat template (no template in tokenizer), make the RM tokenizer output an EOS token
        if not check_tokenizer_chat_template(tokenizer):
            reward_pipe.tokenizer.add_eos_token = True

Results with change:

Results without change:

Seems like that's not it, but trying on one of the models you just shared.

natolambert commented 1 month ago

Yeah, padding changes didn't help for --model=OpenAssistant/reward-model-deberta-v3-large-v2. I feel like we don't know even the most basic things about how to use RMs, which was part of the point of this project, but I wish I had dug into tokens earlier than this.

natolambert commented 1 month ago

I really think it is padding related, but it's hard to spin up a minimal example. When reward models are trained, they're getting tokens in a very different manner than we are handing them now. So we need to make sure that inference is invariant under the addition of pad tokens (it wouldn't be that surprising if it isn't).

natolambert commented 1 month ago

Trying the simple thing: "padding": False. Not sure what the solution is for DPO, but I'll handle that later ;)

natolambert commented 1 month ago

OpenAssistant/reward-model-deberta-v3-large-v2 no padding, with the correct chat template --chat_template=oasst_pythia

Better but not completely there?

natolambert commented 1 month ago

Since the models are trained in different ways, padding = False with truncation = left seems like the most sensible configuration, but there is still variance.

natolambert commented 1 month ago

Note: this happens both for models with a chat template in the tokenizer and for those using a fastchat template, so that's (most likely) not the cause.

natolambert commented 1 month ago

@ValentinaPy is checking how this impacts benchmark scores.

natolambert commented 1 month ago

Relevant to padding: AutoModelForSequenceClassification takes the score of the first padding index. Left padding shouldn't work then. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1376

Here are some examples of one reward model, no padding, with different batches.

ipdb> reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs)
[INFO|base.py:1170] 2024-06-04 19:06:03,405 >> Disabling tokenizer parallelism, we're using DataLoader multithreading already
[{'label': 'LABEL_0', 'score': -2.029296875}, {'label': 'LABEL_0', 'score': -2.755859375}, {'label': 'LABEL_0', 'score': -2.650390625}, {'label': 'LABEL_0', 'score': -2.76171875}, {'label': 'LABEL_0', 'score': -2.619140625}]
ipdb> reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
[{'label': 'LABEL_0', 'score': -1.998046875}]
ipdb> reward_pipe([batch["text_chosen"][0],]*5, **reward_pipeline_kwargs)
[{'label': 'LABEL_0', 'score': -1.9755859375}, {'label': 'LABEL_0', 'score': -2.712890625}, {'label': 'LABEL_0', 'score': -2.59375}, {'label': 'LABEL_0', 'score': -2.7421875}, {'label': 'LABEL_0', 'score': -2.568359375}]

The last result is particularly damning: different scores for the same input.

vwxyzjn commented 1 month ago

Could this be a dropout issue? You can try:

import torch

def disable_dropout_in_model(model: torch.nn.Module) -> None:
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0

andrewsiah commented 1 month ago

I did set model.eval(), which should disable dropout? Granted, I didn't test that separately.
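For what it's worth, here is a small sketch (not rewardbench code) to confirm that eval mode neutralizes dropout regardless of each module's p value:

import torch

def report_dropout(model: torch.nn.Module) -> None:
    # nn.Dropout is a no-op whenever module.training is False (i.e. after
    # model.eval()), even if p > 0; zeroing p as suggested above is an extra
    # belt-and-braces measure.
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Dropout):
            print(f"{name}: p={module.p}, training={module.training}")

# report_dropout(reward_model)  # expect training=False everywhere after .eval()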

natolambert commented 1 month ago

Update: We've learned that Deberta models do not work with batch sizes >1. We've confirmed that my new pipeline works deterministically for pythia RMs.

E.g. when passing in the same prompt 5 times.

GPTNeoXRewardModelOutput(logits=tensor([[-1.1133],
        [-1.1133],
        [-1.1133],
        [-1.1133],
        [-1.1133]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward0>))

And on Deberta

ipdb> reward_pipe([batch["text_chosen"][0],]*5, **reward_pipeline_kwargs)
tensor([[-1.9756],
        [-2.7129],
        [-2.5938],
        [-2.7422],
        [-2.5684]], device='cuda:0', dtype=torch.float16)

ValentinaPy commented 1 month ago

python scripts/run_rm.py --model=OpenAssistant/reward-model-deberta-v3-large-v2 --chat_template=oasst_pythia

BS1: {'Chat': 0.8938547486033519, 'Chat Hard': 0.4517543859649123, 'Safety': 0.7390471042471042, 'Reasoning': 0.3854968079882141}
BS5: {'Chat': 0.8603351955307262, 'Chat Hard': 0.5197368421052632, 'Safety': 0.7708249912249912, 'Reasoning': 0.3616146941670759}

so it also shows up with run_rm.py (but with deberta, ahahah)

natolambert commented 1 month ago

@andrewsiah the difference definitely comes from something in the batching.

I set things up like this with different configurations, and all of them give slightly different results with batch size > 1 than when run alone.

        logger.info("Default ===== ===== ===== ===== ===== ===== =====")
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

No configuration of padding/truncation I've used makes them identical. Will try a couple more models.

Some are pretty close, e.g. with the beaver reward model with padding to max length

2024-06-04 20:33:55 - INFO - rewardbench.rewardbench - Default pad max length ===== ===== ===== ===== ===== ===== =====
tensor([[-4.7305],
        [-1.3643],
        [-0.6235],
        [-2.7051],
        [-0.2380]], device='cuda:0')
idx 1 tensor([[-4.7344]], device='cuda:0')
idx 2 tensor([[-1.3662]], device='cuda:0')
idx 3 tensor([[-0.6377]], device='cuda:0')
2024-06-04 20:33:56 - INFO - rewardbench.rewardbench - DEBUGGING left padding/trunc ===== ===== ===== ===== ===== =====
tensor([[-4.7461],
        [-1.4004],
        [ 0.0995],
        [-2.8555],
        [-0.2358]], device='cuda:0')
idx 1 tensor([[-4.7461]], device='cuda:0')
idx 2 tensor([[-1.4219]], device='cuda:0')
idx 3 tensor([[0.0723]], device='cuda:0')

andrewsiah commented 1 month ago

thanks for working on this publicly!

I suspect it might be one of the configs we pass in model_kwargs, or something when initializing pipeline_builder.

One possible debugging idea is to use the same prompts across different batch sizes (i.e., what you did), then print the tokens (after the tokenizer) that are passed into the model (by adding a print somewhere in the pipeline code in our local copy of the library).

That would isolate whether the difference happens before model.forward (the tokenizer) or after model.forward (the model setup/config).
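A rough sketch of that isolation step, assuming a stock Hugging Face tokenizer (the texts and model here are placeholders, not the rewardbench pipeline internals):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/reward-model-deberta-v3-large-v2")
texts = [
    "Human: hi\nAssistant: hello!",
    "Human: hi\nAssistant: hello, how can I help you today?",
]

# Tokenize as a batch (padded) and one at a time (unpadded), then diff what
# would actually reach model.forward().
batched = tokenizer(texts, padding=True, return_tensors="pt")
singles = [tokenizer(t, return_tensors="pt") for t in texts]

for i, single in enumerate(singles):
    print("batched:", batched["input_ids"][i].tolist())
    print("single: ", single["input_ids"][0].tolist())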

natolambert commented 1 month ago

@andrewsiah I don't think it's the config, because this happens on multiple types of models and I'm hardcoding things (and have checked within the pipeline a bit), but yes, I'm looking at the tokenizer now.

natolambert commented 1 month ago

The tokenizer is a promising lead for where the issue is. The tokens of a heterogeneous batch look like this:

tensor([[    2,     2,     2,  ...,   873, 28723,     2],
        [    2,     2,     2,  ..., 23798, 28723,     2],
        [    1,     1,   733,  ..., 21824,  5020,   970],
        [    1,     1,   733,  ...,   272, 10725,   473],
        [    2,     2,     2,  ..., 22447, 28723,     2]], device='cuda:0')

2 is the padding token (which is also the EOS token id). All sequences should have one at the end of the sequence to predict the reward from (this is with --model=RLHFlow/RewardModel-Mistral-7B-for-DPA-v1).

1 is the BOS token, which every sequence should have.

28723 is ., which makes sense.

Examples 3 and 4 seem to be weirdly truncated to 512 tokens. Checking. The examples are:

ipdb> batch["text_chosen"][3]
'<s>[INST] Can you summarize the discussion at the Goat and Compasses about Neil Rollinson\'s poems from the LRB, and the controversy surrounding how the heroine resurrects the hero\'s "thing"?: between her fingers. He groans.\nswells in the moist blowhole.\nand again from a premature end.\nof maggots. She’s got him now where she wants him.\nhim off with a long sponge of her tongue.\non his thigh, he thinks that’s it.\nwith a pump on the end.\nlike the stone of a sharon fruit.\nthe yolk with a tap of your finger.\nto feel the sticky glair on my tongue.\nuntil you’re clean, and tense as a clock spring.\nexplodes like the blowing of birds’ eggs.\nof a different room in your clothes.\nwhich I rehydrate with my tongue.\nrough with a smear of salt.\nThat was one hell of a shot.\nlike a lover sharing our bed.\nTwice monthly it is my practice to read snippets from the LRB during happy hour at my local, the Goat and Compasses. Neil Rollinson’s poems (LRB, 20 July) caused considerable discussion way into normal drinking time. Wilf, over in the snug, dismissed the lot as Page-Three titillation under the guise of intellectualism. Thelma, along with nearly all the lads at the bar, critiqued Rollinson’s obfuscation. Poetics carry an obligation towards pragmatics. By what means, I’m told to ask, does the heroine, using a. rope, b. silk thread and c. a plastic tube with a pump on the end, resurrect the hero’s ‘thing’? Harry said Pitt-Kethley would never permit such anatomically-impossible deception to grace her work. Short-shirt-sleeve Sam claimed it had nothing to do with what we were all thinking – it was really an allegorical commentary on John Major’s reelection. I’m neutral on all this. [/INST]It seems like the discussion at the Goat and Compasses revolved around Neil Rollinson\'s poems in the London Review of Books and the enigmatic description of how the heroine resurrects the hero\'s "thing" using a rope, silk thread, and a plastic tube with a pump on the end. Some people thought the poems were mere gratuitous erotica, while others interpreted them as an allegory for John Major\'s reelection.\nOverall, it seems like the conversation was pretty lively and divisive, with different people having different interpretations of the poems and their purpose. As an AI assistant, I amneutral and do not have personal opinions or beliefs.</s>'
ipdb> batch["text_chosen"][4]
"<s>[INST] is it normal to have a fear of flying? [/INST]Yes, it is normal to experience some degree of fear or anxiety when flying. Many people experience some level of fear or discomfort when flying, and it is estimated that up to 25% of people experience aviophobia or aerophobia, which is an extreme fear of flying.\n\nThere are several factors that can contribute to a fear of flying, including:\n\n* Unfamiliarity with flying and the sensations associated with it\n* Fear of crashing or losing control of the plane\n* Fear of heights\n* Fear of enclosed spaces\n* Previous traumatic experiences on flights\n* Media portrayals of plane crashes and accidents\n\nIt's important to remember that flying is one of the safest forms of transportation, and the chances of something going wrong are extremely low. However, for those who experience a fear of flying, it can be a significant source of distress and disruption to daily life.\n\nIf you are experiencing a fear of flying, there are several strategies and techniques that may help you feel more comfortable and secure during flights. These include:\n\n* Learning about the safety features and mechanisms of airplanes\n* Practicing relaxation techniques, such as deep breathing and visualization\n* Using positive self-talk and affirmations to counteract negative thoughts and feelings\n* Seeking support from a therapist or mental health professional\n* Gradually exposing yourself to flying in a controlled and gradual manner\n\nIt's also important to remember that it's okay to ask for help and support if you are experiencing a fear of flying. Many airlines and airports offer resources and accommodations for passengers with anxiety or fear, such as pre-flight tours of the airplane, access to relaxation techniques and materials, and the option to change your flight or travel plans if needed.\n\nRemember, it's important to be honest with yourself and your healthcare provider about your fears and concerns, and to seek help and support if you need it. With the right resources and techniques, it is possible to overcome a fear of flying and feel more comfortable and secure during flights.</s>"

andrewsiah commented 1 month ago

Ahh yeah, in our pipeline we also increased max_token_length to a higher number, because some of the (prompt, response) pairs sum to more than 512 tokens. Otherwise the reward would be computed on a truncated answer, which isn't what we intend, right?

natolambert commented 1 month ago

@andrewsiah with truncation=False the tokenizer output looks right. Making that the default should at least help.

andrewsiah commented 1 month ago

Ahh, and the truncation setting gets passed on to the reward model? I guess the intention is good, i.e. bad reward models have a short max token length.

natolambert commented 1 month ago

the truncation only ever applies to the tokenizer for things like this. Models will error if given tokens of incorrect size.

natolambert commented 1 month ago

Lol, bad news: changing the padding of a single input, even with a correct attention mask, changes the outputs. I took one example and ran it through the model. Then I padded it manually with 5 pad tokens and extended the attention mask accordingly. The outputs are different.
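A sketch of that experiment with a generic sequence-classification RM (the model name and text are placeholders; the point is only the shape of the test):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("Human: hi\nAssistant: hello!", return_tensors="pt")

# Manually append 5 pad tokens and mask them out in the attention mask.
pad = torch.full((1, 5), tok.pad_token_id, dtype=enc["input_ids"].dtype)
padded_ids = torch.cat([enc["input_ids"], pad], dim=-1)
padded_mask = torch.cat(
    [enc["attention_mask"], torch.zeros((1, 5), dtype=enc["attention_mask"].dtype)], dim=-1
)

with torch.no_grad():
    score = model(**enc).logits
    score_padded = model(input_ids=padded_ids, attention_mask=padded_mask).logits

print(score, score_padded)  # in principle identical; in practice they can drift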

andrewsiah commented 1 month ago

Quote: the truncation only ever applies to the tokenizer for things like this. Models will error if given tokens of incorrect size.

Wouldn't setting truncation=False then cause the model to error out if one of the rows is longer than the model's max token length?

natolambert commented 1 month ago

For many of these models, model.config.num_labels > 1, which seems wrong. Not sure what to do about it.

Re: truncation, maybe; we should try it @andrewsiah. It's much more deterministic.
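A trivial way to check the num_labels question for any RM on the Hub (the model name is just an example):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("OpenAssistant/reward-model-deberta-v3-large-v2")
# num_labels > 1 means the classification head emits multiple logits rather than
# a single scalar reward, which doesn't fit the chosen-vs-rejected comparison.
print(cfg.num_labels)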

natolambert commented 1 month ago

Also @andrewsiah, the RLHFlow RM you sent can't be applied in this case. It has 10 classes and outputs 10 logits, which isn't suited to the RewardBench use case. Not sure how the averaging was happening, but that one seems out of scope.

natolambert commented 1 month ago

Potentially related: https://github.com/huggingface/transformers/issues/2401, https://github.com/huggingface/transformers/issues/25921, https://github.com/huggingface/transformers/issues/31267

Quote

We are aware of this phenomenon on all (or nearly all) models that contain rotary position embeddings (Llama, Llama2, Falcon, GPTNeoX, ...). Running things in fp32 helps avoid this problem, but that is far from a good solution.

I tested FP32 and FP16 with this model, https://huggingface.co/weqweasdas/RM-Mistral-7B, which has the correct configuration, and the logit differences are pretty minor (~1%). That's not great, but at this level of investigation it seems like a fundamental implementation problem with most reward models, not something specific to RewardBench.

Truncation and tokenization aside, there may be minor issues there, but I don't think they're the root cause we are looking at. I would put truncation in a separate issue from the "numerical weirdness."
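For reference, a rough sketch of that fp16-vs-fp32 comparison (single prompt, run on a GPU with enough memory for a 7B model in fp32; the exact drift depends on hardware and kernels):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "weqweasdas/RM-Mistral-7B"
tok = AutoTokenizer.from_pretrained(name)
enc = tok("[INST] is it normal to have a fear of flying? [/INST] Yes, some anxiety is common.",
          return_tensors="pt").to("cuda")

scores = {}
for dtype in (torch.float32, torch.float16):
    model = AutoModelForSequenceClassification.from_pretrained(name, torch_dtype=dtype).to("cuda").eval()
    with torch.no_grad():
        scores[str(dtype)] = model(**enc).logits.float().item()
    del model
    torch.cuda.empty_cache()

print(scores)  # differences on the order of ~1% were observed above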

natolambert commented 1 month ago

My debugging code:

        logger.info("Default ===== ===== ===== ===== ===== ===== =====")
        batch_results, inputs = reward_pipe(batch["text_chosen"], return_inputs=True, **reward_pipeline_kwargs)
        print(batch_results)
        out_1 = reward_pipe([batch["text_chosen"][0]], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

        # print(inputs)
        print(reward_pipe.tokenizer(
            batch["text_chosen"][0],
            truncation=True,
            max_length=2048,
            padding=True,
            return_tensors="pt",
        ).to("cuda"))
        print(reward_pipe.tokenizer(
            batch["text_chosen"][1],
            truncation=True,
            max_length=2048,
            padding=True,
            return_tensors="pt",
        ).to("cuda"))
        # print(reward_pipe.tokenizer(
        #     batch["text_chosen"][2],
        #     truncation=True,
        #     max_length=2048,
        #     padding=True,
        #     return_tensors="pt",
        # ).to("cuda"))
        # print(reward_pipe.tokenizer(
        #     batch["text_chosen"][3],
        #     truncation=True,
        #     max_length=2048,
        #     padding=True,
        #     return_tensors="pt",
        # ).to("cuda"))

        import ipdb; ipdb.set_trace()

        logger.info("Default - seed? ===== ===== ===== ===== ===== ===== =====")
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

        logger.info("Default pad max length ===== ===== ===== ===== ===== ===== =====")
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": True,
            "padding": 'max_length',
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        out_4 = reward_pipe(batch["text_chosen"][3], **reward_pipeline_kwargs)
        out_5 = reward_pipe(batch["text_chosen"][4], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")
        print(f"idx 4 {out_4}")
        print(f"idx 5 {out_5}")

        logger.info("DEBUGGING left padding/trunc max length ===== ===== ===== ===== ===== =====")
        reward_pipe.tokenizer.padding_side = "left"
        reward_pipe.tokenizer.truncation_side = "left"
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": True,
            "padding": 'max_length',
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        out_4 = reward_pipe(batch["text_chosen"][3], **reward_pipeline_kwargs)
        out_5 = reward_pipe(batch["text_chosen"][4], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")
        print(f"idx 4 {out_4}")
        print(f"idx 5 {out_5}")

        logger.info("DEBUGGING left padding/trunc ===== ===== ===== ===== ===== =====")
        reward_pipe.tokenizer.padding_side = "left"
        reward_pipe.tokenizer.truncation_side = "left"
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": True,
            "padding": True,
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

        logger.info("DEBUGGING left padding/trunc / no add eos ===== ===== ===== ===== ===== =====")
        reward_pipe.tokenizer.padding_side = "left"
        reward_pipe.tokenizer.truncation_side = "left"
        reward_pipe.tokenizer.add_eos_token = False
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": True,
            "padding": True,
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

        logger.info("DEBUGGING right padding/trunc / no add eos ===== ===== ===== ===== ===== =====")
        reward_pipe.tokenizer.padding_side = "right"
        reward_pipe.tokenizer.truncation_side = "right"
        reward_pipe.tokenizer.add_eos_token = False
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": True,
            "padding": True,
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

        logger.info("DEBUGGING no trunc ===== ===== ===== ===== ===== =====")
        reward_pipeline_kwargs = {
            "batch_size": args.batch_size,  # eval_args.inference_batch_size,
            "truncation": False,
            "padding": True,
            "max_length": args.max_length,
            "function_to_apply": "none",  # Compute raw logits
            "return_token_type_ids": False,
        }
        print(reward_pipe(batch["text_chosen"], **reward_pipeline_kwargs))
        out_1 = reward_pipe(batch["text_chosen"][0], **reward_pipeline_kwargs)
        out_2 = reward_pipe(batch["text_chosen"][1], **reward_pipeline_kwargs)
        out_3 = reward_pipe(batch["text_chosen"][2], **reward_pipeline_kwargs)
        print(f"idx 1 {out_1}")
        print(f"idx 2 {out_2}")
        print(f"idx 3 {out_3}")

andrewsiah commented 1 month ago

Thanks for sharing the above. Is your finding that the fp32/fp16 difference is tied to the batch-size issue? Or that the output score changes with precision, but that isn't related?

natolambert commented 1 month ago

@andrewsiah I think it's all related to the underlying handling of compute plus the quirks of positional embeddings. For a while we had to maintain our own transformers fork for open-instruct, but we decided it was a losing battle. Click through some of the issues I've linked.

Better RMs seem to have less variance.

andrewsiah commented 1 month ago

Ahh, thanks for that. That's quite interesting.

natolambert commented 4 weeks ago

Seems like this is expected behavior. I wrote a long deep dive on what we tested and found: https://www.interconnects.ai/p/reward-bench-reproducibility

kangqiyue commented 1 week ago

Quote (from the earlier comment above):

Relevant to padding: AutoModelForSequenceClassification takes the score of the first padding index. Left padding shouldn't work then. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1376

Hello, I have a question from training a reward model. In this code from transformers:

https://github.com/huggingface/transformers/blob/b6c9f47fd6f911450024c52e382e544e5d04387a/src/transformers/models/llama/modeling_llama.py#L1372

the code takes the first pad_token (or eos_token, if they are the same). However, when we set tokenizer.padding_side="left", the tokenizer pads every sentence with pad_token from the left, so I guess the padded input_ids look something like:

[PAD, PAD, BOS, 1, 2, ..., EOS]

I tested the line sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1

and it finds the position of the first PAD token. However, when we use batch_size=2 and left padding in reward model training, sequence_lengths looks wrong, and therefore the scored position is wrong. Then the training fails? Am I missing something or getting something wrong?

In the PPOv2 trainer from trl, the tokenizer is set to left padding: https://github.com/huggingface/trl/blob/b68ff96f0c74368961e194081e122959cd1f4d4d/examples/scripts/ppo/ppo_tldr.py#L57

kangqiyue commented 1 week ago

I got it. If the first token is PAD, then argmax will return 0, so the position will be -1, i.e. the last token. Brilliant code!
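A minimal sketch of that arithmetic (made-up token ids; the modulo mirrors newer transformers versions, while older versions simply rely on negative indexing):

import torch

pad_token_id = 32000  # made-up pad id, distinct from the "real" tokens below

right_padded = torch.tensor([[1, 15, 27, 9, pad_token_id, pad_token_id]])
left_padded = torch.tensor([[pad_token_id, pad_token_id, 1, 15, 27, 9]])

def score_position(input_ids):
    # Position of the first pad token, minus one -- the same arithmetic as the
    # linked sequence-classification head.
    idx = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
    return idx % input_ids.shape[-1]

print(score_position(right_padded))  # tensor([3]): the last real token
print(score_position(left_padded))   # tensor([5]): argmax hits index 0, so -1 wraps to the last position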

natolambert commented 1 week ago

Quote: I got it. If the first token is PAD, then argmax will return 0, so the position will be -1, i.e. the last token. Brilliant code!

Ah! Yeah. Cool :)