Add `pad_token_id` from tokenizer to model config.

hank0316 commented 2 months ago

Resolves #115

Add `pad_token_id` to the model config for models whose config does not contain `pad_token_id`, e.g. TinyLlama.
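For reference, the idea can be sketched roughly as below. This is only an illustration, not the actual diff: `fill_missing_pad_token_id` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for a real model config and tokenizer, with made-up token ids.

```python
from types import SimpleNamespace


def fill_missing_pad_token_id(model_config, tokenizer):
    """Copy the tokenizer's pad_token_id into the config only if the config has none.

    Hypothetical helper for illustration; the real scripts may structure this differently.
    """
    if getattr(model_config, "pad_token_id", None) is None:
        model_config.pad_token_id = tokenizer.pad_token_id
    return model_config


# Stand-ins for a real config/tokenizer pair (the ids are made up).
tokenizer = SimpleNamespace(pad_token_id=2)
config_without_pad = SimpleNamespace(pad_token_id=None)
config_with_pad = SimpleNamespace(pad_token_id=0)

fill_missing_pad_token_id(config_without_pad, tokenizer)
fill_missing_pad_token_id(config_with_pad, tokenizer)

print(config_without_pad.pad_token_id)  # 2: filled in from the tokenizer
print(config_with_pad.pad_token_id)     # 0: existing value left untouched
```

Because the check only fires when the config value is `None`, a config that already defines `pad_token_id` should be unaffected in this sketch.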

natolambert commented 2 months ago

Hey @hank0316 -- looks good. Two quick things:

Did you check that this doesn't break other models? I'm not sure whether it needs to be an `elif` or an `if`.
Can you add the same code to this script too? https://github.com/allenai/reward-bench/blob/5cd2fe67962cb848e3db0f67b380540465169f06/scripts/run_bon.py#L171
Maybe add a comment as to why we did this?

Regardless, should be pretty simple.

hank0316 commented 2 months ago

@natolambert Thanks for the guidance. Here's the update:

scripts/run_rm.py:

I added comments explaining the change.
I changed `elif` to `if`, since I think `if` is more reasonable.
I ran `python scripts/run_rm.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia`. The modification seems to be okay, but the run failed with a CUDA OOM error during inference (I only have one 32GB V100).

scripts/run_bon.py:

I modified the script and added comments.

I tried to test it with

python scripts/run_bon.py --model=OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5 --chat_template=oasst_pythia --best_of=8 --debug

but it failed with this error:

Traceback (most recent call last):
  File "/home/hank0316/reward-bench/scripts/run_bon.py", line 324, in <module>
    main()
  File "/home/hank0316/reward-bench/scripts/run_bon.py", line 124, in main
    dataset = load_bon_dataset(
  File "/home/hank0316/reward-bench/rewardbench/utils.py", line 270, in load_bon_dataset
    alpaca_eval = load_dataset("ai2-adapt-dev/HERM_BoN_candidates", "alpaca_eval")
  File "/home/hank0316/.local/lib/python3.10/site-packages/datasets/load.py", line 2587, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/hank0316/.local/lib/python3.10/site-packages/datasets/load.py", line 2259, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/hank0316/.local/lib/python3.10/site-packages/datasets/load.py", line 1904, in dataset_module_factory
    raise e1 from None
  File "/home/hank0316/.local/lib/python3.10/site-packages/datasets/load.py", line 1846, in dataset_module_factory
    raise DatasetNotFoundError(msg + f" at revision '{revision}'" if revision else msg)
datasets.exceptions.DatasetNotFoundError: Dataset 'ai2-adapt-dev/HERM_BoN_candidates' doesn't exist on the Hub or cannot be accessed

Add `pad_token_id` from tokenizer to model config. #117