allenai / open-instruct

Apache License 2.0

Stop GSM8k generation at double new line #147

Closed OyvindTafjord closed 5 months ago

OyvindTafjord commented 5 months ago

This loosens the stopping criteria for GSM8k generation, to allow for a single newline which is sometimes generated at the start of an output, but stops at a double newline (as compatible with the prompt).

It's a bit hacky, but should work for tokenizers where "\n\n" is either one or two tokens. Often "\n\n" is indeed a single token, so this should actually be more effective than the previous "\n" stopping condition if the model actually does generate a double newline. FYI @dwadden
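The one-or-two-token case described above can be sketched roughly like this (a minimal illustration with hypothetical helper names, not the PR's actual code; the real change lives in the eval harness's stopping criteria):

```python
def build_stop_sequences(tokenizer, stop_string="\n\n"):
    """Collect candidate token-id sequences for the stop string.

    Some tokenizers encode "\n\n" as a single token, others as two
    consecutive "\n" tokens, so both encodings are checked.
    """
    candidates = []
    # Encoding of the full stop string (may be a single token).
    candidates.append(tokenizer.encode(stop_string, add_special_tokens=False))
    # Two single-newline tokens in a row.
    single = tokenizer.encode("\n", add_special_tokens=False)
    candidates.append(single + single)
    return candidates


def should_stop(generated_ids, stop_sequences):
    """True if the generated ids end with any of the stop sequences."""
    return any(
        len(generated_ids) >= len(seq) and generated_ids[-len(seq):] == seq
        for seq in stop_sequences
    )
```

Because the check is a suffix match on token ids, a single leading "\n" in the output no longer triggers a stop; only a "\n\n" (in either encoding) does.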

hamishivi commented 5 months ago

I'm a little hesitant to merge this without checking if it changes our old evaluations. I wonder if a good solution would be to make the stop sequence configurable to avoid this issue. @yizhongw thoughts?

OyvindTafjord commented 5 months ago

Yeah, I agree the backwards compatibility is potentially awkward. We could put it behind a flag which would preserve earlier behavior. I'm not sure how strict you feel about not changing earlier evals. E.g., I noticed a typo in line 56 ("Quesion: " + example["question"]), is that something you also wouldn't want to change now?

hamishivi commented 5 months ago

Yeah, I think in general this is an awkward thing. I don't want to make changes while papers are in motion, to avoid having to re-compute old evals, but it's also worth tracking and updating these things without accumulating a large number of flags. Really, coming up with some sort of 'release process'/versioning setup for this would be nice...

dwadden commented 5 months ago

I need this PR for evals I'm running now. Instead of changing the existing gsm task, could we just add another one that allows newlines?

OyvindTafjord commented 5 months ago

I added an option now so that this shouldn't change existing behavior unless --stop_at_double_newline is added to the run_eval call. E.g., in submit_eval_jobs.py you would do this to get the new behavior:

    elif experiment_group == "gsm_cot":
        task_spec['arguments'][0] = '''
            python -m eval.gsm.run_eval \
            --data_dir /data/gsm/ \
            --max_num_examples 200 \
            --save_dir /output/ \
            --use_vllm \
            --model_name_or_path /model \
            --tokenizer_name_or_path /model \
            --n_shot 8 \
            --use_chat_format \
            --chat_formatting_function eval.templates.create_prompt_with_tulu_chat_format \
            --stop_at_double_newline
        ''' 
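Wiring an opt-in flag like this is straightforward; a minimal sketch of how the option could toggle the stop condition inside run_eval (the flag name is from the PR, the surrounding argparse setup is assumed):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--stop_at_double_newline",
    action="store_true",
    help="Stop generation at '\\n\\n' instead of '\\n' (new behavior).",
)


def pick_stop_string(argv):
    """Return the stop string implied by the CLI args.

    Defaults to the old single-newline behavior when the flag is absent,
    so existing eval runs are unaffected.
    """
    args = parser.parse_args(argv)
    return "\n\n" if args.stop_at_double_newline else "\n"
```

Since `store_true` defaults to False, omitting the flag preserves backwards compatibility, which was the concern raised above.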
hamishivi commented 5 months ago

Thanks, that's fine! Probably not an ideal solution for all scenarios, but it seems like a decent stopgap for now.

dwadden commented 5 months ago

Perfect, thanks! I'll rerun evals with this.