FreedomIntelligence / OVM


One question about inference #7

Closed xiaolizh1 closed 3 months ago

xiaolizh1 commented 7 months ago

May I ask how to control the generator to generate only one step each time during the inference phase of the GSM8K experiment.

OakYU commented 6 months ago

What do you mean by one-step generation each time? Inference is conducted by generating steps one by one. If you want to obtain only the first step, you can set `max_n_step` to 1.

xiaolizh1 commented 6 months ago

> what do you mean by only one-step generation each time? The inference is conducted by generating steps one by one. If you want to obtain only the first step, you can set the max_n_step as 1

Sorry, my question may not have been clear. What I mean is: is it necessary to use a prompt to control the model's output, and what criterion is used to divide the output into steps?

OakYU commented 6 months ago

In our experiments, we split steps using the newline `\n` as the separator, on both GSM8K and Game of 24. Since we first fine-tune the generator on the training set, the model produces reasoning paths with `\n` at the end of each sentence. Then, during inference, we stop generation every time we encounter a `\n`, which gives stepwise generation. If needed, you can also generalize this to a prompt-based setting (without fine-tuning) by providing exemplars that use `\n` to separate steps.
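A minimal sketch of the stepwise loop described above: generate until a `\n`, treat the text before it as one step, and repeat up to `max_n_step` times. The function name `generate_stepwise`, the `step_fn` callback, and the `answer_marker` heuristic are illustrative assumptions, not the repo's actual API; in practice `step_fn` would wrap a model call that stops at the newline token.

```python
def generate_stepwise(step_fn, prompt, max_n_step=10, answer_marker="The answer is"):
    """Build a reasoning path one step at a time.

    step_fn(text) is assumed to return the model's continuation of `text`;
    we cut it at the first newline so each call contributes exactly one step,
    mirroring the use of "\n" as the step separator.
    """
    path = prompt
    for _ in range(max_n_step):
        out = step_fn(path)
        step = out.split("\n", 1)[0] + "\n"   # keep text up to the separator
        path += step
        if answer_marker in step:             # final step reached, stop early
            break
    return path

# Demo with a stub "model": each call returns a continuation, truncated
# at the first newline to yield exactly one step.
_steps = iter(["Step 1: 2 + 3 = 5\n(overflow text)", "The answer is 5\n"])
print(generate_stepwise(lambda _: next(_steps), "Q: What is 2 + 3?\n", max_n_step=5))
```

With a real model, the same effect is usually achieved by adding the newline token as a stop criterion so the model never generates past the step boundary.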

xiaolizh1 commented 6 months ago

> In our experiments, we split the steps using the newline \n as the separator, on both GSM8K and Game of 24. Since we first fine-tune the generator on the training set, the model can produce reasoning paths with \n at the end of each sentence. Then during inference, we detect the \n and stop the generation every time we meet a \n for stepwise generation. If needed, you can also generalize it to prompt-based setting (without fine-tuning) by providing exemplars with \n to separate steps

Okay, thank you for your patient answer

xiaolizh1 commented 6 months ago

By the way, have you evaluated the performance of the model after SFT?

OakYU commented 6 months ago

You can see Table 4 in the latest version of our paper: the greedy-decoding performance of the fine-tuned Llama2-7B and Mistral-7B on GSM8K is 38.6% and 58.4%, respectively.

xiaolizh1 commented 6 months ago

Sorry to bother you, but may I ask what the verifier classification accuracy in your paper is?

OakYU commented 6 months ago

We don't report verifier classification accuracy in the paper; we only use this metric when tuning the value model's training hyperparameters.

xiaolizh1 commented 6 months ago

Would it be convenient to disclose the final verifier classification accuracy?

OakYU commented 6 months ago

You can refer to the results in the files under `eval_results` in this repo, e.g. https://github.com/FreedomIntelligence/OVM/blob/main/eval_results/gsm8k/verifier/test/metrics_v(mistral7b-ep2-n100-scahead-mse-lm-token)_g(mistral7b-ep2).json

Specifically, on GSM8K: for Mistral-7B, 90.71% over 300 solutions per problem on the test set; for Llama2-7B, 90.66% over 500 solutions per problem.

On Game of 24: 99.93% over 300 solutions per problem, for Llama2-7B.

It's more reasonable to refer to mp1 than to classification accuracy. You can find those metrics in the released result files if you are interested.
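For intuition, classification accuracy here can be sketched as the fraction of sampled solutions whose thresholded verifier score matches the ground-truth correctness label. The 0.5 threshold and the exact function below are assumptions for illustration, not the repo's actual evaluation code.

```python
def classification_accuracy(scores, labels, threshold=0.5):
    """Fraction of solutions where the thresholded verifier score
    (score >= threshold -> predicted correct) agrees with the
    ground-truth correctness label (1 = correct, 0 = incorrect)."""
    preds = [s >= threshold for s in scores]
    matches = sum(p == bool(l) for p, l in zip(preds, labels))
    return matches / len(labels)

# Example: 4 sampled solutions, one misclassified -> accuracy 0.75
print(classification_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))  # 0.75
```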

xiaolizh1 commented 6 months ago

Okay, thank you very much