May I ask how to control the generator to generate only one step at a time during the inference phase of the GSM8K experiment?
What do you mean by only one-step generation each time? The inference is conducted by generating steps one by one. If you want to obtain only the first step, you can set `max_n_step` to 1.
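For illustration, here is a minimal sketch of how such a cap limits stepwise decoding. `max_n_step` is the repo's parameter name, but the loop and the `sample_next_step` callable are hypothetical stand-ins, not OVM's actual inference code:

```python
# Illustrative sketch only, not OVM's actual inference loop.
# "sample_next_step" is a hypothetical callable that generates tokens
# until the next "\n" and returns that single step as a string.
def generate_steps(question, sample_next_step, max_n_step=1):
    path = question + "\n"
    steps = []
    for _ in range(max_n_step):        # max_n_step=1 -> only the first step
        step = sample_next_step(path)  # one newline-terminated step
        steps.append(step)
        path += step
    return steps
```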
Sorry, my phrasing may not have been clear. What I mean is: is it necessary to use a prompt to control the model's output, and what criterion should be used to delimit a step?
In our experiments, we split the steps using the newline `\n` as the separator, on both GSM8K and Game of 24. Since we first fine-tune the generator on the training set, the model can produce reasoning paths with `\n` at the end of each sentence. Then, during inference, we detect the `\n` and stop the generation every time we meet one, for stepwise generation. If needed, you can also generalize this to a prompt-based setting (without fine-tuning) by providing exemplars with `\n` to separate steps.
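As an illustration, here is a minimal sketch of newline-delimited stepwise decoding with Hugging Face `transformers`. The model name, the stop-on-`\n` criterion, and the generation settings are assumptions for the example; OVM's actual inference code may differ:

```python
# Minimal sketch: stop generation at each "\n" to obtain one step at a time.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class StopOnNewline(StoppingCriteria):
    """Stop as soon as the newly generated text contains a newline."""
    def __init__(self, tokenizer, prompt_len):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return "\n" in new_text

# Example model choice; substitute the fine-tuned generator checkpoint.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

def generate_one_step(prompt: str) -> str:
    """Generate a single reasoning step, ending at the first newline."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        stopping_criteria=StoppingCriteriaList(
            [StopOnNewline(tokenizer, prompt_len)]),
    )
    # Return only the newly generated step, excluding the prompt.
    return tokenizer.decode(output[0, prompt_len:], skip_special_tokens=True)
```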
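For the prompt-based setting without fine-tuning, a few-shot prompt could look like the following. The exemplar content is made up purely for illustration; the point is that each step ends with `\n`, so the model imitates the format:

```python
# Hypothetical few-shot exemplar for the prompt-based setting: each step
# ends with "\n", so the model terminates its own steps the same way.
FEW_SHOT_PROMPT = (
    "Question: Tom has 3 apples and buys 2 more. How many apples does he have?\n"
    "Step 1: Tom starts with 3 apples.\n"
    "Step 2: He buys 2 more, so 3 + 2 = 5.\n"
    "Answer: 5\n"
    "\n"
    "Question: {question}\n"
)
```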
Okay, thank you for your patient answer
By the way, have you evaluated the performance of the model after SFT?
You can see Table 4 in the latest version of our paper: the greedy decoding performance of the fine-tuned Llama2-7B and Mistral-7B on GSM8K is 38.6% and 58.4%, respectively.
Sorry to bother you again, but may I ask what the verifier classification accuracy in your paper is?
We don't report the verifier classification accuracy in the paper; we only refer to this metric when tuning the value model's training hyperparameters.
Would it be convenient to disclose the final verifier classification accuracy?
You can refer to the results in the files under `eval_results` in this repo, e.g. https://github.com/FreedomIntelligence/OVM/blob/main/eval_results/gsm8k/verifier/test/metrics_v(mistral7b-ep2-n100-scahead-mse-lm-token)_g(mistral7b-ep2).json
Specifically, on GSM8K: for Mistral-7B, 90.71% over 300 solutions per problem on the test set; for Llama2-7B, 90.66% over 500 solutions per problem. On Game of 24: 99.93% over 300 solutions per problem, for Llama2-7B.
It's more reasonable to refer to `mp1` than to the classification accuracy; you can find those metrics in the released result files if you are interested.
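For reference, a small snippet to inspect one of those released metrics files. The path comes from the link above; the JSON field names are not documented in this thread, so the snippet simply prints every key rather than assuming them:

```python
# Inspect a released metrics file from the repo (path taken from the thread).
import json

path = ("eval_results/gsm8k/verifier/test/"
        "metrics_v(mistral7b-ep2-n100-scahead-mse-lm-token)_g(mistral7b-ep2).json")
with open(path) as f:
    metrics = json.load(f)

for key, value in metrics.items():
    print(key, value)
```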
Okay, thank you very much