MARIO-Math-Reasoning / Super_MARIO


value estimation twice? #11

Closed · platoonpluto closed this issue 5 months ago

platoonpluto commented 6 months ago

Hi @lovecambi, is there redundant value estimation during MCTS?

During the first LLM generation call, the LLM generates a response and its corresponding value estimate.

During the second LLM generation call, the LLM generates a response (only one token?) and its corresponding value.

Is there anything wrong?

lovecambi commented 6 months ago

In our paper, we use the last token of the current step to predict the value. To obtain that last token, we

  1. run the LLM to generate the text + code;
  2. run the code in the interpreter to obtain the observation, and append the observation to the text + code to form the entire current step;
  3. run the LLM to predict the value (you are correct, only one token); a rough sketch of this two-call pattern follows.

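A minimal sketch of that two-call pattern, assuming the vLLM generation interface; the model name, the `run_in_interpreter` stub, and the value-head hook are placeholders, not the repository's actual code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="policy-value-model")  # placeholder model name


def run_in_interpreter(code: str) -> str:
    # Placeholder for the Python-interpreter tool call; returns its printed output.
    return ""


def expand_step(state: str) -> str:
    # Call 1: generate the next step's text + code.
    gen_params = SamplingParams(temperature=0.7, max_tokens=512)
    text_and_code = llm.generate([state], gen_params)[0].outputs[0].text

    # Run the code, then append the observation to form the entire current step.
    observation = run_in_interpreter(text_and_code)
    full_step = state + text_and_code + observation

    # Call 2: a one-token generation whose only purpose is to run the model over
    # the completed step; the sampled token is discarded, and the value would be
    # read from the model's value head (that modification is not shown here).
    value_params = SamplingParams(temperature=0.0, max_tokens=1)
    _ = llm.generate([full_step], value_params)
    return full_step
```
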
Chen-GX commented 6 months ago

In fact, the value is an evaluation of the current input $s_t$. Therefore, the second call merely assesses the value of the state after the entire reasoning step, i.e., $s_t + a_t$. Hence, for the second call, we set the number of generated tokens to 1, because our goal is just to estimate the value of $s_t + a_t$, not to generate a new reasoning step.

platoonpluto commented 6 months ago

I mean, you don't need to generate a token at all; a single forward pass over $s_t + a_t$ could return the value estimate as well. BTW, the one token generated by the second LLM call is not used, right?
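A minimal sketch of that single-forward-pass alternative, assuming a Hugging Face causal LM; the model name and the untrained `value_head` here are placeholders for whatever value head the policy model is actually trained with:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name; in practice this would be the SFT policy/value model.
tokenizer = AutoTokenizer.from_pretrained("policy-value-model")
model = AutoModelForCausalLM.from_pretrained("policy-value-model")
value_head = torch.nn.Linear(model.config.hidden_size, 1)  # assumed value head


def estimate_value(state_and_action: str) -> float:
    # Predict the value from the last token of s_t + a_t; no token is generated.
    inputs = tokenizer(state_and_action, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][:, -1, :]  # last layer, last position
    return value_head(last_hidden).item()
```
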

Chen-GX commented 6 months ago

You are right, the token produced by the second LLM call is not used; we issue that call only for the value estimation. I am not sure whether vLLM functions properly when the generation length is set to 0.

platoonpluto commented 6 months ago

OK, thanks!