Closed · platoonpluto closed 5 months ago
Hi @lovecambi, is there redundant value estimation during MCTS?
During the first call for LLM generation, the LLM generates a response and its corresponding value estimation.
During the second call for LLM generation, the LLM generates a response (only one token??) and its corresponding value.
Is there anything wrong?
In our paper, we use the last token of the current step to predict the value. To obtain the last token,
- run the LLM to generate the text + code
- run the code in the interpreter to obtain the observation, and append the observation to the text + code to form the entire current step
- run the LLM to predict the value (you are correct, only one token)
In fact, the value is an evaluation of the current input $s_t$. Therefore, the second call merely assesses the value of the entire reasoning step $s_t + a_t$. Hence, for the second call, we set the number of generated tokens to 1, because our goal is just to estimate the value of $s_t + a_t$, not to generate a new reasoning step.
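For concreteness, here is a minimal sketch of the two calls described above. The helper names (`call_llm`, `run_code`) and the convention that the model returns a value at the last token are assumptions for illustration, not the repo's actual API.

```python
# Hypothetical helpers standing in for the policy/value model and the code
# interpreter; only the control flow of the two LLM calls is illustrated.
def call_llm(prompt: str, max_new_tokens: int):
    """Assumed to return (generated_text, value_predicted_at_last_token)."""
    raise NotImplementedError

def run_code(step_text: str) -> str:
    """Assumed to execute the code in the step and return the observation."""
    raise NotImplementedError

def expand_and_evaluate(s_t: str):
    # First call: generate a full reasoning step (text + code) from state s_t.
    a_t, _ = call_llm(s_t, max_new_tokens=512)

    # Run the code, then append the observation to form the entire current step.
    observation = run_code(a_t)
    full_step = s_t + a_t + observation

    # Second call: only one token is generated; it is discarded, and the value
    # predicted at the last token of s_t + a_t is what we keep.
    _, value = call_llm(full_step, max_new_tokens=1)
    return full_step, value
```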
I mean, you don't need to generate even one token; a simple forward pass of $s_t + a_t$ could return the value estimation as well. BTW, the one token generated by the second LLM call is not used, right?
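Something like the following is what I have in mind: a single forward pass over $s_t + a_t$ with a value head on the last hidden state, and no sampling at all. The model name and `value_head` here are placeholders for illustration, not the actual checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with a trained value head would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
value_head = torch.nn.Linear(model.config.hidden_size, 1)  # assumed value head

def estimate_value(s_t_plus_a_t: str) -> float:
    inputs = tokenizer(s_t_plus_a_t, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Hidden state of the last token of s_t + a_t; no token is generated.
    last_hidden = outputs.hidden_states[-1][:, -1, :]
    return value_head(last_hidden).item()
```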
You are right, the token produced by the second LLM call is not used; we only make that call for the value estimation. I am not sure whether vLLM functions properly when the generation length is set to 0.
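For reference, this is roughly what the second call looks like with the public vLLM API. The value itself would have to come from a value head wired into the engine (not shown, since vanilla vLLM does not expose one), and the sampled token is simply dropped.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/policy-value-model")  # placeholder model path
# max_tokens=1 because generating zero tokens may not be supported.
params = SamplingParams(max_tokens=1, temperature=0.0)

outputs = llm.generate(["<s_t + a_t goes here>"], params)
unused_token = outputs[0].outputs[0].text  # discarded; only the value matters
# The value for s_t + a_t would be read from the value head here, assuming a
# customized vLLM build that returns it alongside the generation output.
```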
OK, thanks!