MARIO-Math-Reasoning / Super_MARIO

About Training data generation. #16

Closed George-Chia closed 2 weeks ago

George-Chia commented 1 month ago

# llm run for step evaluation

prompts, prompts_span = self.value_preprocess(valid_solvers)

After this line executes, `prompts` is always `[]` and `prompts_span` is a list of all zeros, which makes the step training cycle break.

lovecambi commented 1 month ago

Which script are you running?

eurekayuan commented 3 weeks ago

@George-Chia You may want to set `update_leaf_value=True`. Otherwise the code does not evaluate $Q(s_t, a_t)$, which is why your `prompts` is always `[]`. Also, even when this flag is `False`, the training cycle does not break: the prompts are loaded again in the next iteration.
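
To illustrate why a disabled flag could produce exactly the symptom reported (an empty prompt list plus an all-zeros span list), here is a minimal toy sketch. It is not the repository's code: `ToySolver`, `leaves_to_evaluate`, and this stand-in `value_preprocess` are hypothetical, and the assumption that `prompts_span` holds cumulative offsets is a guess made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToySolver:
    """Hypothetical stand-in for one search tree; not the repository's solver."""
    pending_leaves: List[str] = field(default_factory=list)
    update_leaf_value: bool = False  # mirrors the flag named in the thread

    def leaves_to_evaluate(self) -> List[str]:
        # Assumption: leaves are queued for value estimation only when the flag is on.
        return list(self.pending_leaves) if self.update_leaf_value else []


def value_preprocess(solvers: List[ToySolver]) -> Tuple[List[str], List[int]]:
    """Flatten per-solver prompts and record cumulative offsets (assumed layout)."""
    prompts: List[str] = []
    prompts_span: List[int] = [0]
    for solver in solvers:
        new_prompts = solver.leaves_to_evaluate()
        prompts.extend(new_prompts)
        prompts_span.append(prompts_span[-1] + len(new_prompts))
    return prompts, prompts_span


# Flag off: nothing is queued, so the spans never advance.
solvers_off = [ToySolver(["s1", "s2"]), ToySolver(["s3"])]
print(value_preprocess(solvers_off))  # ([], [0, 0, 0])

# Flag on: each solver contributes its leaves and the spans grow.
solvers_on = [
    ToySolver(["s1", "s2"], update_leaf_value=True),
    ToySolver(["s3"], update_leaf_value=True),
]
print(value_preprocess(solvers_on))  # (['s1', 's2', 's3'], [0, 2, 3])
```

In this toy version an empty `prompts` is a normal outcome rather than an error, which matches the point above that the loop simply picks up prompts again on the next iteration.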