I have a question about the necessity of ValueLearner. Given that we are training on the same offline dataset, why not take the return directly from the dataset when computing the advantage? What is the benefit of training a separate value model to predict the return?