“When putting reinforcement learning in the realm of large language models, the environment distribution and the output distribution of the policy model π^RL(y|x) are identical. It means that the distribution of the environment shifts as π^RL(y|x) is optimized.” I don't quite follow this passage. In RLHF, the SFT model is the agent, so shouldn't the environment refer to the reward model? Here, "environment distribution" seems to mean the distribution of the responses generated by the SFT model (if I'm not misunderstanding), so shouldn't that be called the action distribution instead?