Thanks for your question. I believe this method can be applied to other datasets. The principle is to generate trajectories, roll them out in the environment, evaluate the results (rewards) of those trajectories, and update the reward parameters. So the environment's role is to simulate the outcomes of the generated trajectories and return their features.
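The loop described above can be sketched roughly as follows. This is a minimal illustration under assumptions of my own, not the actual implementation: the toy environment, the linear reward, and the expert feature statistic (`expert_feature_sum`) are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Hypothetical 1-D environment: the state shifts by the chosen action,
    and the visited states serve as the trajectory features."""
    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        return self.state  # returned state doubles as a trajectory feature

def generate_trajectory(env, theta, horizon=5):
    """Roll out one trajectory; actions are biased by the reward parameter."""
    env.reset()
    features = []
    for _ in range(horizon):
        action = np.sign(theta) * 0.1 + rng.normal(scale=0.01)
        features.append(env.step(action))
    return np.array(features)

def reward(features, theta):
    """Assumed linear reward on the summed trajectory features."""
    return theta * features.sum()

# The update loop: generate trajectories, roll them out in the environment,
# evaluate their features, and nudge the reward parameter so the generated
# feature expectations move toward an (assumed) expert statistic.
theta = 0.5
expert_feature_sum = 2.0  # placeholder target statistic
for _ in range(50):
    feats = generate_trajectory(ToyEnv(), theta)
    theta += 0.05 * (expert_feature_sum - feats.sum())
```

In practice you would swap `ToyEnv` for a simulator of your dataset's dynamics and replace the linear reward and gradient step with whatever the method actually prescribes; the sketch only shows the shape of the generate / roll out / evaluate / update cycle.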