eureka-research / Eureka

Official Repository for "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
https://eureka-research.github.io/
MIT License

Question about "Reward Generation Problem". Having env ground truth for RL seems weird. #14

Closed CeHao1 closed 8 months ago

CeHao1 commented 8 months ago

Hi, my name is Ce Hao. I really appreciate Eureka and its ability to generate reward functions via an LLM. You formulate this problem as a "reward generation problem", and I agree with that framing.

However, I have a question about this formulation: you assume access to the source code of the environment and then apply RL algorithms. Usually, the ground truth of the environment dynamics is not provided to RL methods; the environment model is just a black box to the RL agent. That is why we rely on RL to explore the environment.

So in a real-world environment, where access to the source code is definitely infeasible, how can we use Eureka to generate a suitable reward function? Maybe this is a good future direction. Thanks.

CeHao1 commented 8 months ago

From another perspective, the LLM is extracting information and knowledge from the source code to finish the task.

A very intuitive alternative is to ask the LLM to directly plan a successful trajectory based on the source code. The dense reward function then simply minimizes the deviation between the reference successful trajectory and the current state; in other words, RL acts as a tracking controller.

I admit that generating a dense reward is not identical to generating a dense successful trajectory, but from the LLM's perspective they might be similar. That is why I think providing the ground truth of the environment and then applying RL is weird.
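For concreteness, here is a minimal sketch of what such a tracking-style dense reward could look like (all names here are hypothetical illustrations, not from the Eureka codebase):

```python
import numpy as np

def tracking_reward(state: np.ndarray,
                    reference_trajectory: np.ndarray,
                    t: int,
                    scale: float = 1.0) -> float:
    """Dense reward as the negative distance to the reference state at step t.

    `reference_trajectory` is assumed to be a (T, state_dim) array that an LLM
    planned from the environment source code; the RL policy then behaves like
    a tracking controller that follows it.
    """
    # Clamp the index so episodes longer than the reference stay well-defined.
    ref_state = reference_trajectory[min(t, len(reference_trajectory) - 1)]
    deviation = np.linalg.norm(state - ref_state)
    return -scale * deviation
```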

fangyuan-ksgk commented 7 months ago

I am not sure I have understood the code correctly, but my current belief is that Eureka assumes only partial access to the source code. In eureka.py, only the observation code is put into the prompt: the env_obs.py file, which is minimal and mainly describes the non-visual observations. The main env.py file is never included in the prompt; it is only used to insert the GPT-generated reward function back and run the RL experiments. In this sense I would argue that the requirement is still reasonable. (Which also saves some cash compared to using the GPT-4V API to directly observe the visual performance, at least :->)
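If that reading is right, a rough sketch of the partial-access setup could look like the following (the file names follow the ones mentioned above, but the prompt text and helper functions are invented for illustration and are not the actual Eureka implementation):

```python
from pathlib import Path

def build_reward_prompt(obs_code_path: str = "env_obs.py") -> str:
    """Put only the observation code into the LLM prompt."""
    obs_code = Path(obs_code_path).read_text()
    return (
        "Here is the observation code of the environment:\n\n"
        f"{obs_code}\n\n"
        "Write a dense reward function that uses only these observations."
    )

def insert_reward_into_env(env_path: str,
                           generated_reward_code: str,
                           marker: str = "# REWARD_PLACEHOLDER") -> str:
    """Splice the GPT-generated reward back into the full env file,
    which is never shown to the LLM, before running the RL experiments."""
    env_code = Path(env_path).read_text()
    return env_code.replace(marker, generated_reward_code)
```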