eureka-research / Eureka

Official Repository for "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
https://eureka-research.github.io/
MIT License

Bad performance on experiment reproduction #51

Open JasonLiu324 opened 2 days ago

JasonLiu324 commented 2 days ago

Hi, I have successfully run the whole project and tested it on several gym tasks, such as FrankaCabinet and Humanoid. However, the experiment results are not as good as I expected. What could be the reason?

My workstation environment is: Ubuntu 22.04, RTX 4080 GPU (12 GB), 16 GB RAM.

And the command lines I have used are:
python eureka.py env=FrankaCabinet sample=5 iteration=5 model_name=gpt-4
python eureka.py env=Anymal sample=5 iteration=5 model_name=gpt-4

The final success rate is only approximately 0.1. Could this be related to the number of samples? My workstation can only run 5 samples in parallel due to GPU memory limits.

JasonLiu324 commented 1 day ago

And the weird thing is that the reward reflection during the running process is almost the same:

Iteration 0:
User Content: We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 1:
User Content: We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 2:
User Content: We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

Iteration 3:
User Content: We trained a RL policy using the provided reward function code and tracked the values of the individual components in the reward function as well as global policy metrics such as success rates and episode lengths after every 300 epochs and the maximum, mean, minimum values encountered:
distance_reward: ['0.79', '0.95', '0.91', '0.90', '0.90', '0.80', '0.86', '0.92', '0.92', '0.88'], Max: 0.98, Mean: 0.89, Min: 0.76
door_open_reward: ['0.00', '0.08', '0.18', '0.28', '0.29', '0.18', '0.15', '0.00', '0.16', '0.00'], Max: 0.32, Mean: 0.13, Min: 0.00
task_score: ['0.00', '0.00', '0.00', '0.02', '0.01', '0.03', '0.01', '0.00', '0.02', '0.00'], Max: 0.11, Mean: 0.01, Min: 0.00
episode_lengths: ['499.00', '359.18', '500.00', '495.78', '496.36', '493.24', '492.34', '499.69', '500.00', '500.00'], Max: 500.00, Mean: 490.73, Min: 230.97

The values are exactly the same across iterations, so I think something must be wrong with the training process.
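
To rule out a copy-paste or display issue, here is a minimal sketch one could use to confirm the reflections are byte-identical rather than merely similar. The file paths are hypothetical (not part of the Eureka repo); point them at wherever your run saved the per-iteration feedback text.

```python
# Sketch: check whether the reward reflections are byte-identical across iterations.
# The "eureka_outputs/iter*_reflection.txt" layout is an assumption for illustration.
import hashlib
from pathlib import Path

reflection_files = sorted(Path("eureka_outputs").glob("iter*_reflection.txt"))

digests = {}
for path in reflection_files:
    text = path.read_text()
    digests[path.name] = hashlib.sha256(text.encode()).hexdigest()

for name, digest in digests.items():
    print(f"{name}: {digest[:12]}")

if len(set(digests.values())) == 1:
    print("All reflections are byte-identical: the feedback is not being "
          "regenerated from fresh training results.")
else:
    print("Reflections differ; the repetition may just be similar values.")
```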