eureka-research / Eureka

Official Repository for "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
https://eureka-research.github.io/

Request for Information on Prompts Used in Curriculum Learning for Pen Spinning Task #29

Open · Akihisa-Watanabe opened this issue 7 months ago

Akihisa-Watanabe commented 7 months ago

This is very interesting research; thank you for making it public. I ran `python eureka.py` with the following settings and carried out training and testing, obtaining the results shown in the figure and video below.
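For reference, an invocation that passes the same settings explicitly as Hydra overrides (mirroring the config pasted further down) would look roughly like this:

```bash
# Equivalent launch with the key settings given as Hydra overrides
# (values mirror the config below).
python eureka.py env=shadow_hand model=gpt-4-0314 temperature=1.0 \
    iteration=5 sample=16 max_iterations=3000 num_eval=5
```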

Additionally, I am attempting to replicate the impressive pen-spinning video you have shared. Your paper and project page mention that curriculum learning was used for this task:

> The pen spinning task requires a Shadow Hand to continuously rotate a pen to achieve some pre-defined spinning patterns for as many cycles as possible. We solve this task by (1) instructing Eureka to generate a reward function for re-orienting the pen to random target configurations, and then (2) fine-tuning this pre-trained policy using the Eureka reward to reach the desired sequence of pen-spinning configurations. As shown, Eureka fine-tuning quickly adapts the policy to successfully spin the pen for many cycles in a row. In contrast, neither pre-trained or learning-from-scratch policies can complete even a single cycle.
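To make the question concrete, my current guess is that the two stages are two separate runs that differ mainly in the task `description` given to Eureka, with stage (2) warm-started from the stage (1) policy. The descriptions below are purely my own assumptions, not the prompts from the paper:

```yaml
# My own guesses at the two stage descriptions -- NOT the authors' actual prompts.
# Stage (1): pre-training -- re-orient the pen to random target configurations
description: to make the shadow hand re-orient the pen to random target orientations
---
# Stage (2): fine-tuning -- follow the pre-defined sequence of spinning configurations
description: to make the shadow hand continuously spin the pen through the desired sequence of target orientations
```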

If you don't mind, could you please share the prompts you used during the (1) pre-training and (2) fine-tuning stages of Curriculum Learning?

```yaml
defaults:
  - _self_
  - env: shadow_hand
  - override hydra/launcher: local
  - override hydra/output: local

hydra:
  job:
    chdir: True

# LLM parameters
model: gpt-4-0314  # LLM model (other options are gpt-4, gpt-4-0613, gpt-3.5-turbo-16k-0613)
temperature: 1.0
suffix: GPT  # suffix for generated files (indicates LLM model)

# Eureka parameters
iteration: 5 # how many iterations of Eureka to run
sample: 16 # number of Eureka samples to generate per iteration
max_iterations: 3000 # RL Policy training iterations (decrease this to make the feedback loop faster)
num_eval: 5 # number of evaluation episodes to run for the final reward
capture_video: False # whether to capture policy rollout videos

# Weights and Biases
use_wandb: False # whether to use wandb for logging
wandb_username: "" # wandb username if logging with wandb
wandb_project: "" # wandb project if logging with wandb

task: ShadowHand
env_name: shadow_hand
description: to make the shadow hand spin the object to a target orientation
```
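Related to stage (2): I am assuming the fine-tuning run simply warm-starts RL training from the stage (1) checkpoint via the standard rl_games/IsaacGymEnvs `checkpoint=` override, roughly as sketched below (the checkpoint path is a placeholder of my own, not taken from the repo):

```bash
# Hypothetical fine-tuning launch (my assumption): resume RL training from the
# stage-(1) policy checkpoint using the standard rl_games / IsaacGymEnvs override.
python train.py task=ShadowHand \
    checkpoint=runs/ShadowHandPretrain/nn/ShadowHand.pth \
    max_iterations=3000
```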


https://github.com/eureka-research/Eureka/assets/57210762/e2f89c58-f996-40f4-98fe-b4c66842f2c8