I've successfully set up the environment following the repository instructions, but I'm having trouble reproducing the results reported in the paper: the policy learns nothing when trained with your ClipReward.
CLIP model used: ViT-g-14/laion2b_s34b_b88k
I've tested with two environments:
- Humanoid-v4
- CartPole-v1
Configuration used for Humanoid-v4:
```yaml
env_name: Humanoid-v4 # RL environment name
base_path: runs/training # Base path to save logs and checkpoints
seed: 42 # Seed for reproducibility
description: Humanoid training using CLIP reward
tags: # Wandb tags
  - training
  - humanoid
  - CLIP
reward:
  name: clip
  pretrained_model: ViT-g-14/laion2b_s34b_b88k # CLIP model name
  # CLIP batch size per synchronous inference step.
  # Batch size must be divisible by n_workers (GPU count)
  # so that it can be shared among workers, and must be a divisor
  # of n_envs * episode_length so that all batches can be of the
  # same size (no support for variable batch size as of now.)
  batch_size: 1600
  alpha: 0.5 # Alpha value of Baseline CLIP (CO-RELATE)
  target_prompts: # Description of the goal state
    - a humanoid robot running
  baseline_prompts: # Description of the environment
    - a humanoid robot
  # Path to pre-saved model weights. When executing multiple runs,
  # mount a volume to this path to avoid downloading the model
  # weights multiple times.
  cache_dir: .cache
rl:
  policy_name: MlpPolicy
  n_steps: 100000 # Total number of simulation steps to be collected
  n_envs_per_worker: 2 # Number of environments per worker (GPU)
  episode_length: 200 # Desired episode length
  learning_starts: 100 # Number of env steps to collect before training
  train_freq: 200 # Number of collected env steps between training iterations
  batch_size: 64 # SAC buffer sample size per gradient step
  gradient_steps: 1 # Number of gradient steps per training iteration
  tau: 0.005 # SAC target network update rate
  gamma: 0.99 # SAC discount factor
  learning_rate: 3e-4 # SAC optimizer learning rate
logging:
  checkpoint_freq: 10000 # Number of env steps between checkpoints
  video_freq: 10000 # Number of env steps between videos
```
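For reference, this is how I'm interpreting the `batch_size` comment above. The snippet below is only my own arithmetic check, not code from the repo; `n_workers` (the GPU count) and all variable names are assumptions on my side:

```python
# My own sanity check of the batch_size constraints described in the config
# comment above; "n_workers" is an assumption (the GPU count of my machine).
n_workers = 4             # assumed GPU / reward-worker count
n_envs_per_worker = 2     # from the config above
episode_length = 200      # from the config above
clip_batch_size = 1600    # reward.batch_size from the config above

n_envs = n_workers * n_envs_per_worker
frames_per_iteration = n_envs * episode_length  # frames sent to CLIP per inference step

# batch must split evenly across workers ...
assert clip_batch_size % n_workers == 0
# ... and must evenly divide the frames collected per inference step
assert frames_per_iteration % clip_batch_size == 0
print(f"{frames_per_iteration // clip_batch_size} CLIP batch(es) per inference step")
```

With fewer than 4 GPUs these checks fail for my settings, so I may already be misconfiguring `batch_size`; if I've misread the constraint (e.g. if `n_envs` is counted per worker rather than globally), please correct me.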
Configuration used for CartPole-v1:
```yaml
env_name: CartPole-v1 # RL environment name
base_path: runs/training # Base path to save logs and checkpoints
seed: 42 # Seed for reproducibility
description: CartPole training using CLIP reward
tags: # Wandb tags
  - training
  - cartpole
  - CLIP
reward:
  name: clip
  pretrained_model: ViT-g-14/laion2b_s34b_b88k # CLIP model name
  # pretrained_model: CLIP-ViT-bigG-14-laion2B-39B-b160k # CLIP model name
  # CLIP batch size per synchronous inference step.
  # Batch size must be divisible by n_workers (GPU count)
  # so that it can be shared among workers, and must be a divisor
  # of n_envs * episode_length so that all batches can be of the
  # same size (no support for variable batch size as of now.)
  batch_size: 1600
  alpha: 0.5 # Alpha value of Baseline CLIP (CO-RELATE)
  target_prompts: # Description of the goal state
    - a pole vertically upright on top of the cart
  baseline_prompts: # Description of the environment
    - a pole
  # Path to pre-saved model weights. When executing multiple runs,
  # mount a volume to this path to avoid downloading the model
  # weights multiple times.
  cache_dir: .cache
rl:
  policy_name: MlpPolicy
  n_steps: 100000 # Total number of simulation steps to be collected
  n_envs_per_worker: 2 # Number of environments per worker (GPU)
  episode_length: 200 # Desired episode length
  learning_starts: 100 # Number of env steps to collect before training
  train_freq: 200 # Number of collected env steps between training iterations
  batch_size: 64 # SAC buffer sample size per gradient step
  gradient_steps: 1 # Number of gradient steps per training iteration
  tau: 0.005 # SAC target network update rate
  gamma: 0.99 # SAC discount factor
  learning_rate: 3e-4 # SAC optimizer learning rate
logging:
  checkpoint_freq: 10000 # Number of env steps between checkpoints
  video_freq: 10000 # Number of env steps between videos
```
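Independently of training, I also ran the standalone sanity check below. It's a minimal sketch under my own assumptions (plain `open_clip` + `gymnasium`, not your ClipReward class or the alpha/baseline regularization), just to confirm the checkpoint loads and gives sensible similarities for the CartPole prompts from the config above:

```python
# My own debugging script, not from the repo: load the same open_clip
# checkpoint and score a rendered CartPole frame against the target and
# baseline prompts, before any RL is involved.
import gymnasium as gym
import open_clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14", pretrained="laion2b_s34b_b88k", cache_dir=".cache"
)
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-g-14")

env = gym.make("CartPole-v1", render_mode="rgb_array")
env.reset(seed=42)
frame = Image.fromarray(env.render())  # initial (upright) state

prompts = [
    "a pole vertically upright on top of the cart",  # target prompt
    "a pole",                                        # baseline prompt
]
with torch.no_grad():
    image_emb = model.encode_image(preprocess(frame).unsqueeze(0).to(device))
    text_emb = model.encode_text(tokenizer(prompts).to(device))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)

print(f"target similarity:   {sims[0].item():.4f}")
print(f"baseline similarity: {sims[1].item():.4f}")
```

I'm happy to share the outputs of this if it helps with the diagnosis, and if the repo already ships an equivalent debugging script, a pointer to it would be appreciated.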
Could you please provide some example configs so that I can reproduce your results?